Or a short story about how we create and train new machine learning algorithms for social media data analysis.
At YouScan, we work with data and AI technologies: we search for and analyze social media trends with the help of machine learning algorithms. Extracting valuable insights from the data requires some preliminary work, which is usually done by a data scientist. We decided to lift the curtain on what they do, how they work, and what a regular day in the life of a YouScan data scientist looks like.
As you can tell by the title, the main tool of the trade for data scientists is data. But not everyone knows just how much time it takes to collect and sort through primary data sources. Data scientists spend up to 80% of their resources scrubbing so-called "raw data" - piles of unstructured information that have not yet been processed or analyzed.
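To give a feel for what "scrubbing" can involve, here is a minimal sketch in Python. It is an illustration under our own assumptions, not YouScan's actual pipeline: the function names `clean_post` and `deduplicate`, the cleaning rules, and the sample posts are all hypothetical.

```python
import html
import re
import unicodedata

# Hypothetical cleaning rules, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_post(raw: str) -> str:
    """Normalize one raw social media post into plain text."""
    text = html.unescape(raw)                   # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify look-alike characters
    text = URL_RE.sub(" ", text)                # drop links
    text = MENTION_RE.sub(" ", text)            # drop @user mentions
    text = WHITESPACE_RE.sub(" ", text)         # collapse whitespace runs
    return text.strip()


def deduplicate(posts):
    """Keep the first copy of each non-empty post, ignoring letter case."""
    seen = set()
    unique = []
    for post in posts:
        key = post.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


raw_posts = [
    "Love my new phone!!   https://example.com/p/1 @shop",
    "love my new phone!!",
    "Fish &amp; chips tonight",
    "   ",
]
cleaned = deduplicate([clean_post(p) for p in raw_posts])
print(cleaned)
```

In this toy run, four raw posts shrink to two usable ones. Real data sets need many more rules than this, which is part of why cleaning takes so much of the time.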
Contemporary artificial intelligence is often called "weak AI," since it cannot adequately react to unexpected situations, and demands dedicated cycles of training for each distinct task.
Historically, all AI "winters" (essentially, development plateaus) have followed disappointment over unreasonably high expectations from industry. For example, after a strong start in AI development in the 1960s, many researchers believed that general-intelligence robots could be developed within decades.
However, today, artificial intelligence complements the potential of human cognition instead of replacing it, through a set of algorithms specifically designed to help solve complex problems. The training of such algorithms is usually done by data scientists. Sometimes the solutions are straightforward, where the data is already sorted by categories - for example, by the audience's gender and age. However, more often, the algorithms have to deal with totally unorganized raw numbers. Data scientists have to find patterns in the raw data, come up with hypotheses and test the numbers against these hypotheses.
Six people work in our data science department. All of them are math nerds, of course, because to succeed in this field, you need at least basic knowledge of algorithms, probability theory, statistics, and quantitative methods, plus some Python coding skills. We hired our data scientists rather spontaneously - meeting them at conferences, in professional chatrooms, or through recommendations - but always with an eye to their passion for this kind of work. All of them ran GitHub projects, actively participated in data science forum discussions, and took part in various hackathons, conferences, and competitions.
Evgeny Terpil is the head of the data science department. He's been working as a data scientist for over three years, and he has already completed dozens of models that solve social media analytics problems. Several years ago, Evgeny graduated from the Kiev Polytechnic Institute and started working as a front-end developer. But data science has been his passion since his student days, and after getting hired at YouScan, Evgeny decided to change his specialization completely. By the way, he thinks that most data scientists start out as developers.
"You can be a data scientist without formal post-secondary training, but you do have to possess certain relevant skills and knowledge of computer science and math, and really live and breathe data science. So the knowledge of computer science fundamentals, in my opinion, gives the candidate many advantages. Those who seek professional development can take specialized online courses - for example, on Coursera - or participate in competitions on Kaggle, where you can test your code and learn independently," Evgeny says. "This is a great option for finding your place on the job market, where there are thousands of DS job postings around Russia and the Russian Commonwealth."
A data scientist's job can be very creative. One simply cannot predict all the possible quirks and inconsistencies of a data set. Data scientists often have to reevaluate models, conduct numerous experiments and launch different versions of an algorithm to compare results.
One of the latest features our data scientists developed this way is logo recognition. Now we can find brand mentions on social media even if the brand isn't named in the accompanying text - it's enough to have the brand's logo visible in the user's photos.
One of the things our data scientists are investigating is sentiment analysis of social media posts. Sentiment, or the verbal expression of one's opinion about something, comes in three types: positive, neutral, or negative. The problem with the kind of content we work with is that "live" social media content varies widely. Expressions of positive or negative feelings can be subtle, and when people use sarcasm or irony, things get even more complicated.
Furthermore, it's important to analyze the context of the post. For example, a post saying that a cleaning product leaves your dishes clean and shiny is a good thing, while a similar post about Coca-Cola is not so good - it's unlikely that the beverage company wants to be associated with household cleaners. Such subtleties can be addressed by using deep neural networks with recurrent memory layers. In these cases, we develop algorithms designed to detect context and identify discussion subjects and topics.
Our new model, based on recurrent neural networks, is better at understanding the fine details of a mention, which has helped us significantly reduce the error rate of sentiment analysis. In these cases, data scientists are responsible for creating a rich training environment that can help the algorithms learn the subtleties of social media conversations. We are training the AI to detect sentiment for several brand categories, taking into account the specifics of each one. This helps us achieve greater accuracy in sentiment detection.
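To make the brand-category point concrete, here is a deliberately simple Python sketch. It is not the recurrent neural network described above: it swaps in a plain phrase lookup, and the categories, phrases, and weights are hypothetical values chosen only to show how the same words can carry opposite sentiment for different kinds of brands.

```python
# Toy illustration only: a phrase lookup, not a neural network.
# All categories, phrases, and weights below are hypothetical.
CATEGORY_POLARITY = {
    "household_cleaner": {"clean and shiny": 1, "left streaks": -1},
    "beverage": {"clean and shiny": -1, "refreshing": 1},
}


def score_mention(text: str, category: str) -> str:
    """Sum the weights of known phrases and map the total to a label."""
    phrases = CATEGORY_POLARITY.get(category, {})
    lowered = text.lower()
    total = sum(weight for phrase, weight in phrases.items() if phrase in lowered)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


post = "It leaves my dishes clean and shiny"
print(score_mention(post, "household_cleaner"))  # positive
print(score_mention(post, "beverage"))           # negative
```

A lookup like this breaks down quickly on sarcasm, irony, or phrases it has never seen, which is exactly the gap a trained model is meant to close.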
Data scientists at YouScan start their day with coffee and some light reading - usually English-language publications, where the latest trends and news tend to appear first. For example, our colleagues find lots of neat information on Medium and on arxiv.org, an open-access archive of scientific preprints maintained by Cornell University. The latest research in data science and machine learning is usually published there.
The Russian community of data scientists usually hangs out in the ODS (Open Data Science) Slack channel. It currently has about 12,000 members, all helping each other. Our team also participates in the VK Deep Learning group and actively comments on Habrahabr posts in their areas of expertise.
The YouScan team at the #AIUkraine2016 conference
The 5th Moscow Data Fest. Data Fest is the biggest conference that connects researchers, engineers and developers working in data science, machine learning and artificial intelligence fields.
Interestingly, the majority of scientific articles from subject matter experts are openly available. For example, Google and Facebook provide detailed explanations of their new algorithms and make special announcements about their latest AI training models at various scientific conferences. This means we can both use the technology developed by the industry leaders and customize it for our own tasks, like social media analysis. By the way, we are able to provide our Visual Insights and logo recognition services, whose quality rivals that of services available on European and North American markets, at a much lower cost.
Content analysis tasks get more complicated each day. For example, developers are now working on teaching algorithms to analyze various video formats - from Facebook and VK Stories to entire YouTube channels. Videos are essentially sets of images, which are already being analyzed successfully by our algorithms. But now, there are many, many more of them to analyze.
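One common way to keep that volume manageable is to analyze only a sample of frames rather than every one. The Python sketch below shows the idea under our own assumptions; the function name `sample_frame_indices` and the sampling interval are hypothetical and do not describe how YouScan processes video.

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0):
    """Return indices of frames spaced roughly `every_seconds` apart."""
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    # Number of frames to skip between samples, never less than one.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 frames per second, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, every_seconds=2.0)
print(len(indices), "of 300 frames selected:", indices)
```

Each selected frame can then be passed to the same image analysis that already works on photos. The trade-off is that anything appearing only between samples is missed.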
So, do you feel like you know more about what data scientists do? There's more to learn about this and other fascinating topics on our blog. Follow us on Facebook, Twitter and LinkedIn to stay up to date!