Does social media contain enough information to predict real-life metrics? Various papers have claimed that social media discourse can predict stock market indices. One of my projects at Idibon was building a custom NLP model on Twitter data about cars to model intent to purchase. In short, social media discourse definitely has predictive power, but the data disambiguation process and model quality make all the difference.

Marketers need to understand people’s awareness and opinions of their brands on social media. Ideally, they will be able to predict sales based on this information and use this to inform marketing campaigns and business decisions.

Our solution is to move away from general models, which are often too broad in definition to pick up brand-specific idiosyncrasies, and build a custom model. A general sentiment model, which gives the overall mood of a social media post as positive, neutral, or negative, can be inadequate and even misleading in these pursuits. After all, many posts of high informative value about customer feelings and loyalty can carry a neutral sentiment or contain multiple conflicting sentiments.

There are a few fundamental questions that a digital marketer must answer. Is it true that if more people are happy about a product on social media, then more people will buy it? Even if we can be sure the answer is affirmative, to what extent does positive attitude on social media affect sales? Can one justify spending X amount of resources on social media engagement if the return on this investment cannot be quantified?

Here at Idibon, we have the means to test this hypothesis and offer a statistically rigorous answer to both questions in a case study of the Chevy Impala. We collected tweets about Chevy Impalas spanning 34 months, from July 2012 until April 2015. We then built a custom sentiment model for car-directed positive sentiment, in this case sentiment directed specifically at the Chevy Impala. This means that we taught our model to treat “I’m happy, and now I’m buying an Impala” as irrelevant, since the positive sentiment is not directed toward the car, and “I’m happy because I’m buying an Impala” as relevant.


We found that our custom, car-directed sentiment model behaves differently from a general sentiment model. Positive sentiment toward Chevy Impalas declines steadily starting in February 2014, while the trend in overall positive tweets containing the word ‘impala’ remains more or less unchanged. Indeed, February 2014 was the beginning of a particularly bad year for General Motors, the owner of the Chevrolet brand, which had 68 distinct recall incidents. The most publicized of these affected Chevy Impalas and involved an ignition switch flaw that could shut off safety features, power steering, and braking while the vehicle was being driven (see article for details).

The real question is, can Twitter sentiment be used to predict actual sales?

Recall that we started by posing two questions: (1) Does Twitter sentiment (general or custom) affect sales? and (2) If yes, then how big is the effect?

For our custom car-directed sentiment model, the answer is YES, and the effect is considerable. For the general Twitter sentiment model, the answer is NO.

The number of Impala-specific positive tweets has a visible relationship with monthly sales (right plot below). The correlation coefficient [1] of 0.42 means that an increase of 1 standard deviation in the number of car-directed positive tweets is associated, on average, with an increase of 0.42 standard deviations in car sales. In other words, an increase of 1500 Impala-directed positive tweets corresponds to an increase of 1100 cars in monthly sales.
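The step from "standard deviations" to "tweets and cars" in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the analysis code used for the case study: the function names are made up for this example, and the monthly figures are invented. It relies on the standard identity that the least-squares slope of y on x equals the correlation coefficient multiplied by the ratio of the two standard deviations.

```python
import math


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def slope_in_raw_units(xs, ys):
    """Least-squares slope of y on x: r rescaled by the two standard deviations."""
    return pearson_r(xs, ys) * sample_std(ys) / sample_std(xs)


# Hypothetical monthly figures, invented purely for illustration.
positive_tweets = [900, 1400, 2100, 2600, 3300, 4100]
monthly_sales = [11800, 14900, 12700, 16100, 15200, 17900]

r = pearson_r(positive_tweets, monthly_sales)
slope = slope_in_raw_units(positive_tweets, monthly_sales)
print(f"r = {r:.2f}")
print(f"extra cars per extra positive tweet = {slope:.2f}")
```

If the 1500-tweet figure is read as one standard deviation of the monthly tweet counts (an assumption; the post does not say so explicitly), then the quoted numbers imply a standard deviation of monthly sales of roughly 1100 / 0.42, or about 2600 cars.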

This relationship is highly significant, with a p-value of 0.007, which means that a correlation this strong would be very unlikely to appear by chance if there were no underlying relationship. In contrast, the number of general positive tweets about Impalas and monthly sales (left plot below) have no visible relationship. This observation is confirmed by the small correlation coefficient (0.098), which is also not statistically significant (the p-value is 0.56).
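One way to build intuition for what such a p-value measures is a permutation test: shuffle the sales figures many times, recompute the correlation each time, and count how often a shuffled series correlates with the tweet counts at least as strongly as the real one. The sketch below is an illustration only. It is not how the p-values quoted above were obtained, the function names are hypothetical, and the data are invented.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation.

    Shuffling ys breaks any real link between the two series, so the
    shuffled correlations show what chance alone would produce.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical monthly figures, invented purely for illustration.
positive_tweets = [900, 1400, 2100, 2600, 3300, 4100, 3800, 2900]
monthly_sales = [11800, 14900, 12700, 16100, 15200, 17900, 16800, 13900]

p_value = permutation_p_value(positive_tweets, monthly_sales)
print(f"permutation p-value = {p_value:.4f}")
```

A small p-value says that few shuffled series matched the observed correlation, which is the sense in which a result is "significant".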

LEFT: Normalized general positive sentiment vs. normalized monthly sales of Chevy Impalas. RIGHT: Normalized car-directed positive sentiment vs. normalized monthly sales of Chevy Impalas. The customized model (right) predicts sales, while the generic one doesn’t.

In summary, in order to get insights into customers’ feelings and predict sales from social media engagement, it is necessary to use a custom sentiment model that can answer the question, “What is the author’s sentiment toward the product or brand?” instead of a generic sentiment model that will only be able to tell the overall sentiment of the tweet.


[1] We used Pearson’s correlation coefficient, whose significance test assumes that the variables are normally distributed. We also computed Kendall’s tau correlation measure, which makes no assumptions about the distribution of the variables. The resulting coefficient is 0.28, which is significant at the 0.01 level.
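For readers unfamiliar with Kendall’s tau, here is a minimal sketch of the tau-b variant. It compares every pair of months and asks only whether tweets and sales moved in the same direction, which is why it does not depend on the shape of either distribution. The function name is hypothetical, the quadratic pairwise loop is chosen for clarity rather than speed, and the footnote does not state which variant of tau was used.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation, computed by comparing all pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx = x1 - x2
        dy = y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1  # both series moved in the same direction
        elif dx * dy < 0:
            discordant += 1  # the series moved in opposite directions
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denominator = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one series is constant")
    return (concordant - discordant) / denominator


# Hypothetical monthly figures, invented purely for illustration.
positive_tweets = [900, 1400, 2100, 2600, 3300, 4100]
monthly_sales = [11800, 14900, 12700, 16100, 15200, 17900]

tau = kendall_tau_b(positive_tweets, monthly_sales)
print(f"Kendall's tau-b = {tau:.2f}")
```

Because only the ordering of values matters, applying any strictly increasing transformation to either series (a logarithm of the sales figures, for example) leaves tau unchanged, whereas Pearson’s coefficient would generally change.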