This post covers how to extract data from Twitter using custom components in Talend Open Studio, along with a simple method for performing sentiment analysis on that Twitter data.

What’s the point?

Before discussing how we extract Twitter data and perform sentiment analysis, let’s discuss why we might want to do this in the first place. In the past it has been difficult to gauge public opinion on a given subject, with opinion polls and surveys being the primary tools used. These methods are time-consuming, often costly and not without flaws.

Twitter has millions of users sending millions of tweets every day. These tweets are publicly available and free to access, which makes Twitter data an ideal candidate for cheap, fast and potentially very effective public opinion mining. This sort of insight into public opinion is valuable to almost any kind of organisation.

Getting Started

Before you can analyse Twitter data, you need to collect some. To do this I used custom components created by Gabriele Baldassarre, available for download here:

http://gabrielebaldassarre.com/talend/twitter-components-talend/

A tutorial on how to set up the components in Talend and how to register a Twitter application is available here:

http://gabrielebaldassarre.com/2014/07/20/getting-started-twitter-data-analysis-using-talend-open-studio/

Retrieving Tweets

I decided to use the tTwitterStreamInput component rather than the tTwitterInput component used in Gabriele’s post, because I ran into rate-limit issues when retrieving larger volumes of tweets with tTwitterInput. The tTwitterStreamInput component uses the Twitter streaming API rather than the REST API; you can find out more here:

https://dev.twitter.com/overview/documentation

Before we start retrieving Tweets using Talend we must decide what fields we want to pull through from the API and what query keywords to use (there are other settings available but these are sufficient for the purposes of this article).

[Figure: tTwitterStreamInput settings]

In the settings for the tTwitterStreamInput component I am pulling through five fields: Text, Location, Senders Screen Name, Creation date and Source. I have decided to do some sentiment analysis on the keyword “weather”, as you can see in the query keywords section of the settings.

Below is my Talend job to retrieve Twitter data. For help configuring the tTwitterOAuth components, please refer to the tutorial linked earlier in the article.

[Figure: Talend job for extracting Twitter data]

Deriving the sentiment of Tweets

The sentiment of each Tweet is derived by assigning each word in the Tweet a sentiment score and then summing those scores. We determine the score of each word using a word list that rates words from +5 (most positive) down to -5 (most negative). This method is basic and not without its flaws, but before we go into a brief analysis later on, let’s take a look at how it is implemented in Talend.
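To make the scoring rule concrete, here is a minimal Python sketch of the same idea, written outside Talend. The word list below is a made-up four-word excerpt with hypothetical scores, and lowercasing the text before lookup is my own assumption rather than a step in the Talend job.

```python
# Hypothetical excerpt of a sentiment word list (scores range from -5 to +5).
# These words and scores are illustrative only, not taken from the real list.
WORD_SCORES = {"happy": 3, "great": 3, "bad": -3, "awful": -3}

def tweet_sentiment(text, word_scores=WORD_SCORES):
    """Sum the scores of all words in the tweet; unlisted words count as 0."""
    # Lowercasing is an assumption so that "Happy" matches "happy".
    return sum(word_scores.get(word, 0) for word in text.lower().split())

print(tweet_sentiment("Happy with the great weather today"))
print(tweet_sentiment("awful weather again"))
```

With this toy list the first tweet scores 6 (two positive words) and the second scores -3.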

The word list used for this exercise can be found here (click here to download the word list).

Talend Job

[Figure: Talend job for deriving the sentiment of tweets]

tweets: Reads tweets from a .csv file
tReplace_1: Removes all punctuation and special characters from the tweet
tMap_1: Defines three columns: tweet id (numeric sequence starting at 1 with a step of 1), full tweet (tweet text) and tweet (tweet text)
tNormalize_2: Normalizes the tweet field using newline (\n) as the item separator
tNormalize_1: Normalizes the tweet field using white space (" ") as the item separator
tFilterRow_1: Filters out nulls from the tweet field
Sentiment: Reads the sentiment word list from a .csv file
tMap_2: Left outer join from tweet to sentiment text, pulling through the sentiment score for matches
tAggregateRow_1: Groups by id and full tweet, and sums the sentiment score
tFileOutputExcel_1: Saves the output in an Excel spreadsheet

There are of course several ways to achieve the same outcome, so I will explain the rationale behind the method used. The first step is to remove the punctuation and special characters from the tweet. This ensures that words in the tweet will match words in the word list (e.g. a comma following a word may cause “happy,” in the tweet to not match “happy” in the word list).
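As a rough illustration of this cleaning step, the Python sketch below strips punctuation with a regular expression. The exact pattern configured in tReplace_1 is not shown in this post, so the pattern here is an assumption and may keep or drop slightly different characters.

```python
import re

def strip_punctuation(text):
    """Delete every character that is not a word character or whitespace.

    Assumed pattern: \\w keeps letters, digits and underscores; \\s keeps
    spaces and newlines. Everything else (commas, '#', '@', '!') is removed.
    """
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("So happy, great #weather today!"))
```

After cleaning, “happy,” becomes “happy” and can match the word list entry. Note that apostrophes are removed too, so “don't” becomes “dont”.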

After we have mapped the three columns in tMap_1 we have an output that looks something like this:

[Figure: tMap_1 output]

Note that both the tweet and full_tweet fields contain the full tweet text at this stage. The reasons for this design will soon become clear.

The next step is to normalize the tweet field, or in other words, to split each word from the tweet into its own row. After the two tNormalize components the output looks like this:

[Figure: output after the two tNormalize components]

As the table above shows, each tweet now spans several rows, one for each word in that tweet. This data is now ready for comparison with the sentiment word list.
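The normalization can be sketched in Python as follows. This is only an approximation of what tMap_1, the two tNormalize components and tFilterRow_1 do together; the function name and the tuple layout are my own choices for illustration.

```python
def normalize(tweets):
    """Return one (id, full_tweet, word) row per word in each tweet."""
    rows = []
    # Number tweets from 1 with a step of 1, like the sequence in tMap_1.
    for tweet_id, full_tweet in enumerate(tweets, start=1):
        for line in full_tweet.split("\n"):      # split on newlines first
            for word in line.split(" "):         # then split on single spaces
                if word:                         # drop empty items
                    rows.append((tweet_id, full_tweet, word))
    return rows

for row in normalize(["nice weather", "rain again  today"]):
    print(row)
```

Keeping full_tweet on every row is what allows the words to be grouped back into their original tweet later on.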

In the second tMap we do a left outer join on tweet to sentiment_text in the sentiment word list and pull across the sentiment score to the output.

[Figure: join between tweet and sentiment_text in tMap_2]

After the second tMap we have the following output with a sentiment score for each word in the tweet that exists in the sentiment word list.

[Figure: tMap_2 output]
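For readers more familiar with code than with tMap, the join can be approximated in Python like this. The word list is represented as a plain dictionary, which is an assumption for the sketch; unmatched words keep their row and get None as the score, which is the defining behaviour of a left outer join.

```python
def left_outer_join(rows, word_scores):
    """Attach a sentiment score to every (id, full_tweet, word) row.

    Rows whose word is missing from word_scores are kept, with score None.
    """
    return [
        (tweet_id, full_tweet, word, word_scores.get(word))
        for tweet_id, full_tweet, word in rows
    ]

rows = [(1, "happy days", "happy"), (1, "happy days", "days")]
# The score for "happy" is hypothetical.
for row in left_outer_join(rows, {"happy": 3}):
    print(row)
```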

The next step is to group by the id and the full_tweet and sum the sentiment score for each tweet, which gives us our final output.
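A Python equivalent of this aggregation might look like the sketch below. Treating a missing score as 0 is an assumption on my part, made so that tweets with no matching words end up with a total of 0.

```python
from collections import defaultdict

def sum_by_tweet(joined_rows):
    """Group (id, full_tweet, word, score) rows by tweet and sum the scores."""
    totals = defaultdict(int)
    for tweet_id, full_tweet, _word, score in joined_rows:
        # Assumption: a word with no match (score None) contributes 0.
        totals[(tweet_id, full_tweet)] += score or 0
    return [
        (tweet_id, full_tweet, total)
        for (tweet_id, full_tweet), total in sorted(totals.items())
    ]

joined = [
    (1, "happy days", "happy", 3),
    (1, "happy days", "days", None),
    (2, "awful awful rain", "awful", -3),
    (2, "awful awful rain", "awful", -3),
    (2, "awful awful rain", "rain", None),
]
print(sum_by_tweet(joined))
```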

The Final Output

[Figure: final output with the sentiment score for each tweet]

We now have an output that lists each tweet alongside its sentiment score.

A Brief Comment and Analysis

There are many challenges in performing sentiment analysis on Tweets. From a casual inspection of the output file I can see several cases of falsely detected sentiment. For example, in the word list “no” carries a sentiment of -1, so the phrase “no (-1) problem (-2)” scores -3 when it is in fact a positive phrase. However, -3 is only weakly negative (my output has scores ranging from -33 to +20), and from casual inspection false sentiment detection appears to be much less common among strongly positive or negative tweets. This seems logical and is an area for further analysis in the future.

Only a very small number of words in any given Tweet successfully match. This is most likely because the sentiment word list was largely designed for well-written English, whereas many Tweets contain slang, abbreviations and misspellings. Another area for improvement could be to include emoticons in the analysis, since this is potentially a very powerful way of deriving sentiment from Tweets.
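The “no problem” case is easy to reproduce. The short Python sketch below uses only the two scores quoted above and prints the per-word breakdown, which is a handy way to see why a tweet received a surprising total; the function name is my own.

```python
# The two scores quoted above from the word list.
WORD_SCORES = {"no": -1, "problem": -2}

def score_breakdown(text, word_scores=WORD_SCORES):
    """Return the (word, score) pairs for a tweet together with their sum."""
    pairs = [(word, word_scores.get(word, 0)) for word in text.lower().split()]
    return pairs, sum(score for _, score in pairs)

pairs, total = score_breakdown("no problem")
print(pairs)  # [('no', -1), ('problem', -2)]
print(total)  # -3
```

Because each word is scored in isolation, the method has no way of knowing that the two words together form a positive phrase.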
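One way to put a number on the match rate would be to compute, per tweet, the fraction of words that appear in the word list. The sketch below is a hypothetical helper, not part of the Talend job, and the one-word list is illustrative.

```python
def match_rate(text, word_scores):
    """Fraction of the words in a tweet that appear in the word list."""
    words = text.lower().split()
    if not words:
        return 0.0
    matched = sum(1 for word in words if word in word_scores)
    return matched / len(words)

# Slang such as "gr8" and "2day" will not match a list built for
# well-written English.
print(match_rate("gr8 weather 2day so happy", {"happy": 3}))
```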

If you are interested in another example of this sort of analysis and a more in-depth explanation, take a look at the following scientific paper, in which the researchers compare the effectiveness of Twitter sentiment analysis with traditional opinion polling.

http://www.cs.cmu.edu/~nasmith/papers/oconnor+balasubramanyan+routledge+smith.icwsm10.pdf