Monday, July 18, 2011

Data Mining #NCDs Hash Tag with the Python Twitter API

I had to do some data mining work for my job at www.ncdaction.org, so I thought I would post my research on my blog. With the help of the O'Reilly book
Mining the Social Web, it was extremely easy to get good Twitter data using the
Python Twitter API.

To mimic the results: 
1. Install NetworkX: $ easy_install networkx
2. Install the Python Twitter API from http://code.google.com/p/python-twitter/
a. $ sudo python setup.py build
b. $ sudo python setup.py install
3. Install NumPy: $ easy_install numpy
4. Install the Twitter command-line tools: $ easy_install twitter

All of the source code is adapted from the O'Reilly Mining the Social Web book, which can be found here: https://github.com/ptwobrussell/Mining-the-Social-Web. I analyzed the last 1500 tweets that contained words related to NCDs.
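
For reference, here is a minimal sketch of how the tweets can be collected, adapted from the book's search example. The v1 Search API caps results at roughly 1500 tweets (15 pages of 100 results each), which is where the "last 1500 tweets" figure comes from; the query "#NCDs" below is a stand-in for the full set of NCD-related terms I searched.

import twitter

# The (2011-era) Search API returns at most ~1500 tweets: 15 pages x 100.
twitter_search = twitter.Twitter(domain="search.twitter.com")

search_results = []
for page in range(1, 16):
    # q is the search query; rpp is "results per page".
    search_results.append(twitter_search.search(q="#NCDs", rpp=100, page=page))

# Flatten the paged responses into a single list of tweet texts.
tweets = [r['text'] for result in search_results for r in result['results']]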


Results:
Total Words: 15964
Unique Words: 1694
Lexical Diversity: 0.106113755951
Avg Words Per Tweet: 19.0501193317


Discussion:
A lexical diversity of 0.106 means that roughly 1 out of every 10 words in the
aggregated tweets is unique. Given that the average tweet contains about 19
words, that translates to roughly 2 unique words per tweet. This can be
interpreted as meaning that each tweet carries about 10 percent unique
information.
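
These summary statistics follow the book's word-count recipe; a minimal sketch, assuming tweets is the list of tweet texts collected above:

words = []
for t in tweets:
    words += [w for w in t.split()]

print "Total Words:", len(words)
print "Unique Words:", len(set(words))
# Lexical diversity is the ratio of unique words to total words.
print "Lexical Diversity:", 1.0 * len(set(words)) / len(words)
print "Avg Words Per Tweet:", 1.0 * len(words) / len(tweets)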


Results: 
A Frequency Distribution of the 50 Most/Least Frequent Terms


Fifty most frequent tokens: [u'#NCDs', u'to', u'#LIVESTRONG', u'the', u'on', u'in', u'global', u'cancer', u'fight', u'leaders', u'crisis.', u'up', u'step', u'and', u'world', u'calling', u'me', u'Join', u'RT', u'&', u'NCDs', u'bit.ly/signonnow', u'a', u'for', u'of', u'#IAS2011', u'#hpm', u'pleaseRT;', u'http://t.co/TVUURLS', u'please', u'@Kirk_Dicko:', u'care', u'sign', u'4', u'#HIVNCD', u'are', u'1,000', u'@PeterASinger', u'Cr', u'Fin', u'Raise', u'Rs', u'Shriram', u'To', u'Transport', u'Via', u'you', u'@jfclearywisc:', u'PLEASE', u'not']


Fifty least frequent tokens: [u'track', u'treated', u'treatment,', u'truckloads', u'try', u'tr\xe8s', u'tudo', u'tuned:', u'tweets', u'twittformation', u'u', u'unanse', u'unirse', u'univeral', u'urban/industrial/globalization', u'urged', u'us?', u'use', u"vbygshljlkcagrj.?wn?p!ee'", u'vertelt', u'very', u'vida.', u'voc\xea', u'vortex', u'vou', u'voy', u'waarom', u'walk', u'want', u'warrants..', u'websites', u'welcome!!!', u'welcomes', u'well', u'when', u'whether', u'word', u'words?', u'worst', u'would', u'www.livestrong.org/signon', u'www.ncdalliance.org/100days', u'you.', u'young', u'your', u'zij', u'\xd1CDS', u'\xe0', u'\xe9', u'\u2013']
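
The book computes these with nltk.FreqDist; a roughly equivalent sketch using only the standard library (Python 2.7's collections.Counter), reusing the words list from the earlier step:

from collections import Counter

freq_dist = Counter(words)

# most_common() returns (token, count) pairs sorted from most to least frequent.
print [token for token, count in freq_dist.most_common(50)]
print [token for token, count in freq_dist.most_common()[-50:]]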




Extracting Relationships from the Tweets


Using the NetworkX graph package, I created a DiGraph of the retweet relationships in the last 1500 tweets.
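
A sketch of the graph construction, adapted from the book's retweet-parsing example: the regex matches the "RT @user" and "via @user" conventions, and each retweet adds a directed edge from the original author to the retweeter.

import re
import networkx as nx

RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def get_rt_sources(tweet_text):
    # Return the @users a tweet credits as its source ("RT @user", "via @user").
    return [source.strip()
            for groups in RT_PATTERN.findall(tweet_text)
            for source in groups
            if source.strip().lower() not in ("rt", "via")]

g = nx.DiGraph()
for result in search_results:
    for tweet in result['results']:
        for rt_source in get_rt_sources(tweet['text']):
            # Directed edge: original author -> retweeter.
            g.add_edge(rt_source, tweet['from_user'], tweet_id=tweet['id'])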

Results:
Number of Nodes:  179
Number of Edges:  124
Connected Components: 60
Degree of Each Node:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 4, 1, 2, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 16, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 5, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 4, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 2, 7, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]

Discussion:
The number of nodes in the graph tells us that out of 1500 tweets, 179
users were involved in retweet relationships with one another, with 124
connections (edges) between them. The number of connected components tells us
that the graph consists of 60 subgraphs and is not fully connected; in other
words, there are 60 separate networks of people who do not retweet each
other. Finally, since most of the node degrees are 1, the majority of users
have a retweet relationship with only one other person.
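
The numbers above map directly onto NetworkX calls; a minimal sketch (connected components are computed on the undirected view of the directed graph):

print "Number of Nodes: ", g.number_of_nodes()
print "Number of Edges: ", g.number_of_edges()
print "Connected Components:", nx.number_connected_components(g.to_undirected())
print "Degree of Each Node:"
print [g.degree(n) for n in g.nodes()]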

Visualization of Subgraphs