Research for Global Development

Analyzing #HIV tweets using Excel VBA


Today, interested citizens around the world are generating enormous amounts of data on countless topics of social interest. The challenge is distilling that volume of data into something easier to analyze. While there are countless web-based tools, apps, and services, how might one conduct desktop analysis of such large and varied data using commonly available commercial and open-source tools?

Using a free tool called Archivist by Mix Online (http://archivist.visitmix.com/), we collected one month’s worth of tweets containing a specific hashtag (#HIV), which yielded 44,269 tweets from 10,070 unique users talking about HIV on the web. We then set about answering questions such as: who are the most influential people, who are the most engaged users, what links and sources do they share with each other, and what other key terms related to HIV interest these users?
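As a rough illustration of the first step, here is a minimal Python sketch of loading an Archivist-style XML export and counting tweets and unique users. The element names (`tweet`, `user`, `text`) are assumptions for illustration; the actual Archivist schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of an Archivist-style XML export (schema assumed).
SAMPLE = """<tweets>
  <tweet><user>alice</user><text>New #HIV testing resources http://t.co/abc</text></tweet>
  <tweet><user>bob</user><text>RT @alice: New #HIV testing resources</text></tweet>
  <tweet><user>alice</user><text>#HIV awareness day</text></tweet>
</tweets>"""

root = ET.fromstring(SAMPLE)
tweets = root.findall("tweet")                      # every <tweet> element
unique_users = {t.findtext("user") for t in tweets} # deduplicated author names
print(len(tweets), len(unique_users))               # 3 tweets from 2 unique users
```

On the real export the same two lines would produce the 44,269-tweet and 10,070-user totals reported above.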

The charts below show the top ten most mentioned users, key terms, most frequent retweeters, highest-volume tweeters, and most shared links on the topic of HIV. To speed up the data parsing, analysis, and visualization, VBA scripts were developed.
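The "top ten" rankings above are all frequency counts over fields extracted from tweet text. A sketch of that logic in Python (the sample tweets and regexes are illustrative, not the toolkit's actual VBA code):

```python
import re
from collections import Counter

# Toy (user, text) pairs; real data would come from the parsed export.
tweets = [
    ("userA", "RT @newscientist: HIV vaccine trial http://t.co/1YZK014r"),
    ("userB", "@newscientist great piece on #HIV"),
    ("userA", "Get tested! http://t.co/1YZK014r #HIV"),
]

# Count @mentions, shared links, retweeters, and per-user tweet volume.
mentions   = Counter(m for _, text in tweets for m in re.findall(r"@(\w+)", text))
links      = Counter(l for _, text in tweets for l in re.findall(r"https?://\S+", text))
retweeters = Counter(user for user, text in tweets if text.startswith("RT @"))
volume     = Counter(user for user, _ in tweets)

print(mentions.most_common(10))  # top mentioned accounts
print(links.most_common(10))     # top shared links
```

`Counter.most_common(10)` gives exactly the top-ten lists the charts are built from.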

During the study period, the Twitter account “@newscientist” was the most mentioned account on the topic of HIV, while “@onetoughpirate” was the most active user, with the highest volume of tweets on the topic. The most shared link was “National HIV and STD Testing Resources” (http://t.co/1YZK014r), and the user with the most retweets over the study period (May 18th – July 19th) was @BTPMAY19.

How does our toolkit work? It automatically breaks the original master XML file into organized data sets on separate sheets of an Excel workbook. From there it is a fairly straightforward process to get a one-page set of detailed graphs of the dataset. The toolkit was developed in VBA for Excel and contains step-by-step instructions.
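The XML-to-sheets step can be sketched outside Excel as well. Below, a Python version writes the parsed records to a CSV file standing in for one worksheet; the field names (`user`, `date`, `text`) are assumed for illustration, and the actual toolkit does this with VBA inside the workbook.

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical mini export; the real Archivist file is much larger.
SAMPLE = """<tweets>
  <tweet><user>alice</user><date>2012-05-18</date><text>#HIV stats</text></tweet>
  <tweet><user>bob</user><date>2012-05-19</date><text>RT @alice: #HIV stats</text></tweet>
</tweets>"""

root = ET.fromstring(SAMPLE)
rows = [(t.findtext("user"), t.findtext("date"), t.findtext("text"))
        for t in root.findall("tweet")]

# One "sheet" per data set; a CSV file stands in for an Excel worksheet here.
with open("tweets.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user", "date", "text"])  # header row
    writer.writerows(rows)                     # one row per tweet
```

Each category (mentions, links, retweets) would get its own file or sheet in the same way.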

We found it to be a relatively simple, fast, and efficient way to use Twitter data for analysis. Once our tool is a bit further along, digital researchers will be able to use it with the Archivist tool to mine large quantities of Twitter social data on their desktops. The main goal of the toolkit is to simplify large data sets and to search for specific information in tweets by using the metadata contained in the updates of millions of users. The toolkit could also be customized with more complex filters, such as demographics, gender, sentiment, influencers, and source aggregators. This is a simple way to find out how many of the roughly 340 million tweets per day are about your topic of interest, and to convert that information into actionable insights for your business.
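A minimal sketch of such a filter, assuming simple keyword matching; richer filters like sentiment or demographics would require additional metadata or external models not shown here:

```python
# Toy tweet records; field names are illustrative.
tweets = [
    {"user": "alice", "text": "Free HIV testing this weekend"},
    {"user": "bob",   "text": "HIV vaccine research update"},
    {"user": "carol", "text": "Weather is nice today"},
]

def filter_tweets(tweets, keywords):
    """Keep tweets whose text contains any of the given keywords (case-insensitive)."""
    kws = [k.lower() for k in keywords]
    return [t for t in tweets if any(k in t["text"].lower() for k in kws)]

matched = filter_tweets(tweets, ["testing", "vaccine"])
print(len(matched))  # 2
```

Chaining several such predicates (by source, date, keyword) is all a "complex filter" amounts to at this level.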

While Excel may not be the most efficient or robust way to manage and analyze non-tabular data like XML or JSON, it can serve as a starting point for the tool and is accessible to the wide range of users already familiar with Microsoft Excel.

This was just one approach out of many.  For the next version of the tool, we might move to a purer coding stack rather than Excel.  One thought would be an XML database approach, perhaps using BaseX and XQuery.  Other options might be R with the XML package, or Python with the lxml and NumPy libraries.  What stacks are you using to do similar analysis?  If you’re using commercial vendors or software, does it work well?  Are you able to access the raw data, or does the system simply spit out metrics from its black box of mystery?
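To give a flavor of the Python route, here is a query against the same kind of export using only the standard library's ElementTree module, which supports a limited subset of the XPath syntax that lxml or XQuery would offer in full. The `lang` attribute is an assumed field for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical export with a per-tweet language attribute (assumed schema).
SAMPLE = """<tweets>
  <tweet lang="en"><user>alice</user><text>#HIV news</text></tweet>
  <tweet lang="fr"><user>bob</user><text>#HIV info</text></tweet>
</tweets>"""

root = ET.fromstring(SAMPLE)
# Limited XPath: select the authors of English-language tweets only.
english_users = [t.findtext("user") for t in root.findall(".//tweet[@lang='en']")]
print(english_users)  # ['alice']
```

With lxml or an XML database like BaseX, the same selection becomes a one-line XPath/XQuery expression over the full file, with no spreadsheet in the loop.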

InterMedia

