Persistent Tweet Collection in Python

Last year, my brother and I began a project that required collecting lots and lots of tweets to analyze. So far, we’ve collected over 9.5 million geo-located tweets from roughly 20 US cities. Here’s how we did it.

Twitter is a massive global platform for social media that may or may not help people overthrow governments. Twitter provides a public API for developers and researchers. So, if you want to make yourself a dataset of tweets, sign up.

This post will explain how to use the streaming API to collect a sample of tweets in real time using Python 2.7 and the twython module. I get roughly 1.5 to 2 tweets per second using this method. The streaming API also allows you to set parameters that filter the tweets you want to collect.

After importing a few modules (including twython), we set our OAuth tokens and keys. You can find these when you register a new application through the Twitter developers site. Next, we create a new class called MyStreamer that inherits from TwythonStreamer. We will create an instance of this class every time we start streaming tweets from the API.
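As a rough sketch of the kind of class described above, assuming twython's TwythonStreamer hooks (on_success, on_error, disconnect): the limit keyword and the tweets list are names chosen here for illustration, and the small stand-in class exists only so the snippet can run where twython is not installed.

```python
try:
    from twython import TwythonStreamer
except ImportError:
    # Stand-in (not part of twython) so the sketch runs without the library.
    class TwythonStreamer(object):
        def __init__(self, app_key, app_secret, oauth_token, oauth_token_secret):
            pass

        def disconnect(self):
            pass


class MyStreamer(TwythonStreamer):
    """Buffers incoming tweets in memory and stops after `limit` of them."""

    def __init__(self, *args, **kwargs):
        # `limit` is a hypothetical keyword; 1,000 matches the batch size in the post.
        self.limit = kwargs.pop('limit', 1000)
        self.tweets = []
        super(MyStreamer, self).__init__(*args, **kwargs)

    def on_success(self, data):
        # The stream also sends non-tweet messages (limit notices, for example);
        # keep only messages that carry tweet text.
        if 'text' in data:
            self.tweets.append(data)
        if len(self.tweets) >= self.limit:
            self.disconnect()

    def on_error(self, status_code, data, headers=None):
        # Stop streaming on an HTTP error and let the caller decide what to do next.
        self.disconnect()
```

With real credentials, creating a MyStreamer and calling its statuses.filter method would open the stream, and on_success would then be called once per incoming message.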

The program then runs an infinite while loop. Every 1,000 tweets collected, it pauses, writes the tweets to a time-stamped compressed CSV file, and then resumes. The 1,000-tweet threshold limits how many tweets are lost if the streaming object throws an error. It would probably be better to write tweets directly to a SQL database, but that’s beyond the scope of this post. If the program cannot connect to Twitter for any reason (a dropped network connection, for example), it will try again once every 60 seconds.
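The loop described above could be sketched roughly as follows. This is an illustration written for Python 3 and the standard library only, not the original listing: run_stream is a hypothetical callable that opens a stream and returns one batch of tweet dictionaries, and the column choice in FIELDS, the file name pattern, and the max_batches and sleep arguments are assumptions added to keep the example small and testable.

```python
import csv
import gzip
import os
import time

# Hypothetical column choice; a real run would likely keep more fields.
FIELDS = ['id_str', 'created_at', 'text', 'lon', 'lat']


def tweet_to_row(tweet):
    # 'coordinates' is a GeoJSON point ([longitude, latitude]) or None.
    point = (tweet.get('coordinates') or {}).get('coordinates') or [None, None]
    return [tweet.get('id_str'), tweet.get('created_at'), tweet.get('text'),
            point[0], point[1]]


def write_batch(tweets, out_dir='.'):
    """Write one batch to a time-stamped, gzip-compressed CSV file."""
    stamp = time.strftime('%Y%m%d-%H%M%S')
    # Two batches written within the same second would share a file name.
    path = os.path.join(out_dir, 'tweets-%s.csv.gz' % stamp)
    with gzip.open(path, 'wt', newline='', encoding='utf-8') as fh:
        writer = csv.writer(fh)
        writer.writerow(FIELDS)
        writer.writerows(tweet_to_row(t) for t in tweets)
    return path


def collect(run_stream, out_dir='.', retry_wait=60, max_batches=None,
            sleep=time.sleep):
    """Collect batches until max_batches is reached (forever if it is None)."""
    paths = []
    while max_batches is None or len(paths) < max_batches:
        try:
            tweets = run_stream()
        except Exception as exc:
            # Any failure (including a lost connection): wait, then try again.
            print('stream failed (%s); retrying in %d seconds' % (exc, retry_wait))
            sleep(retry_wait)
            continue
        if tweets:
            paths.append(write_batch(tweets, out_dir))
    return paths
```

Passing sleep and max_batches as arguments is only a convenience for trying the loop out without waiting a full minute or running forever; a long-running collector would leave both at their defaults.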

The code below requires Python 2.7 and a few special modules including twython. Remember to supply your APP_KEY, APP_SECRET, OAUTH_TOKEN, and OAUTH_TOKEN_SECRET.
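One way to supply the four values without pasting them into the script is to read them from environment variables. This is only a sketch: using environment variables at all, and naming them after the four constants, are assumptions made here rather than anything the original code does.

```python
import os

# The four credential names used in the post.
REQUIRED = ('APP_KEY', 'APP_SECRET', 'OAUTH_TOKEN', 'OAUTH_TOKEN_SECRET')


def load_credentials(env=None):
    """Return the four credentials as a tuple, in the order of REQUIRED."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Fail early with a readable message instead of a rejected connection later.
        raise RuntimeError('missing credentials: ' + ', '.join(missing))
    return tuple(env[name] for name in REQUIRED)
```

The returned tuple is in the same order as the names listed above, so it can be unpacked straight into the four constants.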

Finally, because this code pulls geo-located tweets only, you should set the area you’re interested in. In stream.statuses.filter, alter the locations parameter so that the first two coordinates are the south-western longitude and latitude and the next two are the north-eastern longitude and latitude. Note the order: Twitter expects longitude first. You can string together several bounding boxes (up to 25, I think), but Twitter does not strictly follow the provided bounds, so some tweets may fall outside them.
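To make the coordinate order concrete, here is a small sketch that builds the locations string and re-checks incoming tweets against the boxes on the client side, since Twitter does not strictly honor the bounds. The helper names (bbox, locations_param, in_any_box) are made up for this example, the two boxes are rough ones for San Francisco and New York, and the limit of 25 is the figure recalled above rather than something verified here.

```python
MAX_BOXES = 25  # limit as recalled in the post; check Twitter's current documentation


def bbox(sw_lon, sw_lat, ne_lon, ne_lat):
    """Return one bounding box, south-west corner first, longitude before latitude."""
    if not (-180 <= sw_lon < ne_lon <= 180):
        raise ValueError('longitudes must satisfy -180 <= sw_lon < ne_lon <= 180')
    if not (-90 <= sw_lat < ne_lat <= 90):
        raise ValueError('latitudes must satisfy -90 <= sw_lat < ne_lat <= 90')
    return (sw_lon, sw_lat, ne_lon, ne_lat)


def locations_param(boxes):
    """Join several boxes into one comma-separated string."""
    if len(boxes) > MAX_BOXES:
        raise ValueError('too many bounding boxes')
    return ','.join(str(value) for box in boxes for value in box)


def in_any_box(tweet, boxes):
    """Client-side check: is the tweet's exact point inside one of the boxes?"""
    point = (tweet.get('coordinates') or {}).get('coordinates')
    if not point:
        return False  # no exact point attached to this tweet
    lon, lat = point
    return any(b[0] <= lon <= b[2] and b[1] <= lat <= b[3] for b in boxes)


boxes = [bbox(-122.75, 36.8, -121.75, 37.8),  # roughly San Francisco
         bbox(-74, 40, -73, 41)]              # roughly New York City
print(locations_param(boxes))  # -122.75,36.8,-121.75,37.8,-74,40,-73,41
```

The resulting string is what would be passed as the locations argument. Note that in_any_box drops tweets that carry only a place tag and no exact point, which may or may not be what you want.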

Written by:

Benjamin Radford is a data scientist and political scientist. He received his Ph.D. in Political Science from Duke University where he studied security, peace, & conflict and political methodology. He specializes in data science, cybersecurity, political forecasting, and arms proliferation. He is currently a Principal Data Scientist with Sotera Defense Solutions.