Mining Twitter for useful information

May 6, 2012

The tale of a weekend project

I love the occasional weekend project. Getting something launched quickly can be rejuvenating, especially if you're in the middle of a long-term project like a startup.

In this blog post I'm going to walk through the code at the core of twtspire, a small site I built in a weekend. The point of the site is to mine Twitter for app and website ideas that would make for fun and/or useful projects to work on. I figured people would be venting on Twitter about pain points that could be solved with an app, or just tossing out app ideas, and I wanted to surface those ideas so that I could build stuff people actually wanted.

Twitter has a great API that allows you to search for tweets by keywords or phrases, and querying it from Python takes only a few lines.
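The original snippet is not reproduced here, so what follows is only a minimal sketch of what such a search request might look like. It assumes the 2012-era Search API endpoint (since retired), its `rpp` results-per-page parameter, and a JSON response with a top-level `results` list. The helper names `build_search_url` and `extract_texts` are hypothetical, and the sketch uses Python 3's standard library rather than whatever the original script used.

```python
import json
from urllib.parse import urlencode

# Assumption: the 2012-era Search API endpoint. It has since been retired,
# so this URL is illustrative and will not return results today.
SEARCH_URL = "http://search.twitter.com/search.json"


def build_search_url(query, per_page=100):
    """Return the request URL for a keyword search (hypothetical helper)."""
    # Assumption: "rpp" was the results-per-page parameter of that API.
    return SEARCH_URL + "?" + urlencode({"q": query, "rpp": per_page})


def extract_texts(body):
    """Pull tweet texts out of a JSON response body.

    Assumed response shape: {"results": [{"text": "..."}, ...]}.
    """
    return [tweet["text"] for tweet in json.loads(body).get("results", [])]
```

Fetching the URL and passing the response body to `extract_texts` would give a plain list of tweet texts, which is all the later filtering steps need.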

My first step was finding the best search queries to use. The type of tweets I wanted went something like: "I wish someone would build a site that...". Of course, the variations in phrasing that people could and do use are countless, so I figured the best approach would be to use multiple search queries that are very permissive (i.e. allow for extra words and variations in the words) and then do a second filtration step, using regex to discard the results that differed too wildly from the desired phrase.

The search queries that I came up with were:

I avoided a phrase search (i.e. enclosing the query in quotes) to allow for filler words like adjectives and adverbs, e.g. "Somebody really needs to create a site...".

The next step was to filter the results further by keyword order (results where the words in my query were out of order were rejected) and separation distance (if the words are too far apart then they're probably not what I'm looking for). In order to do this I built a regex from the search query.

For the query "wish there was app", this code would generate a regex like so:

wish(\s\S+){0,2}\sthere(\s\S+){0,2}\swas(\s\S+){0,2}\sapp

The "\s" matches any single whitespace character, and "\S+" matches any sequence of non-whitespace characters, so "\s\S+" would match a word in the tweet including the whitespace preceding it. "(\s\S+){0,2}" is a convenient way of saying allow zero to two words separating the keywords -- the curly braces are repetition operators.
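The regex-building code itself is missing from this copy of the post. As a rough sketch, a pattern like the one above could be generated as follows. The function name `build_regex`, the `max_gap` parameter, and the choice to ignore case are assumptions, not details from the original.

```python
import re


def build_regex(query, max_gap=2):
    """Build an ordered, bounded-gap pattern from a search query.

    Each pair of adjacent keywords may be separated by at most `max_gap`
    extra words, and the keywords must appear in query order.
    """
    # "(\s\S+){0,N}\s" allows zero to N filler words, then the whitespace
    # that precedes the next keyword.
    gap = r"(\s\S+){0,%d}\s" % max_gap
    words = [re.escape(word) for word in query.split()]
    # Assumption: matching is case-insensitive.
    return re.compile(gap.join(words), re.IGNORECASE)


pattern = build_regex("wish there was app")

# Keywords in order with one filler word ("an"): accepted.
in_order = pattern.search("I wish there was an app for that")

# Same keywords, wrong order: rejected.
out_of_order = pattern.search("There was an app I wish I had kept")

# Too many words between "wish" and "there": rejected.
too_far = pattern.search("I wish that one day soon there was an app")
```

Note that the last keyword is not anchored to a word boundary, so "app" also matches the start of a longer word such as "apps".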

It's pretty quick and dirty, but it worked surprisingly well. The search query and filtration steps produced decent results. Unfortunately, it also returned a lot of job postings and retweets, so I created a blacklist to detect the presence of words like "odesk".
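A blacklist check of this kind can be sketched in a few lines. Only "odesk" comes from the post. The `is_retweet` prefix test and the function names are assumptions added for illustration, since the original may have handled retweets through the word list alone.

```python
# Only "odesk" is taken from the post; extend the tuple as needed.
BLACKLIST = ("odesk",)


def is_retweet(text):
    """Assumption: treat tweets that start with "RT @" as retweets."""
    return text.lstrip().lower().startswith("rt @")


def is_unwanted(text, blacklist=BLACKLIST):
    """Return True for retweets and tweets containing a blacklisted word."""
    lowered = text.lower()
    return is_retweet(text) or any(word in lowered for word in blacklist)
```

Lowercasing the tweet once before the substring checks keeps the blacklist case-insensitive without needing a regex.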

Putting this all together, the final code looked like this. I'm well aware that this script is very inefficient. The queries and regexes could all be merged together, so you wouldn't have to do a request per phrase... I'll just leave that as an exercise for the reader ;)
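The final script is not included in this copy of the post. The sketch below shows one way the pieces described so far might fit together, not the original code. The API call is passed in as a `fetch` function so that no particular client is assumed. `find_ideas`, `keep`, and `fake_fetch` are hypothetical names, and the single query and blacklist word are the only ones quoted in the post.

```python
import re

QUERIES = ["wish there was app"]  # the one query quoted in the post
BLACKLIST = ("odesk",)            # the one blacklisted word quoted in the post


def build_regex(query, max_gap=2):
    """Keywords in order, with at most `max_gap` filler words between them."""
    gap = r"(\s\S+){0,%d}\s" % max_gap
    return re.compile(gap.join(re.escape(w) for w in query.split()), re.IGNORECASE)


def keep(text, pattern):
    """Accept a tweet that matches the pattern and has no blacklisted word."""
    lowered = text.lower()
    if any(word in lowered for word in BLACKLIST):
        return False
    return pattern.search(text) is not None


def find_ideas(fetch, queries=QUERIES):
    """Run every query through `fetch` and keep the tweets that pass the filters.

    `fetch(query)` must return an iterable of tweet texts. It stands in for
    the real search request, one request per query as in the post.
    """
    ideas = []
    for query in queries:
        pattern = build_regex(query)
        ideas.extend(text for text in fetch(query) if keep(text, pattern))
    return ideas


def fake_fetch(query):
    """Canned results for demonstration; a real fetch would call the API."""
    return [
        "I wish there was an app that split bills",
        "Hiring on odesk: wish there was an app developer",
        "There was an app I wish I had kept",
    ]


ideas = find_ideas(fake_fetch)
```

With the canned data, only the first tweet survives: the second contains a blacklisted word and the third has the keywords out of order.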

On a free Amazon EC2 micro instance, I set up a cron job that runs every few minutes and cycles through the different search queries. You can check out the end result at twtspire, which I'm pretty happy with. I still check it periodically. In fact, I'm working on a small app based on an idea that kept popping up on the site!

If you like this blog post, you can follow me on Twitter.