Simple Python Twitter Search API Crawler Class

September 27, 2009 at 03:55 PM | categories: Python

I've been getting into Twitter (I'm @niallohiggins, by the way) a bit recently. One of the things I wanted to do was write a little program to periodically search for a specific tag and then process the results. The Twitter Search API is very easy to use, even if it has some annoying issues. Here is a very simple class I wrote to issue searches and return the results. It also keeps track of the high water mark (max_id) of the previous search, so you hopefully won't get the same results twice, although you still want to code defensively against duplicates in case there is a bug on Twitter's end. Feel free to use this code yourself. Note that you'll have to implement your own 'submit' method; a sketch of one possible subclass follows the class below.
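Under the hood, each search is just an HTTP GET against the JSON endpoint, with the tag as the q parameter and the high water mark as since_id. An illustrative request (the since_id value here is made up) looks like:

    http://search.twitter.com/search.json?q=%23python&since_id=4398123794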

import httplib
import json
import logging
import socket
import time
import urllib

SEARCH_HOST="search.twitter.com"
SEARCH_PATH="/search.json"


class TagCrawler(object):
    ''' Crawl twitter search API for matches to specified tag.  Use since_id to
    hopefully not submit the same message twice.  However, bug reports indicate
    since_id is not always reliable, and so we probably want to de-dup ourselves
    at some level '''

    def __init__(self, max_id, tag, interval):
        self.max_id = max_id
        self.tag = tag
        self.interval = interval
        
    def search(self):
        c = httplib.HTTPConnection(SEARCH_HOST)
        params = {'q' : self.tag}
        if self.max_id is not None:
            params['since_id'] = self.max_id
        path = "%s?%s" %(SEARCH_PATH, urllib.urlencode(params))
        try:
            c.request('GET', path)
            r = c.getresponse()
            data = r.read()
            c.close()
            try:
                result = json.loads(data)
            except ValueError:
                return None
            if 'results' not in result:
                return None
            self.max_id = result['max_id']
            return result['results']
        except (httplib.HTTPException, socket.error, socket.timeout), e:
            logging.error("search() error: %s" %(e))
            return None

    def loop(self):
        while True:
            logging.info("Starting search")
            data = self.search()
            if data:
                logging.info("%d new result(s)" %(len(data)))
                self.submit(data)
            else:
                logging.info("No new results")
            logging.info("Search complete sleeping for %d seconds"
                    %(self.interval))
            time.sleep(float(self.interval))

    def submit(self, data):
        pass
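
For example, here is a minimal sketch of how you might subclass TagCrawler, with a belt-and-braces de-dup on tweet id in submit as suggested above. MyTagCrawler is just my name for it, and the 'id', 'from_user' and 'text' fields reflect the result objects the Search API returns as of this writing:

class MyTagCrawler(TagCrawler):
    ''' Print each new match, de-duping on tweet id in case since_id
    misbehaves. '''

    def __init__(self, max_id, tag, interval):
        TagCrawler.__init__(self, max_id, tag, interval)
        self.seen_ids = set()

    def submit(self, data):
        for tweet in data:
            # Skip anything we've already handled, regardless of since_id
            if tweet['id'] in self.seen_ids:
                continue
            self.seen_ids.add(tweet['id'])
            print '%s: %s' % (tweet['from_user'], tweet['text'])


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    # Start with no high water mark, search for #python every 5 minutes
    MyTagCrawler(None, '#python', 300).loop()

The seen_ids set grows without bound, so for a long-running crawler you'd want to cap it or persist it somewhere.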

Niall O'Higgins is an author and software developer. He wrote the O'Reilly book MongoDB and Python. He also develops Strider Open Source Continuous Deployment and offers full-stack consulting services at FrozenRidge.co.
