Python 3/4: Context Managers, XML and Networking

This is the third article in a series; to read the entire series, go here.

Writing an RSS Aggregator with Python

For this article I’ll be putting together a simple RSS aggregator. This will shed some light on Python’s XML and web libraries and introduce a way to minimize boilerplate code with context managers.

The final product will read through a file containing a list of RSS feeds and print the names of the channels and the articles they contain to screen.

To use it, type: python RssReader.py feeds.txt

Files Used in this Project

  • RssReader.txt: The source code for this project; rename it to “RssReader.py” after downloading.
  • feeds.txt: A sample input file.

Libraries

import sys
import urllib2
import xml.sax
import xml.sax.handler
  • sys: A library for accessing information passed into or maintained by Python’s interpreter.
  • urllib2: A library for manipulating URLs and opening HTTP streams (a short example follows this list).
  • xml.sax: Read and parse XML documents using SAX, an event-driven parser.
    • Tutorial: A good overview of parsing XML using the SAX library.
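As a quick taste of urllib2 before we wrap it in a context manager later on, urlopen returns a file-like object you can read from and close (the URL below is just a placeholder):

import urllib2

stream = urllib2.urlopen("http://example.com/feed.xml")
print stream.read()[:200]  # the first 200 bytes of the response
stream.close()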

Parsing XML

The RSS feed parser extracts the names of the channels and the titles of each of the articles.

The handler below inherits most of its functionality from its parent class and adds four methods: startDocument, startElement, endElement and characters. Each of these fires as the parser reads through an XML document. Because the methods themselves are generic enough to apply to any XML document, the if/else logic inside them is what tailors the handler to one specific kind of document, in this case RSS.
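To get a feel for how these events fire, here is a throwaway handler, not part of the final script, that simply prints each event as the parser reaches it:

import xml.sax
import xml.sax.handler

class EchoHandler(xml.sax.handler.ContentHandler):
  def startElement(self, name, attrs):
    print "start:", name

  def endElement(self, name):
    print "end:", name

  def characters(self, content):
    if content.strip():
      print "text:", content.strip()

# Prints a start/text/end event for each element in the snippet.
xml.sax.parseString("<channel><title>Hello</title></channel>", EchoHandler())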

class RssHandler(xml.sax.handler.ContentHandler):
  def startDocument(self):
    self.inItem = False
    self.inTitle = False
    self.channels = {}
    self.currentChannel = None
    self.str = ""

  def startElement(self, name, attrs):
    lname = name.lower()

    if lname == "item":
      self.inItem = True
    elif lname == "title":
      self.inTitle = True

  def endElement(self, name):
    lname = name.lower()

    if lname == "item":
      self.inItem = False
    elif lname == "title":
      self.inTitle = False
      if self.inItem:
        # A title inside an <item> is an article title.
        self.channels[self.currentChannel] += [self.str]
      else:
        # A title outside an <item> names the channel itself.
        self.currentChannel = self.str
        if self.currentChannel not in self.channels:
          self.channels[self.currentChannel] = []
      self.str = ""

  def characters(self, content):
    if self.inTitle:
      self.str += content

This class reads through an XML document and accumulates a dictionary of the form: {"Channel 1":["Article 1", "Article 2"], "Channel 2":["Article 3"]}
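As a quick sanity check, not part of the final script, you can feed the handler a small hand-written snippet and inspect the dictionary it builds:

sample = ("<rss><channel><title>Channel 1</title>"
          "<item><title>Article 1</title></item>"
          "<item><title>Article 2</title></item>"
          "</channel></rss>")

handler = RssHandler()
xml.sax.parseString(sample, handler)
print handler.channels  # roughly: {'Channel 1': ['Article 1', 'Article 2']}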

Context Managers and Downloading Webpages

Context Managers

Files, network connections and other stream-based resources generally require some amount of boilerplate to open and close. This both adds uninteresting noise to your code and can cause unexpected bugs when you open a stream but fail to close it. Python deals with this through context managers.

Without context managers printing the lines of a file to screen looks like this:

file = open("infile.txt", "r")
print file.read()
file.close()

Though this is clear and obvious in a small example, in real code an exception raised between opening and closing the file can prevent it from being closed properly. The standard fix is to wrap the work in a try/finally to deal with any exceptions that are thrown in the process:

file = open("infile.txt", "r")
try:
  print file.read()
finally:
  file.close()

The finally block makes certain that, regardless of what happens in the middle, the file gets closed. This is effective, but it is a fair amount of boilerplate for a simple action. Context managers allow us to do this instead:

with open("infile.txt", "r") as file:
  print file.read()

The open function returns a file object that is itself a context manager. Any object can become a context manager by adding an __enter__ method for setup and an __exit__ method for cleanup. Many of Python’s standard stream objects already have these methods.
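As a toy illustration, not part of the project, any class with those two methods works with the with statement:

class Announcer(object):
  def __init__(self, name):
    self.name = name

  def __enter__(self):
    print "entering", self.name
    return self.name

  def __exit__(self, type, value, traceback):
    print "leaving", self.name

# Prints "entering demo", then "inside demo", then "leaving demo".
with Announcer("demo") as name:
  print "inside", name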

Downloading Webpages

Unfortunately the urlopen function from urllib2 does not come with a context manager, so I had to write one myself. Fortunately, it’s easy.

class Url(object):
  """A context manager that opens a URL on entry and closes the stream on exit."""
  def __init__(self, url):
    self.url = url

  def __enter__(self):
    # Open the HTTP stream and hand it to the body of the with block.
    self.stream = urllib2.urlopen(self.url)
    return self.stream

  def __exit__(self, type, value, traceback):
    # Close the stream no matter what happened inside the with block.
    self.stream.close()
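As an aside, the standard library’s contextlib.closing wraps any object that has a close method in a context manager, so it would work here as well; the hand-rolled Url class is mainly a learning exercise (the URL below is a placeholder):

from contextlib import closing

with closing(urllib2.urlopen("http://example.com/feed.xml")) as rss:
  print rss.read()[:200]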

Tying it All Together

Using the XML Parser

def generateRsses(feedFile):
  with open(feedFile, "r") as file:
    urls = [url.strip() for url in file.readlines()]

  for url in urls:
    with Url(url) as rss:
      handler = RssHandler()
      parser = xml.sax.make_parser()
      parser.setContentHandler(handler)
      parser.parse(rss)
      yield handler.channels

def printFeed(rss):
  for channelName in rss.keys():
    print "*** " + channelName + " ***"
    for title in rss[channelName]:
      print "\t" + title

The most complex part of this is the actual parsing of the XML, so I’ll walk through it line by line:

  1. Create the RssHandler (the class created above). This defines how the parser will work.
  2. make_parser creates an object that does the XML heavy lifting.
  3. Attach the handler to the parser.
  4. Do the actual parsing.
  5. Extract the resulting data.

Commandline Arguments

if __name__ == "__main__":
  [scriptName, feedFileName] = sys.argv
  for rss in generateRsses(feedFileName):
    printFeed(rss)

The only new element here is sys.argv, which contains all of the arguments passed to the Python interpreter. The first is the name of the script itself, so the program ignores it; the second is the name of the file containing the list of RSS feeds.
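If you would like the script to fail more gracefully when it’s called with the wrong number of arguments, one possible variant (not in the original script) checks sys.argv before unpacking it:

if __name__ == "__main__":
  if len(sys.argv) != 2:
    print "usage: python RssReader.py <feed-list-file>"
    sys.exit(1)
  for rss in generateRsses(sys.argv[1]):
    printFeed(rss)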

