Virtuous Programmer: Adventures of an Autodidact

15 Nov 2010

Python 3/4: Context Managers, XML and Networking

This is the third article in a series; to read the entire series, go here.

Writing an RSS Aggregator with Python

For this article I'll be putting together a simple RSS aggregator. This will shed some light on Python's XML and web libraries and introduce a way to minimize boilerplate code with context managers.

The final product will read through a file containing a list of RSS feeds and print the names of the channels and the articles they contain to screen.

To use it, type: python RssReader.py feeds.txt

Files Used in this Project

  • RssReader.txt: The source code for this project; rename it to "RssReader.py" after downloading.
  • feeds.txt: A sample input file (an illustrative example of the format appears below).
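
The feed list is just a plain-text file with one RSS feed URL per line. For illustration only, it might look like this (the addresses below are hypothetical placeholders, not the contents of the downloadable sample):

http://example.com/news/rss.xml
http://example.org/blog/feed.xml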

Libraries

import sys
import urllib2
import xml.sax
import xml.sax.handler
  • sys: A library for accessing information passed into or maintained by Python's interpreter.
  • urllib2: A library for manipulating URLs and opening HTTP streams.
  • xml.sax: Parse XML documents using the event-driven SAX API.
    • Tutorial: A good overview of parsing XML with the SAX library.
  • xml.sax.handler: Provides the ContentHandler base class that the feed handler below extends.

Parsing XML

The RSS feed parser extracts the names of the channels and the titles of each of the articles.

The handler below gets most of its functionality from its parent class. Four methods are added: startDocument, startElement, endElement and characters, each of which fires as an XML document is being read. Because the methods themselves are generic enough to apply to any XML document, you need a number of if/else statements inside them to deal with the specific kind of document being parsed.

class RssHandler(xml.sax.handler.ContentHandler):
  def startDocument(self):
    # Reset the parser state at the start of each document.
    self.inItem = False
    self.inTitle = False
    self.channels = {}
    self.currentChannel = None
    self.str = ""

  def startElement(self, name, attrs):
    lname = name.lower()

    if lname == "item":
      self.inItem = True
    elif lname == "title":
      self.inTitle = True

  def endElement(self, name):
    lname = name.lower()

    if lname == "item":
      self.inItem = False
    elif lname == "title":
      self.inTitle = False
      if self.inItem:
        # A title inside an <item> is an article title.
        self.channels[self.currentChannel] += [self.str]
      else:
        # A title outside an <item> names the channel itself.
        self.currentChannel = self.str
        if self.currentChannel not in self.channels:
          self.channels[self.currentChannel] = []
      self.str = ""

  def characters(self, content):
    # Accumulate text while inside a <title> element.
    if self.inTitle:
      self.str += content

This class reads through an XML document and accumulates a dictionary mapping each channel name to a list of its article titles, of the form: {"Channel 1": ["Article 1", "Article 2"], "Channel 2": ["Article 3"]}
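
As a quick, self-contained check (not part of the downloadable script), the handler can be driven directly with xml.sax.parseString and a small inline feed:

import xml.sax

sample = """<rss><channel>
  <title>Channel 1</title>
  <item><title>Article 1</title></item>
  <item><title>Article 2</title></item>
</channel></rss>"""

handler = RssHandler()
xml.sax.parseString(sample, handler)
print handler.channels   # {u'Channel 1': [u'Article 1', u'Article 2']}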

Context Managers and Downloading Webpages

Context Managers

Files, network connections and other objects that stream data generally require some boilerplate to open and close those streams. This adds uninteresting noise to your code and can cause unexpected bugs when you open a stream but fail to close it. Python deals with this through context managers.

Without context managers, printing the contents of a file to the screen looks like this:

file = open("infile.txt", "r")
print file.read()
file.close()

Though this is clear and obvious in a small example, in real code an exception raised between opening and closing the file can prevent the file from ever being closed. The standard fix is to wrap the work in a try/finally that deals with any exceptions thrown in the process:

file = open("infile.txt", "r")
try:
  print file.read()
finally:
  file.close()

The finally clause makes certain that, regardless of what happens in the middle, the file gets closed. This is effective, but it is a fair amount of boilerplate for a simple action. Context managers allow us to do this instead:

with open("infile.txt", "r") as file:
  print file.read()

The open function returns a file object that is itself a context manager. Any object can become a context manager by adding an __enter__ and __exit__ method for setup and cleanup. Many of Python's standard stream objects already have these methods.

Downloading Webpages

Unfortunately, the stream returned by urllib2's urlopen function is not a context manager, so I had to write one myself. Fortunately, it's easy.

class Url():
  """Wraps urllib2.urlopen in a context manager."""
  def __init__(self, url):
    self.url = url

  def __enter__(self):
    # Open the HTTP stream when the with-block is entered.
    self.stream = urllib2.urlopen(self.url)
    return self.stream

  def __exit__(self, type, value, traceback):
    # Close the stream no matter how the with-block exits.
    self.stream.close()
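
If you would rather not write the class yourself, the standard library's contextlib.closing does the same job: it wraps any object that has a close method in a context manager. A minimal sketch (the URL is just a placeholder):

from contextlib import closing

with closing(urllib2.urlopen("http://example.com/feed.xml")) as rss:
  print rss.read()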

Tying it All Together

Using the XML Parser

def generateRsses(feedFile):
  # Read the list of feed URLs, one per line.
  with open(feedFile, "r") as file:
    urls = [url.strip() for url in file.readlines()]

  for url in urls:
    with Url(url) as rss:
      handler = RssHandler()
      parser = xml.sax.make_parser()
      parser.setContentHandler(handler)
      parser.parse(rss)
      yield handler.channels

def printFeed(rss):
  for channelName in rss.keys():
    print "*** " + channelName + " ***"
    for title in rss[channelName]:
      print "\t" + title
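
For reference, calling printFeed on the dictionary shown earlier produces output like the following (the order of the channels may vary, since dictionaries are unordered):

printFeed({"Channel 1": ["Article 1", "Article 2"], "Channel 2": ["Article 3"]})

*** Channel 1 ***
        Article 1
        Article 2
*** Channel 2 ***
        Article 3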

The most complex part of this is the actual parsing of the XML, so I'll walk you through it line by line.

  1. Create the RssHandler (the class created above). This defines how the parser will work.
  2. make_parser creates an object that does the XML heavy lifting.
  3. Attach the handler to the parser.
  4. Do the actual parsing.
  5. Extract the resulting data.

Commandline Arguments

if __name__ == "__main__":
  [scriptName, feedFileName] = sys.argv
  for rss in generateRsses(feedFileName):
    printFeed(rss)

The only new element here is sys.argv, which contains all of the arguments handed to the Python interpreter. The first is the name of the script itself, so the program ignores it; the second is the name of the file that contains the list of RSS feeds.
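
One caveat: the list unpacking above raises a ValueError if the wrong number of arguments is given. A slightly more defensive variant (my own suggestion, not part of the original script) checks the argument count first and prints a usage message:

if __name__ == "__main__":
  if len(sys.argv) != 2:
    print "Usage: python RssReader.py <feed file>"
    sys.exit(1)
  for rss in generateRsses(sys.argv[1]):
    printFeed(rss)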


Posted by Frank Berthold

Filed under: python, xml, rss