Prolog 3/4: XML Parsing

Reading, Parsing and Displaying RSS Data in Prolog

This is the third article in a series; to read it from the beginning, go here.

The Plan

This week is when I really put the assertion that Prolog is a general-purpose language to the test. Most sensible Prolog programmers will tell you to handle things like XML parsing and display in another language, then call into Prolog DLLs for the interesting logical work. But Prolog does have the libraries, so let’s see how they work.

What to Expect

When this article is complete:

  • You will have:
    • A command-line RSS aggregator.
  • You will know:
    • How to read a text file.
    • How to read from an HTTP stream.
    • How to parse an XML file.
    • How to pretty print text to the screen.

Files Used in this Project

Compiling

To compile “xml.pl”, navigate to the directory that contains it on the command line, then execute:

swipl -g main -o xml.exe -c xml.pl

The Code

Libraries Used

:- consult(library(sgml)).
:- consult(library(http/http_open)).
  • sgml: The SGML library, which also handles XML.
  • http/http_open: The HTTP library handles both server and client code; I just used the small bit that lets me treat an HTTP stream like a file.

Reading a File

readConfig(Lines) :-
  open('feeds.txt', read, ConfigStream),
  read_stream(ConfigStream, Lines).

read_stream(ReadStream, Lines) :-
  read_line_to_codes(ReadStream, Line),
  read_stream(ReadStream, Line, Lines).
read_stream(ReadStream, end_of_file, []) :- close(ReadStream), !.
read_stream(ReadStream, OldLine, [Atom | Lines]) :-
  read_line_to_codes(ReadStream, NewLine),
  string_to_atom(OldLine, Atom),
  read_stream(ReadStream, NewLine, Lines).

read_stream is a helper predicate that reads multiple lines from a stream. It pulls the stream in one line at a time and unifies its contents with Lines. Most of it is similar to what you saw in the previous article: recursively pulling off one line at a time, testing whether it’s the last line, and adding it to the list. Two elements are worth noting: the atom end_of_file, which is what read_line_to_codes returns when the stream is exhausted, and the cut (!), which says that once execution reaches the clause that closes the stream there’s no need to backtrack.

readConfig‘s functionality is pretty obvious. It opens a file named by an atom (remember, single quotes are for atoms, double quotes are for strings), which yields a read-only stream unified with ConfigStream; read_stream then pulls the text in the file out into Lines.
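For example, assuming a feeds.txt that contains the two placeholder URLs http://example.com/feed1.rss and http://example.com/feed2.rss, a query at the top level would look something like this:

?- readConfig(Lines).
Lines = ['http://example.com/feed1.rss', 'http://example.com/feed2.rss'].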

Reading From HTTP

rssFetch(URL, XML) :-
  http_open(URL, XmlStream, []),
  load_xml_file(XmlStream, XML),
  close(XmlStream).

The pattern here is similar to the one for readConfig, except that it extracts the XmlStream straight into a Prolog XML structure. There’s quite a bit to that structure, but the short version is that it’s a nested list. It holds a variety of functors, but the one that interests us is element, which has the form element(<element name>, <attributes>, <sub elements>). For example, given this XML:


<testOuter>
  <inside inAtt="hello">in text</inside>
</testOuter>

load_xml_file returns:

[element(testOuter, [], 
    ['\n  ', element(inside, [inAtt=hello], ['in text']), '\n'])].

Parsing XML

rssRead([], []).
rssRead([element(channel, _, Elements) | _], [Rss | Rsses]) :-
  channelRead(Elements, Rss), rssRead(Elements, Rsses).
rssRead([element(rss, _, Elements) | _], Rss) :-
  rssRead(Elements, Rss).
rssRead([_ | Elements], Rss) :-
  rssRead(Elements, Rss).

channelRead(Elements, channel(Name, Titles)) :-
  titleRead(Elements, Name), itemsRead(Elements, Titles).

itemsRead([], []).
itemsRead([element(item, _, Elements) | Items], [ Title | Titles ]) :-
  titleRead(Elements, Title), itemsRead(Items, Titles).
itemsRead([_ | Items], Titles) :-
  itemsRead(Items, Titles).

titleRead([element(title, _, [Text | _]) | _], Text) :- !.
titleRead([_ | Elements], Text) :-
  titleRead(Elements, Text).

Strictly speaking, Prolog is a weakly typed language: you have only a couple of basic types, which all reduce to atoms and compound terms built from functors. But with those you can do the same tricks that are generally associated with algebraic type systems. Here I use the element functor to recurse through the XML structure and extract the titles.

Once you know how Prolog’s XML structures work, it’s just a matter of figuring out what information you need and recursing until you get it. For this purpose I’ve created my own functor channel(<channel name>, <title list>).

In Prolog functors and atoms are created simply by using them. This makes writing the code feel a lot like freeform sketching a picture. I like it. But I’m pretty sure that unless I made an effort to document my code well, it would make large projects unwieldy.
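As a small illustration of that freeform style, any other predicate can pattern-match on the channel/2 terms directly; channelName/2 below is just a throwaway example, not part of the project code:

channelName(channel(Name, _), Name).

?- channelName(channel('My Feed', ['Post 1', 'Post 2']), Name).
Name = 'My Feed'.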

Pretty Printing Text

displayRss(Channels) :-
  foreach(member(channel(Name, Titles), Channels),
      (writef("*** %w ***\n", [Name]), displayTitles(Titles), nl)).

displayTitles(Titles) :-
  foreach(member(Title, Titles), writef("\t%w\n", [Title])).

writef works similarly to printf in nearly every language I’ve worked in. It has a few quirks that are detailed in its docs.
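For instance, %w writes any Prolog term, so the same kind of call that displayRss makes can be tried directly at the top level:

?- writef("*** %w ***\n", ['Hacker News']).
*** Hacker News ***
true.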

Main and Helper Predicates

readAndDisplayRss(URL) :-
  catch((rssFetch(URL, XML), 
        rssRead(XML, RSS), 
        displayRss(RSS)), _, readAndDisplayRss(URL)).

main :-
  readConfig(FeedURLs),
  foreach(member(URL, FeedURLs), readAndDisplayRss(URL)),
  halt.  

The above is glue code. It’s been written to continue retrying until it succeeds or runs out of stack, so make sure your URLs are valid.

Final Summary

When I was reading up on Prolog, I heard predicates described as a generalization of functions. This week has driven that one home. Working in Prolog has felt a great deal like when I was first learning Haskell: I’ll spend hours fighting with something I could do in my sleep in another language, then when I figure it out the result is extraordinarily clear, obvious, and feels cleaner than anything else I’ve worked with.

Hard Lessons Learned

foreach is not the same as a for comprehension/list comprehension, no matter how much it looks like one. The key difference is that the second argument does not get its own variable scope: whatever happens in there happens in the same scope as the clause that contains the foreach. For example:

main :-
  foreach(member(Number, [1, 2, 3, 4]), 
      (Square is Number * Number,
       write(Square), nl)).

This does not print the squares of the numbers 1 through 4. It first asserts that Square is equal to 1 * 1, then that it is also equal to 2 * 2, 3 * 3, and 4 * 4, which is clearly false and ends execution of the foreach. It is possible to do this; the safe and idiomatic way is to break the second term out into its own predicate:

showSquare(Number) :-
  Square is Number * Number,
  write(Square), nl.

main :-
  foreach(member(Number, [1, 2, 3, 4]), showSquare(Number)).

That lesson took me half a day to learn. You’re welcome.

Coming Up

Next week I’ll be working out how to use XPCE, SWI-Prolog’s native GUI library.

Resources

Scala 3/4: XML

This is the third article in a series; to read the whole series, go here.

Writing an RSS Aggregator with Scala

One of the cooler things about Scala is its XML processing and production features. Scala has several powerful mechanisms for creating DSLs (Domain Specific Languages), which are essentially special-purpose languages embedded in the parent language. The details of how it does this are beyond the scope of this article, but I’ll show you one way you can enjoy some of the results.

To illustrate how Scala can be used to process and generate XML, I’ll be putting together a simple RSS aggregator. It will print to the screen and write an html file with the channel titles and article titles.

Files Used in this Project

Libraries

import scala.xml._
import java.net.URL
  • scala.xml: Scala’s XML library has a vast number of features which I’ve only begun to sample.
  • java.net.URL: Java’s library for URLs and HTTP connections.

I’ve mentioned elsewhere that Java libraries can be called as though they were Scala libraries; this is the first time I’ve actually done it. It’s as painless as advertised.

Main Functions, Arrays and Command Line Arguments

object RssReader {
  def main(args : Array[String]) : Unit = { 
    val rssFeeds = combineFeeds(getRssFeeds(args(0)))
    printFeeds(rssFeeds)
    writeFeeds(rssFeeds)
  }

One piece of syntax that is especially confusing in Scala is how it indicates that one type is parameterized by another, in this case Array[String]. The square brackets look a lot like array indexing in most other languages, but in Scala indexing uses parentheses: arrays are treated as functions that take an integer argument and return the value at that position, e.g. args(0) returns the first element of args. This is something of a simplification, but it helps me remember which syntax is which.
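A quick sketch of both points, the type parameter in square brackets and indexing with parentheses (which is just sugar for a call to the array’s apply method):

val words : Array[String] = Array("foo", "bar", "baz")

println(words(0))        // prints "foo" -- parentheses, not square brackets
println(words.apply(0))  // the same call written out explicitly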

Anonymous Functions, Maps and Folds

  def getRssFeeds(fileName : String) : List[(String, Seq[String])] = {
    // Given a config file containing a list of rss feed URL's,
    // returns a list with ('channel name', ['article 1', 'article 2'])
    var baseList : List[(String, Seq[String])] = List()
    getFileLines(fileName).map(extractRss).foldLeft(baseList)
        { (l, r) => l ::: r.toList }
  }

Anonymous Functions

The function passed to foldLeft, (l, r) => l ::: r.toList, is an anonymous function in Scala. Breaking it down:

  1. (l, r) =>: The argument section. If I wanted to be pedantic I could specify the argument types, but anonymous functions already clutter code and their purpose should usually be pretty obvious.
  2. l ::: r.toList: l is an existing list, which is prepended to r after r is converted to a list.
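For reference, the same anonymous function with its argument types spelled out would look roughly like this; the extra annotation buys nothing here, which is why it’s left off:

(l : List[(String, Seq[String])], r : Seq[(String, Seq[String])]) => l ::: r.toList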

Higher-order functions

Scala’s for syntax is extraordinarily powerful, but sometimes it’s more than you need. The three most common higher-order functions are:

  • map: Apply a function to each element in a sequence.
  • filter: Use a boolean function to filter the elements of a sequence.
  • fold (generally split into foldLeft and foldRight): Use a two-argument function to combine all of the elements in a sequence into a single value.

The for syntax combines the functionality of map and filter, so it is most useful when you need to do both to a sequence. It does not cover the functionality of fold. Here I’m:

  1. Getting a sequence containing the URLs of several RSS feeds.
  2. Mapping extractRss over them to get the RSS data.
  3. Folding the anonymous function (l, r) => l ::: r.toList over the results to convert them from immutable Seq values to Lists and combine them.
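Here’s a tiny, self-contained illustration of all three higher-order functions on a plain list of numbers:

val numbers = List(1, 2, 3, 4, 5)

val doubled = numbers.map(n => n * 2)                   // List(2, 4, 6, 8, 10)
val evens   = numbers.filter(n => n % 2 == 0)           // List(2, 4)
val total   = numbers.foldLeft(0)((acc, n) => acc + n)  // 15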

Filter and Some Sugar

  def combineFeeds(feeds : List[(String, Seq[String])]) 
      : List[(String, Seq[String])] = {
     // Cleanup function, given a list of feeds it combines duplicate channels.
    def combinedFeed(feedName : String) : Seq[String] = {
      feeds.filter(_._1 == feedName).map(_._2).flatten
    }

    val feedNames = feeds.map(_._1).distinct

    feedNames.map(x => (x, combinedFeed(x)))
  }

The nested combinedFeed function makes use of the fact that it is in combineFeeds‘s scope and doesn’t need to have feeds passed to it. It filters feeds, keeping the entries with the given name, then passes the result to map, which extracts the article lists, and flatten combines those lists into one. Generally you would pass filter and map either a named single-argument function or an anonymous function such as x => x._1 == feedName. This pattern comes up so often that Scala lets us drop the usual anonymous-function syntax and use the underscore as a placeholder for the single argument.
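For example, these two filter calls are equivalent; the underscore version is just shorthand for the explicit anonymous function:

feeds.filter(_._1 == feedName)
feeds.filter(x => x._1 == feedName)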

Brackets are Optional in Single Liners

  def getFileLines(fileName : String) : Array[String] = 
    // Extracts the URL's from a file and breaks them up into individual strings.
    scala.io.Source.fromFile(fileName).mkString.split("\n") 

Opening Network Connections and Reading XML

  def extractRss(urlStr : String) : Seq[(String, Seq[String])] = {
    // Given a URL, returns a Seq of RSS data in the form:
    // ("channel", ["article 1", "article 2"])
    val url = new URL(urlStr)
    val conn = url.openConnection
    val xml = XML.load(conn.getInputStream)

    for (channel <- xml \\ "channel") 
      yield {
        val channelTitle = (channel \ "title").text
        val titles = extractRssTitles(channel)
        (channelTitle, titles)
    }
  }

Opening Network Connections

For practical purposes, the lines that build the URL and open the connection are straight Java; XML.load then reads the resulting stream into the XML library's internal format.

Reading the XML

The xml \\ "channel" expression is where I start making good on my promise of easy XML processing. It descends through the XML tree and extracts every element named "channel"; the for expression then takes each of those elements and treats it as an XML tree in its own right. The \\ operator descends through the entire tree, while \ only examines the nodes immediately under the given root.
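To make the difference concrete, here is the behaviour of the two operators on a small, made-up XML literal:

val doc = <feed><channel><title>Outer</title></channel></feed>

(doc \ "title").text    // "" -- title is not an immediate child of the root
(doc \\ "title").text   // "Outer" -- \\ searches the whole tree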

  def extractRssTitles(channel : Node) : Seq[String] = {
    // Given a channel node from an RSS feed, returns all of the article names
    for (title <- (channel \\ "item") \\ "title") yield title.text
  }

In extractRssTitles you can see how \\ can be nested to do more complex dives into an XML document.

Display

To Screen

  def printFeeds(feeds : List[(String, Seq[String])]) : List[Seq[Unit]] = { 
    // Given a list of ("channel", ["article 1", "article 2"]), prints
    //  them each to screen.
    for (feed <- feeds) yield {
          println("*** " + feed._1 + " ***")
          for (title <- feed._2) yield println("\t" + title)
    }
  }

printFeeds takes a list of the feed objects and displays them on screen. There are a couple of ways to extract data from a tuple. If you're dealing with a complex set of structures, you can use match/case. I knew I'd always be dealing with a 2-tuple, so I chose to use the ._N accessors, where _1 gives you the first value in the tuple, _2 the second, and so on.
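Both approaches, side by side, on a 2-tuple shaped like the feed data (the values here are made up):

val feed = ("My Channel", Seq("Article 1", "Article 2"))

println(feed._1)   // prints "My Channel"

feed match {       // the match/case alternative
  case (name, titles) => println(name + ": " + titles.mkString(", "))
}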

To HTML

  def writeFeeds(feeds : List[(String, Seq[String])]) = {
    // Given a list of ("channel", ["article 1", "article 2"]), generates
    //  and writes an HTML document listing the articles with channel names
    //  as headers.
    val html =
      <html>
        <body>
          {for (feed <- feeds) yield
            <div>
              <h2>{feed._1}</h2>
              <ul>
                {for (title <- feed._2) yield <li>{title}</li>}
              </ul>
            </div>}
        </body>
      </html>

    XML.saveFull("rss.html", html, "UTF-8", true, null)
  }
}

When you produce XML in most languages you have to choose between hard-to-read code that produces guaranteed well-formed XML, or generating your XML from strings that are easy to read but can't be guaranteed to be well formed.

This is what has had me excessively excited about Scala for the past week. It seems that the majority of projects I get involved in sooner or later involve doing some kind of work with XML, and it's usually one of the ugliest parts of the job. Here you can see I have XML freely interspersed with the rest of my code, and it was by far the easiest part of what I wrote this week. At any point inside the XML, if I need to break back out into Scala I just surround the expression I want to use in curly braces. With it I've been able to produce readable XML that I can be confident is well formed.

My Experience

Arrays

An unpleasant surprise I got while I was working on this program was that Arrays don't behave quite like the other sequence types; I had to convert them explicitly (with toList, for example) before combining them with Lists.

Functional Programming

Reading over my own code, I realize that functional style has become a habit. It'll be interesting to see how well I cope when I try to pick up a language where functional style isn't a realistic option. To anyone reading, if there are places in my Scala code where it would be more clear if I used a more object oriented style, please let me know.

XML

One more time. The XML processing capabilities of this language are beautiful. Scala is going to end up being part of my standard arsenal for this reason alone.

Resources:

Python 3/4: Context Managers, XML and Networking

This is the third article in a series; to read the entire series, go here.

Writing an RSS Aggregator with Python

For this article I’ll be putting together a simple RSS aggregator. This will shed some light on Python’s XML and web libraries and introduce a way to minimize boilerplate code with context managers.

The final product will read through a file containing a list of RSS feeds and print the names of the channels and the articles they contain to screen.

To use it, type: python RssReader.py feeds.txt

Files Used in this Project

  • RssReader.txt: The source code for this project; rename it to “RssReader.py” after downloading.
  • feeds.txt: A sample input file.

Libraries

import sys
import urllib2
import xml.sax
import xml.sax.handler
  • sys: A library for accessing information passed into or maintained by Python’s interpreter.
  • urllib2: A library for manipulating URLs and opening HTTP streams.
  • xml.sax: Read from and write to XML files using SAX.
    • Tutorial: A good overview of parsing XML using the SAX library.

Parsing XML

The RSS feed parser extracts the names of the channels and the titles of each of the articles.

The handler below gets most of its functionality from its parent class. Four methods are overridden: startDocument, startElement, endElement and characters; each fires as the corresponding part of an XML document is read. Because the methods themselves are generic enough to apply to any XML document, you need a fair number of if/else statements inside them to deal with your specific kind of document.

class RssHandler(xml.sax.handler.ContentHandler):
  def startDocument(self):
    self.inItem = False
    self.inTitle = False
    self.channels = {}
    self.currentChannel = None
    self.str = ""

  def startElement(self, name, attrs):
    lname = name.lower()

    if lname == "item":
      self.inItem = True
    elif lname == "title":
      self.inTitle = True

  def endElement(self, name):
    lname = name.lower()

    if lname == "item":
      self.inItem = False
    elif lname == "title":
      self.inTitle = False
      if self.inItem:
        self.channels[self.currentChannel] += [self.str]
      else:
        self.currentChannel = self.str
        if self.currentChannel not in self.channels.keys():
          self.channels[self.currentChannel] = []
      self.str = ""

  def characters(self, content):
    if self.inTitle:
      self.str += content

This class reads through an XML document and accumulates a relational list of the form: {"Channel 1":["Article 1", "Article 2"], "Channel 2":["Article 3"]}
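As a quick sanity check, the handler can be exercised on an inline document with xml.sax.parseString; the snippet below uses a made-up two-article feed:

sampleRss = """<rss><channel>
  <title>Example Channel</title>
  <item><title>Article 1</title></item>
  <item><title>Article 2</title></item>
</channel></rss>"""

handler = RssHandler()
xml.sax.parseString(sampleRss, handler)
print handler.channels
# prints the accumulated dict, e.g. {u'Example Channel': [u'Article 1', u'Article 2']}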

Context Managers and Downloading Webpages

Context Managers

Files, network connections and other sources of streamed data generally require some boilerplate to open and close their streams. This both adds uninteresting noise to your code and can cause unexpected bugs when you open a stream but fail to close it. Python deals with this through context managers.

Without context managers, printing the contents of a file to the screen looks like this:

file = open("infile.txt", "r")
print file.read()
file.close()

Though this is clear and obvious in a small example, in real code something can go wrong between opening and closing the file, leaving it open. The standard fix is to wrap the work in a try/finally to deal with any exceptions thrown in the process:

file = open("infile.txt", "r")
try:
  print file.read()
finally:
  file.close()

The finally clause makes certain that, regardless of what happens in the middle, the file gets closed. This is effective, but it’s a fair amount of boilerplate for a simple action. Context managers let us write this instead:

with open("infile.txt", "r") as file:
  print file.read()

The open function returns a file object that is itself a context manager. Any object can become a context manager by adding __enter__ and __exit__ methods for setup and cleanup; many of Python’s standard stream objects already have them.

Downloading Webpages

Unfortunately the urlopen function from urllib2 does not come with a context manager, so I had to write one myself. Fortunately, it’s easy.

class Url():
  def __init__(self, url):
    self.url = url

  def __enter__(self):
    self.stream = urllib2.urlopen(self.url)
    return self.stream

  def __exit__(self, type, value, traceback):
    self.stream.close()
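As an aside, the standard library’s contextlib.closing wrapper achieves much the same thing for any object with a close method, so a hand-rolled class isn’t strictly necessary; a minimal sketch (the URL is a placeholder):

from contextlib import closing
import urllib2

# closing() turns any object with a close() method into a context manager
with closing(urllib2.urlopen("http://example.com/rss.xml")) as stream:
  print stream.read()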

Tying it All Together

Using the XML Parser

def generateRsses(feedFile):
  with open(feedFile, "r") as file:
    urls = [url.strip() for url in file.readlines()]

  for url in urls:
    with Url(url) as rss:
      handler = RssHandler()
      parser = xml.sax.make_parser()
      parser.setContentHandler(handler)
      parser.parse(rss)
      yield handler.channels

def printFeed(rss):
  for channelName in rss.keys():
    print "*** " + channelName + " ***"
    for title in rss[channelName]:
      print "\t" + title

The most complex part of this is the actual parsing of the XML; I’ll walk you through it line by line.

  1. Create the RssHandler (the class created above). This defines how the parser will work.
  2. make_parser creates an object that does the XML heavy lifting.
  3. Attach the handler to the parser.
  4. Do the actual parsing.
  5. Extract the resulting data.

Commandline Arguments

if __name__ == "__main__":
  [scriptName,feedFileName] = sys.argv
  for rss in generateRsses(feedFileName):
      printFeed(rss)

The only new element here is sys.argv, which contains all of the arguments handed to the Python interpreter. The first is the name of the script itself, so the program ignores it; the second is the name of the file that contains the list of RSS feeds.

Resources: