Virtuous Programmer Adventures of an Autodidact

17Jan/110

Prolog 3/4: XML Parsing

Reading, Parsing and Displaying RSS Data in Prolog

This is the third article in a series, to read it from the beginning go here.

The Plan

This week is when I really put the assertion that Prolog is a general purpose language to the test. Most decently rational Prolog programmers will tell you to do things like XML parsing and display through another language, then use Prolog dll's to do the interesting logical stuff. But, it does have the libraries, so let's see how they work.

What to Expect

When this article is complete:

  • You will have:
    • A command-line RSS aggregator.
  • You will know:
    • How to read a text file.
    • How to read from an HTTP stream.
    • How to parse an XML file.
    • How to pretty print text to the screen.

Files Used in this Project

Compiling

To compile "xml.pl", navigate to the directory that contains it on the commandline, then execute:

  1. swipl -g main -o xml.exe -c xml.pl
  2.  

The Code

Libraries Used

  1. :- consult(library(sgml)).
  2. :- consult(library(http/http_open)).
  3.  
  • sgml: The SGML library, which also handles XML.
  • http/http_open: The HTTP library handles server and client code, I just used the small bit that let's me treat an HTTP stream like a file.

Reading a File

  1. readConfig(Lines) :-
  2.   open('feeds.txt', read, ConfigStream),
  3.   read_stream(ConfigStream, Lines).
  4.  
  5. read_stream(ReadStream, Lines) :-
  6.   read_line_to_codes(ReadStream, Line),
  7.   read_stream(ReadStream, Line, Lines).
  8. read_stream(ReadStream, end_of_file, []) :- close(ReadStream), !.
  9. read_stream(ReadStream, OldLine, [Atom | Lines]) :-
  10.   read_line_to_codes(ReadStream, NewLine),
  11.   string_to_atom(OldLine, Atom),
  12.   read_stream(ReadStream, NewLine, Lines).
  13.  

read_stream is a helper function to read multiple lines from a stream. It reads a stream in one line a time and returns/unifies its contents to Lines. Most of it is similar to what you saw in the previous article, recursively pulling off one line at a time, testing if it's the last line and adding it to the list. The two key elements that are worth noting are: the use of the atom end_of_file, which is what the file stream returns when the stream is empty and the use of ! to indicate that if execution makes it to closing the stream, then there's no need to backtrack.

readConfig's functionality is pretty obvious. Open a file using an atom with the file's name. Remember, single-quotes are for atoms, double quotes are for strings. This opens a read only stream and unifies it with ConfigStream, then read_stream pulls the text in the file out into Lines.

Reading From HTTP

  1. rssFetch(URL, XML) :-
  2.   http_open(URL, XmlStream, []),
  3.   load_xml_file(XmlStream, XML),
  4.   close(XmlStream).
  5.  

The pattern here is similar to the one for readConfig, except that it extracts the XmlStream straight to a Prolog XML structure. There's quite a bit to it, but the short version is, it's a nested linked list. There are a variety of functors in that list, but the one that interests us is element which has the form element(<element name>, <attributes>, <sub elements>). For example when given the xml:

  1. <testOuter>
  2.   <inside inAtt="hello">in text</inside>
  3. </testOuter>
  4.  

load_xml_file returns:

  1. [element(testOuter, [],
  2.     ['\n  ', element(inside, [inAtt=hello], ['in text']), '\n'])].
  3.  

Parsing XML

  1. rssRead([], []).
  2. rssRead([element(channel, _, Elements) | _], [Rss | Rsses]) :-
  3.   channelRead(Elements, Rss), rssRead(Elements, Rsses).
  4. rssRead([element(rss, _, Elements) | _], Rss) :-
  5.   rssRead(Elements, Rss).
  6. rssRead([_ | Elements], Rss) :-
  7.   rssRead(Elements, Rss).
  8.  
  9. channelRead(Elements, channel(Name, Titles)) :-
  10.   titleRead(Elements, Name), itemsRead(Elements, Titles).
  11.  
  12. itemsRead([], []).
  13. itemsRead([element(item, _, Elements) | Items], [ Title | Titles ]) :-
  14.   titleRead(Elements, Title), itemsRead(Items, Titles).
  15. itemsRead([_ | Items], Titles) :-
  16.   itemsRead(Items, Titles).
  17.  
  18. titleRead([element(title, _, [Text | _]) | _], Text) :- !.
  19. titleRead([_ | Elements], Text) :-
  20.   titleRead(Elements, Text).
  21.  

Strictly speaking Prolog is a weakly typed language, you have only a couple of basic types which all reduce to atoms and functors. But with those you can do the same tricks that are generally associated to algebraic type systems. Here is one example where I use the element functor to recurse through the XML structure and extract the titles.

Once you know how Prolog's XML structures work, it's just a matter of figuring out what information you need and recursing until you get it. For this purpose I've created my own functor channel(<channel name>, <title list>).

In Prolog functors and atoms are created simply by using them. This makes writing the code feel a lot like freeform sketching a picture. I like it. But I'm pretty sure that unless I made an effort to document my code well, it would make large projects unwieldy.

Pretty Printing Text

  1. displayRss(Channels) :-
  2.   foreach(member(channel(Name, Titles), Channels),
  3.       (writef("*** %w ***\n", [Name]), displayTitles(Titles), nl)).
  4.  
  5. displayTitles(Titles) :-
  6.   foreach(member(Title, Titles), writef("\t%w\n", [Title])).
  7.  

writef works similarly to printf in nearly every language I've worked in. It has a few quirks that are detailed in its docs.

Main And Helper Text

  1. readAndDisplayRss(URL) :-
  2.   catch((rssFetch(URL, XML),
  3.         rssRead(XML, RSS),
  4.         displayRss(RSS)), _, readAndDisplayRss(URL)).
  5.  
  6. main :-
  7.   readConfig(FeedURLs),
  8.   foreach(member(URL, FeedURLs), readAndDisplayRss(URL)),
  9.   halt.  
  10.  

The above is glue code. It's been written to continue retrying until it succeeds or runs out of stack, so make sure your URL's are valid.

Final Summary

When I was reading up on Prolog, I heard predicates described as a generalization on functions. This week has driven that one home. Working in Prolog has felt a great deal like when I was first learning Haskell. I'll spend hours fighting with something that I could do in my sleep in another language, then when I figure it out it is extraordinarily clear, obvious and feels cleaner than anything else I've worked with.

Hard Lessons Learned

foreach is not the same as a for comprehension/list comprehension, no matter how much it looks like one. The key difference is that the second term in the statement does not contain its own namespace. What happens in there happens throughout the same namespace that contains the foreach. For example:

  1. main :-
  2.   foreach(member(Number, [1, 2, 3, 4]),
  3.       (Square is Number * Number,
  4.        write(Square), nl)).
  5.  

Does not print the square of the numbers 1 through 4. It first asserts that Square is equal to 1 * 1, then that it is also equal to 2 * 2, 3 * 3, and 4 * 4, which is clearly false and ends execution of the foreach. It is possible to do this, the safe and idiomatic way is to break the second term out into its own predicate:

  1. showSquare(Number) :-
  2.   Square is Number * Number,
  3.   write(Square), nl.
  4.  
  5. main :-
  6.   foreach(member(Number, [1, 2, 3, 4]), showSquare(Number)).
  7.  

That lesson took me half a day for me to learn. You're welcome.

Coming Up

Next week I'll be working out how to use XPCE, Prolog's native GUI library.

Resources

Posted by Frank Berthold

Filed under: xml, rss, prolog Leave a comment
Comments (0) Trackbacks (0)

No comments yet.


Leave a comment

No trackbacks yet.