Virtuous Programmer Adventures of an Autodidact

20 Dec 2010

Scala 3/4: XML

This is the third article in a series; to read the whole series, go here.

Writing an RSS Aggregator with Scala

One of the cooler things about Scala is its XML processing and production features. Scala has several powerful mechanisms for creating DSLs (Domain Specific Languages), which are essentially special-purpose languages inside the parent language. The details of how it goes about this are beyond the scope of this article, but I'll show you one way you can enjoy some of the results.

To illustrate how Scala can be used to process and generate XML, I'll be putting together a simple RSS aggregator. It will print to the screen and write an HTML file with the channel titles and article titles.

Files Used in this Project

Libraries

    import scala.xml._
    import java.net.URL

  • scala.xml: Scala's XML library has a vast number of features which I've only begun to sample.
  • java.net.URL: Java's library for URLs and HTTP connections.

I've mentioned elsewhere that Java libraries can be called as though they were Scala libraries, but this is the first time I've actually done it. It's as painless as advertised.

Main Functions, Arrays and Command Line Arguments

    object RssReader {
      def main(args : Array[String]) : Unit = {
        val rssFeeds = combineFeeds(getRssFeeds(args(0)))
        printFeeds(rssFeeds)
        writeFeeds(rssFeeds)
      }

One piece of Scala syntax that can be especially confusing is how a type that contains another type is written, in this case Array[String]. The square brackets look a lot like array indexing in most other languages, but in Scala they hold type parameters. Indexing uses parentheses instead: arrays are treated like functions that take an integer argument and return the value at that position, e.g. args(0) returns the first element of args. This is something of a simplification, but it helps me remember the syntax.
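
To make the distinction concrete, here is a minimal sketch. The object name and the sample values are made up for illustration; the point is only that square brackets carry the element type while parentheses do the indexing, which is shorthand for a call to apply.

```scala
object ArrayIndexingDemo {
  // Square brackets hold the type parameter: an array whose elements are Strings.
  // The values are hypothetical stand-ins for command line arguments.
  val sampleArgs: Array[String] = Array("feeds.txt", "--verbose")

  // Parentheses index into the array...
  val first: String = sampleArgs(0)

  // ...and are shorthand for calling the apply method explicitly.
  val alsoFirst: String = sampleArgs.apply(0)

  def main(args: Array[String]): Unit = {
    println(first)              // prints "feeds.txt"
    println(first == alsoFirst) // prints "true"
  }
}
```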

Anonymous Functions, Maps and Folds

      def getRssFeeds(fileName : String) : List[(String, Seq[String])] = {
        // Given a config file containing a list of RSS feed URLs,
        // returns a list with ('channel name', ['article 1', 'article 2'])
        val baseList : List[(String, Seq[String])] = List()
        getFileLines(fileName).map(extractRss).foldLeft(baseList) {
          (l, r) => l ::: r.toList
        }
      }

Anonymous Functions

The function passed to foldLeft, (l, r) => l ::: r.toList, is an example of an anonymous function in Scala. I'll break it down:

  1. (l, r) =>: The arguments section. If I wanted to be pedantic I could specify their types, but anonymous functions already clutter code and their purpose should usually be pretty obvious.
  2. l ::: r.toList: l is an already existing list, which is prepended to r after r is turned into a list.
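
As a small sketch of the "pedantic" version, here is the same shape of function with its parameter types written out. The object name and the Int sample data are hypothetical; the aggregator itself works on tuples of channel names and article titles.

```scala
object AnonymousFunctionDemo {
  // The same anonymous function shape, with explicit parameter types.
  val append: (List[Int], Seq[Int]) => List[Int] =
    (l: List[Int], r: Seq[Int]) => l ::: r.toList

  // l comes first in the result because ::: prepends l to r.toList.
  val combined: List[Int] = append(List(1, 2), Seq(3, 4))

  def main(args: Array[String]): Unit =
    println(combined) // prints "List(1, 2, 3, 4)"
}
```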

Higher-order functions

Scala's for syntax is extraordinarily powerful, but sometimes it's more than you need. The three most common higher-order functions are:

  • map: Apply a function to each element in a sequence.
  • filter: Use a boolean function to filter the elements of a sequence.
  • fold (generally split into foldLeft and foldRight): Use a two-argument function to combine all of the elements in a sequence into a single value.

The for syntax combines the functionality of map and filter, so it is most useful when you need to do both to a sequence. It does not cover the functionality of fold. Here I'm:

  1. Getting a sequence which contains the URLs of several RSS feeds.
  2. Mapping extractRss over them to get the RSS data.
  3. Folding the anonymous function (l, r) => l ::: r.toList over the results to convert each Seq to a List and concatenate them into one List.
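
The three functions, and how for relates to the first two, can be sketched on some throwaway data. Everything named here (the object, the numbers, the nested sequences) is made up for illustration and is not part of the aggregator.

```scala
object HigherOrderDemo {
  // Hypothetical sample data.
  val lengths: List[Int] = List(3, 8, 5, 12)

  // map: apply a function to each element.
  val doubled: List[Int] = lengths.map(n => n * 2)           // List(6, 16, 10, 24)

  // filter: keep the elements for which the function returns true.
  val large: List[Int] = lengths.filter(n => n > 4)          // List(8, 5, 12)

  // foldLeft: combine all elements into one value, starting from 0.
  val total: Int = lengths.foldLeft(0)((acc, n) => acc + n)  // 28

  // A for expression with a guard does the work of filter followed by map.
  val viaFor: List[Int] = for (n <- lengths if n > 4) yield n * 2
  val viaCalls: List[Int] = lengths.filter(n => n > 4).map(n => n * 2)

  // Folding with ::: concatenates nested sequences, as getRssFeeds does.
  val nested: List[Seq[String]] = List(Seq("a", "b"), Seq("c"))
  val flat: List[String] =
    nested.foldLeft(List.empty[String])((l, r) => l ::: r.toList)

  def main(args: Array[String]): Unit = {
    println(viaFor == viaCalls) // prints "true"
    println(flat)               // prints "List(a, b, c)"
  }
}
```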

Filter and Some Sugar

      def combineFeeds(feeds : List[(String, Seq[String])])
          : List[(String, Seq[String])] = {
        // Cleanup function, given a list of feeds it combines duplicate channels.
        def combinedFeed(feedName : String) : Seq[String] = {
          feeds.filter(_._1 == feedName).map(_._2).flatten
        }

        val feedNames = feeds.map(_._1).distinct

        feedNames.map(x => (x, combinedFeed(x)))
      }

combinedFeed is defined inside combineFeeds, so it can see feeds without having it passed in. It filters feeds, keeping the entries whose channel name matches feedName. The result is passed to map, which extracts the article lists, and flatten then merges those lists into one. Generally you would pass filter and map either a named function with a single argument or an anonymous function such as x => x._1 == feedName. This pattern comes up so often that Scala lets us skip the usual anonymous function syntax and use an underscore as a placeholder for the single argument.
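
Here is a short sketch of the placeholder next to the long form it replaces. The object name and the feed data are invented for the example; the calls mirror the ones used in combineFeeds.

```scala
object PlaceholderDemo {
  // Hypothetical feed data with a duplicated channel name.
  val feeds: List[(String, Seq[String])] = List(
    ("News", Seq("a1", "a2")),
    ("Blog", Seq("b1")),
    ("News", Seq("a3"))
  )

  // Full anonymous function syntax...
  val explicit = feeds.filter(feed => feed._1 == "News")

  // ...and the underscore placeholder, which means the same thing.
  val shorthand = feeds.filter(_._1 == "News")

  // Extract the article lists and merge them into one list.
  val newsArticles: List[String] = shorthand.map(_._2).flatten

  // Channel names with duplicates removed, in first-seen order.
  val names: List[String] = feeds.map(_._1).distinct

  def main(args: Array[String]): Unit = {
    println(newsArticles) // prints "List(a1, a2, a3)"
    println(names)        // prints "List(News, Blog)"
  }
}
```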

Brackets are Optional in Single Liners

      def getFileLines(fileName : String) : Array[String] =
        // Extracts the URLs from a file and breaks them up into individual strings.
        scala.io.Source.fromFile(fileName).mkString.split("\n")

Opening Network Connections and Reading XML

      def extractRss(urlStr : String) : Seq[(String, Seq[String])] = {
        // Given a URL, returns a Seq of RSS data in the form:
        // ("channel", ["article 1", "article 2"])
        val url = new URL(urlStr)
        val conn = url.openConnection
        val xml = XML.load(conn.getInputStream)

        for (channel <- xml \\ "channel")
          yield {
            val channelTitle = (channel \ "title").text
            val titles = extractRssTitles(channel)
            (channelTitle, titles)
          }
      }

Opening Network Connections

For practical purposes, the first three statements of extractRss are straight Java calls. They build a URL, open a connection, and hand the connection's input stream to XML.load, which parses it into the XML library's internal format.

Reading the XML

The for expression is where I start making good on my promise of easy XML processing. xml \\ "channel" descends through the XML tree and extracts every element named "channel"; the for expression then takes each of those elements and treats it as an XML tree in its own right. \\ descends through the entire tree, while \ only examines the nodes immediately under the given root.

      def extractRssTitles(channel : Node) : Seq[String] = {
        // Given a channel node from an RSS feed, returns all of the article names
        for (title <- (channel \\ "item") \\ "title") yield title.text
      }

In extractRssTitles you can see how \\ can be chained to do more complex dives into an XML document.
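
The difference between \ and \\ is easier to see on a small document. This is only a sketch: the feed below is hand-written for illustration, the object name is made up, and it assumes the scala.xml library is on the classpath (in recent Scala versions it ships as the separate scala-xml module).

```scala
import scala.xml._

object XmlQueryDemo {
  // Hypothetical, hand-written feed used only for illustration.
  val feed: Elem =
    <rss version="2.0">
      <channel>
        <title>Example Channel</title>
        <item><title>First Post</title></item>
        <item><title>Second Post</title></item>
      </channel>
    </rss>

  val channel: Node = (feed \\ "channel").head

  // \ looks only at direct children, so it finds just the channel's own title.
  val channelTitle: String = (channel \ "title").text

  // \\ searches every descendant, so it also picks up the item titles.
  val allTitles: Seq[String] = (channel \\ "title").map(_.text)

  // Chaining \\ narrows the search to titles that sit under an item.
  val itemTitles: Seq[String] = ((channel \\ "item") \\ "title").map(_.text)

  def main(args: Array[String]): Unit = {
    println(channelTitle)   // prints "Example Channel"
    println(allTitles.size) // prints "3"
    println(itemTitles.size) // prints "2"
  }
}
```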

Display

To Screen

      def printFeeds(feeds : List[(String, Seq[String])]) : Unit = {
        // Given a list of ("channel", ["article 1", "article 2"]), prints
        //  each channel title followed by its article titles.
        for (feed <- feeds) {
          println("*** " + feed._1 + " ***")
          for (title <- feed._2) println("\t" + title)
        }
      }

printFeeds takes a list of the feed tuples and displays them on screen. There are a couple of ways to extract data from a tuple. If you're dealing with a complex set of structures, you can use match/case. I knew I'd always be dealing with a 2-tuple, so I chose to use the _N accessors, where _1 gives you the first value in the tuple, _2 the second, and so on.
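
For comparison, here is a sketch of both ways of taking a tuple apart. The object name, the describe helper and the sample feed are hypothetical and exist only for this example.

```scala
object TupleDemo {
  // Hypothetical feed tuple in the same shape the aggregator uses.
  val feed: (String, Seq[String]) =
    ("Example Channel", Seq("First Post", "Second Post"))

  // Positional accessors, as used in printFeeds.
  val nameByAccessor: String = feed._1

  // A pattern on the left of a val pulls both parts out at once.
  val (name, articles) = feed

  // match/case is handy when the shape of the data decides what to do.
  def describe(f: (String, Seq[String])): String = f match {
    case (title, Seq()) => title + " (empty)"
    case (title, items) => title + " (" + items.size + " articles)"
  }

  def main(args: Array[String]): Unit = {
    println(name == nameByAccessor) // prints "true"
    println(describe(feed))         // prints "Example Channel (2 articles)"
  }
}
```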

To HTML

      def writeFeeds(feeds : List[(String, Seq[String])]) : Unit = {
        // Given a list of ("channel", ["article 1", "article 2"]), generates
        //  and writes an HTML document listing the articles with channel names
        //  as headers.
        val html =
          <html><head><title>Rss Feeds</title></head><body>
            {for (feed <- feeds) yield
              <h2>{feed._1}</h2>
              <ul>
                {for (title <- feed._2) yield <li>{title}</li>}
              </ul>}
            </body> </html>

        // XML.save takes the same arguments as the deprecated saveFull.
        XML.save("rss.html", html, "UTF-8", true, null)
      }
    }

When you produce XML in most languages, you have to choose between hard-to-read code that produces guaranteed well-formed XML and building your XML out of strings, which is easy to read but gives you no assurance that the result is well formed.

This is what has had me excessively excited about Scala for the past week. It seems that the majority of projects I get involved in sooner or later involve doing some kind of work with XML, and it's usually one of the ugliest parts of the job. Here you can see I have XML freely interspersed with the rest of my code, and it was by far the easiest part of what I wrote this week. At any time inside the XML, if I need to break back out into Scala, I just surround the expression I want to use in curly braces. With it I've been able to clearly produce XML that I can be confident is well formed.
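
A stripped-down sketch of the same idea: an XML literal with a Scala expression embedded in curly braces. The object name and the titles are invented for the example, and it assumes scala.xml is available on the classpath. The second title contains an ampersand to show that embedded text is escaped when the node is serialized.

```scala
import scala.xml._

object XmlLiteralDemo {
  // Hypothetical article titles; the second one contains a character
  // that must be escaped in XML.
  val titles: Seq[String] = Seq("First Post", "Tips & Tricks")

  // Curly braces switch from XML back into Scala.
  val list: Elem =
    <ul>
      {for (title <- titles) yield <li>{title}</li>}
    </ul>

  // The text comes back out unchanged...
  val texts: Seq[String] = (list \ "li").map(_.text)

  // ...while the serialized form has the ampersand escaped.
  val rendered: Seq[String] = (list \ "li").map(_.toString)

  def main(args: Array[String]): Unit = {
    texts.foreach(println)
    rendered.foreach(println)
  }
}
```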

My Experience

Arrays

An unpleasant surprise I got while I was working on this program was that, unlike other sequence types, Arrays in Scala did not appear to have a simple way to convert to Lists.
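
For reference, in Scala 2.8 and later an Array does expose toList through the standard collection operations. A minimal sketch, with a made-up object name and made-up URLs:

```scala
object ArrayToListDemo {
  // split returns an Array[String]; the URLs are hypothetical.
  val lines: Array[String] =
    "http://a.example/rss\nhttp://b.example/rss".split("\n")

  // toList copies the array's elements into an immutable List.
  val asList: List[String] = lines.toList

  def main(args: Array[String]): Unit =
    println(asList.size) // prints "2"
}
```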

Functional Programming

Reading over my own code, I realize that functional style has become a habit. It'll be interesting to see how well I cope when I try to pick up a language where functional style isn't a realistic option. To anyone reading: if there are places in my Scala code where a more object-oriented style would be clearer, please let me know.

XML

One more time. The XML processing capabilities of this language are beautiful. Scala is going to end up being part of my standard arsenal for this reason alone.

Posted by Frank Berthold

Filed under: xml, rss, scala