Showing posts with label xml. Show all posts

Tuesday, 9 December 2008

Scala: XML serializer adds closing tags to empty elements

When printing Scala XML nodes/elements, closing tags for empty elements are added, even if there weren't any in the input.

For example, if you input <childless/>, the XML processor will add a closing tag like this:

scala> val elem = <childless/>
elem: scala.xml.Elem = <childless></childless>

(The two versions of the XML element are equivalent, but sometimes it is practical to be able to do a simple string comparison of the input and output XML files. The added closing tags may make this harder.)
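If the output form matters for such comparisons, it may be possible to choose it explicitly. The following is a minimal sketch, assuming a later scala.xml release than the one used above, one that provides scala.xml.MinimizeMode and the minimizeTags parameter of scala.xml.Utility.serialize, and assuming the scala-xml module is on the class path:

```scala
import scala.xml.{MinimizeMode, Utility}

val elem = <childless/>

// Always collapse childless elements into a self-closing tag.
val minimized = Utility.serialize(elem, minimizeTags = MinimizeMode.Always).toString

// Never collapse: always write an explicit closing tag.
val expanded = Utility.serialize(elem, minimizeTags = MinimizeMode.Never).toString

println(minimized) // <childless/>
println(expanded)  // <childless></childless>
```

Which of the two forms plain printing produces depends on the library version, so being explicit is probably the safer choice when the output is compared as a string.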


See this thread.

Friday, 5 December 2008

Scala: Problems using the XML API

I've encountered some problems using the Scala XML API. The first one had to do with scala.xml.XML.loadFile throwing away comment nodes of the input XML file.

A helpful person on the scala-user list suggested using scala.xml.parsing.ConstructingParser.fromFile instead. This worked nicely, keeping the comment nodes of the input file intact. However, this approach did not work well on larger XML files, where it resulted in out-of-memory errors.
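As a rough illustration of the ConstructingParser approach, here is a small sketch that parses an in-memory string instead of a file (ConstructingParser.fromSource is the counterpart of fromFile for a scala.io.Source). The sample document is made up for the example:

```scala
import scala.io.Source
import scala.xml.Comment
import scala.xml.parsing.ConstructingParser

// Hypothetical sample input containing one comment node.
val input = "<root><!-- keep me --><item/></root>"

// The second argument asks the parser to preserve whitespace.
val parser = ConstructingParser.fromSource(Source.fromString(input), true)
val root = parser.document().docElem

// Collect the text of every comment that is a direct child of the root.
val comments = root.child.collect { case c: Comment => c.commentText }

println(comments.mkString("|")) // " keep me " (without the quotes)
```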

Finally, I got yet another helpful answer on the scala-user list, this time in the form of some code that translates Java XML nodes into their Scala equivalents.

If you get into the same trouble as I did, you may want to take a look at this code snippet posted on the scala-user list by David Pollak. (You might have to change the code a bit to suit your needs, though.)
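To give an idea of what such a translation can look like, here is a minimal sketch of my own. It is not the snippet from the list, and the helper name toScala is made up. It assumes that namespaces can be ignored and that only elements, attributes, text, CDATA and comments need to survive; everything else is dropped.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Node => JNode}
import org.xml.sax.InputSource
import scala.xml.{Comment, Elem, MetaData, Node, Null, Text, TopScope, UnprefixedAttribute}

// Hypothetical helper: translate a Java DOM node into scala.xml nodes.
// Namespaces are ignored; unsupported node types yield nothing.
def toScala(n: JNode): Seq[Node] = n.getNodeType match {
  case JNode.ELEMENT_NODE =>
    val attrs = n.getAttributes
    // Build the Scala attribute chain from the DOM attribute map.
    val meta = (0 until attrs.getLength).foldRight(Null: MetaData) { (i, rest) =>
      val a = attrs.item(i)
      new UnprefixedAttribute(a.getNodeName, a.getNodeValue, rest)
    }
    val kids = n.getChildNodes
    val children = (0 until kids.getLength).flatMap(i => toScala(kids.item(i)))
    Seq(Elem(null, n.getNodeName, meta, TopScope, true, children: _*))
  case JNode.TEXT_NODE | JNode.CDATA_SECTION_NODE =>
    Seq(Text(n.getNodeValue))
  case JNode.COMMENT_NODE =>
    Seq(Comment(n.getNodeValue))
  case _ =>
    Seq.empty
}

// Parse a small made-up document with the Java DOM parser, then convert it.
val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
val source = """<root a="1"><!-- note --><item>x</item></root>"""
val dom = builder.parse(new InputSource(new StringReader(source)))
val converted = toScala(dom.getDocumentElement).head

println(converted)
```

The Java DOM parser keeps comment nodes by default, which is what lets the comment show up in the converted tree.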

Yet another problem I've encountered: you might be hit by a performance problem when extracting child nodes of a large Elem using the \\ or \ operators. (The fix seems to be to loop over the child nodes instead.)
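Looping over the child nodes could look roughly like this. It is a sketch on a tiny made-up element; it only shows that both ways select the same nodes and says nothing about how much time is saved:

```scala
import scala.xml.Elem

// Hypothetical sample data; a real document would be far larger.
val big = <list><item>1</item><other/><item>2</item></list>

// Selection with the \ operator.
val viaOperator = big \ "item"

// The same selection by looping over the direct children.
val viaLoop = big.child.collect { case e: Elem if e.label == "item" => e }

println(viaLoop.map(_.text).mkString(",")) // 1,2
```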

Summary: The current Scala XML API may not work flawlessly if you want both to process rather large documents and to keep all the information of the original input XML file. It works fine, though, if you write your own XML file reader (see link above) and are careful with the use of \\ or \ on large Elems.


Here's an earlier post on Scala XML processing.

Saturday, 10 May 2008

The Scala programming language and XML

The Scala programming language is a combined scripting and "proper" language that sits on top of the Java VM. You can either run scripts, much as you would run a Ruby or Perl script, or compile your Scala classes to Java bytecode. You run a compiled Scala application much as you run a Java application, and you can even launch it with the Java VM directly (provided you add the Scala library jar file to your class path). You can mix Java and Scala code, calling Scala objects from Java and vice versa.

Scala has a feature that I have never seen in a language suitable for general programming: XML (processing) as a feature of the language. The people behind Scala have added XML to the syntax of the language itself. You do not have to load some library or use some special API for processing XML, since it's already part of the language.

It is not only that XML is valid in Scala code, but XML has its own built-in data types. For instance,

val xml = <vegetable>potato</vegetable>
is a valid Scala statement. In other words, XML elements written in a Scala program are not merely strings. The xml object can now be manipulated in various ways, much like a DOM object in Java (but with less hassle than in Java).
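A few such manipulations, as a small sketch (the variable names are made up for the example, and the scala-xml module is assumed to be on the class path):

```scala
import scala.xml.Elem

val vegetable: Elem = <vegetable colour="white">potato</vegetable>

val label = vegetable.label                   // the element name
val text = vegetable.text                     // the concatenated text content
val colour = (vegetable \ "@colour").text     // the value of the colour attribute
val renamed = vegetable.copy(label = "tuber") // a copy with a different element name

println(label)   // vegetable
println(text)    // potato
println(colour)  // white
println(renamed) // <tuber colour="white">potato</tuber>
```

Elements are immutable, so copy returns a new element and leaves the original untouched.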

You can refer to variables in your XML:
val veg = "potato"
val col = "white"
val xml = <vegetable colour={col}>{veg}</vegetable>

// The value of xml now corresponds to
// <vegetable colour="white">potato</vegetable>



You can also embed function/method calls into XML elements. Imagine that you have a method that returns a sequence of n XML elements, like this (you'll need to import scala.xml.NodeSeq and scala.xml.NodeBuffer):

import scala.xml.{NodeBuffer, NodeSeq}

def genNumElems(n: Int): NodeSeq = {
  val result = new NodeBuffer
  for (i <- 1 to n) {
    // &+ appends the element to the buffer in place
    result &+ <number value={i.toString}/>
  }
  result
}
(The odd-looking &+ operator means "add".)

You can now embed a call to genNumElems in an XML element, e.g., like this:
val numList = <number_list>{genNumElems(4)}</number_list>

Printing numList produces (line breaks added here for readability):
<number_list>
<number value="1"></number>
<number value="2"></number>
<number value="3"></number>
<number value="4"></number>
</number_list>


If you want nicer output, you can use a PrettyPrinter (imported from scala.xml):
val pp = new PrettyPrinter(100, 2) // width and indentation
println(pp.format(numList))


Reading and writing XML data/files to and from Scala is easy. The following is a one-liner that reads an XML file given as a command line argument (args(0)) and returns a list of all elements named "tr" that are child elements of any element named "table" in the XML file:
val trNodes = scala.xml.XML.loadFile(args(0)) \\ "table" \ "tr"
You may print the <tr> elements (with an empty line between each element) thus:
trNodes.foreach(tr => println(tr + "\n"))
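To try the same query without a file at hand, here is a small sketch that uses scala.xml.XML.loadString on a made-up document. Note that \\ searches all descendants, while \ only looks at direct children, so the stray tr outside the table is not selected:

```scala
import scala.xml.XML

// Hypothetical sample input; the <tr> inside <p> has no <table> parent.
val doc = XML.loadString(
  "<doc><table><tr>a</tr><tr>b</tr></table><p><tr>stray</tr></p></doc>")

// All <tr> elements that are direct children of some <table> element.
val rows = doc \\ "table" \ "tr"

println(rows.map(_.text).mkString(",")) // a,b
```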

The built-in XML support in Scala's syntax and basic libraries is not the most important or interesting feature of Scala, but it sure seems to be very useful.

(Incidentally, the table and tr elements above are present in Oocalc's (OpenOffice.org) XML format for spreadsheets.)

Update: It appears that it is not always good advice to use scala.xml.XML.loadFile to read an XML document. One reason is that comment nodes are lost. For more advanced XML processing, one should turn to scala.xml.parsing.ConstructingParser.fromFile.

Update: You may run into trouble when processing larger XML documents using the second approach. See this comment.