Thursday, 18 December 2008

Scala for small throw-away scripting tasks

I've come to use Scala for tiny scripts that are thrown away after doing some small task. Typically this involves processing a few files, comparing some textual data, maybe extracting some fields from tab-separated files, etc. The kind of thing that Perl used to be the obvious choice for.

Although lacking Perl's simplified syntax for iterating over all lines in files, Scala works quite nicely for small tasks.

For example, today I had to extract from a file all lines consisting solely of four or more upper-case letters, and print them capitalized (first letter upper-case, the rest lower-case):

scala.io.Source.fromFile(args(0))
  .getLines.map(_.stripLineEnd).filter(_.matches("[A-Z]{4,}"))
  .map(_.toLowerCase.capitalize).foreach(println)

Not exactly a thing of beauty, but it only took a minute and it works. And it reminds me a bit of a classic Unix command line pipeline.

A few things on my wish-list to make Scala even better for small scripts:
  • A nicer way of setting the output character encoding (currently you have to do something like Console.setOut(new java.io.PrintStream(Console.out,true,"UTF8")))
  • It would be great if Source.getLines could remove the newline character at the end of each line
  • A better name for RichString.stripLineEnd (for some reason, it is totally impossible for me to remember the name of this method)
  • Maybe scripting support in the Scala Netbeans plugin? (Currently, I think the plugin wants you to put your code in a class/object)
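
As a rough sketch of how the first wish could be worked around, the encoding setup can be hidden in a small helper. The name withUtf8Out is made up for this illustration (it is not part of the Scala library), and the sketch assumes a Scala version that provides Console.withOut:

```scala
import java.io.{OutputStream, PrintStream}

// Hypothetical helper (not part of the Scala library): runs `body` with
// Console output encoded as UTF-8 and written to `target`.
def withUtf8Out[T](target: OutputStream)(body: => T): T = {
  val out = new PrintStream(target, true, "UTF-8") // true = autoflush on println
  try Console.withOut(out)(body)
  finally out.flush()
}

// Usage in a script: everything printed inside the block is UTF-8 encoded.
withUtf8Out(System.out) {
  println("Åsa")
}
```

The redirection only applies inside the block, so the rest of the script keeps the default output stream.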

Friday, 12 December 2008

Scala: Reading a tab separated file into a Map (first attempt)

Below is my first attempt, in Scala, at reading a tab separated file into a map, where the first and second fields of the input file make up the key-value pairs.

There are probably better ways of doing it, but the following seems to work:

val keyValuePairs = scala.io.Source.fromFile(inputFileName, "UTF8")
  .getLines.map(_.stripLineEnd.split("\t", -1))
  .map(fields => fields(0) -> fields(1)).toList

val map = Map(keyValuePairs : _*)

The keyValuePairs : _* notation is a way to pass a list (keyValuePairs) to a method that takes a variable number of arguments, in this case the factory method of (the immutable) Map.

I'm pretty sure that there are neater ways of doing it. Furthermore, the above snippet does not do any sensible error checking or input validation (such as skipping empty lines, for instance).
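
One possible way to add the missing validation is sketched below. It assumes a newer Scala version than the one used above (one where iterators have toMap and getLines already strips line endings), and the helper name parseKeyValueLines is made up for this illustration:

```scala
// Hypothetical helper (not from the original snippet): builds a Map from
// tab-separated lines, using the first field as key and the second as value.
def parseKeyValueLines(lines: Iterator[String]): Map[String, String] =
  lines
    .map(_.split("\t", -1)) // limit -1 keeps trailing empty fields
    .collect { case Array(key, value, _*) => key -> value } // drops lines with fewer than two fields
    .toMap // if a key occurs more than once, the last value wins

// Usage with a file (inputFileName is assumed to be defined elsewhere):
// val source = scala.io.Source.fromFile(inputFileName, "UTF-8")
// val map = try parseKeyValueLines(source.getLines()) finally source.close()
```

Empty lines and lines without a tab split into a single field, so the pattern does not match and they are skipped silently. Whether silently skipping is the right behaviour depends on the task; for some scripts, failing loudly may be preferable.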

Thursday, 11 December 2008

Intelligent Software: Netbeans (or JUnit?) can count to three!

I just noticed a (very) small detail in Netbeans. I was adding some unit tests when I noticed that Netbeans can count to, at least, three.

When running a JUnit test suite of only one test, you get the message "The test passed". After adding another test, the message is "Both tests passed", then "3 tests passed", etc. (Well, of course, given that the tests pass.)

Now, that's what I call (artificial) intelligence.

Here's an unrelated article on counting to three (and more).

Tuesday, 9 December 2008

Scala: Beware of inadvertently shadowing variables

I've just spent 15 minutes looking for a stupid mistake in some Scala code. The problem was that I had shadowed a variable.

In some situations in Scala, you are allowed to shadow variables. In other words, it is sometimes legal to give a new variable the same name as an existing one. This can lead to mistakes. The following legal code illustrates how you can shadow a method parameter:

def theShadow(list: Array[String]): Seq[String] = {
  // Mistake! Inadvertently
  // shadowing the input parameter:
  val list = List("Asa", "nisi", "masa")
  list
}


(The above is a very obvious example. When you make this mistake in real code, it will probably be in a less obvious context.)
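
To make the effect concrete, here is a small sketch of what a caller sees. The argument values are arbitrary and chosen only for illustration:

```scala
def theShadow(list: Array[String]): Seq[String] = {
  val list = List("Asa", "nisi", "masa") // shadows the parameter of the same name
  list                                   // refers to the local val, not the parameter
}

// The argument is silently ignored; the method always returns the local list.
val result = theShadow(Array("something", "else"))
```

Inside the method body the parameter can no longer be reached by name, which is why the mistake can be hard to spot: the code compiles and simply returns the wrong data.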

Scala: XML serializer adds closing tags to empty elements

When printing Scala XML nodes/elements, closing tags for empty elements are added, even if there weren't any in the input.

For example, if you input <childless/>, the XML processor will add a closing tag like this:

scala> val elem = <childless/>
elem: scala.xml.Elem = <childless></childless>

(The two versions of the XML element are equivalent, but sometimes it is practical to be able to do a simple string comparison of the input and output XML files. The added closing tags may make this harder.)


See this thread.

Friday, 5 December 2008

Scala: Problems using the XML API

I've encountered some problems using the Scala XML API. The first one had to do with scala.xml.XML.loadFile throwing away comment nodes of the input XML file.

A helpful person on the scala-user list suggested using scala.xml.parsing.ConstructingParser.fromFile instead. This worked nicely, keeping the comment nodes of the input file intact. However, when processing larger XML files, this approach did not work well, resulting in out-of-memory errors.

Finally, I got yet another helpful answer on the scala-user list, this time in the form of some code that translates Java XML nodes into their Scala equivalents.

If you get into the same trouble as I did, you may want to take a look at this code snippet posted on the scala-user list by David Pollak. (You might have to change the code a bit to suit your needs, though.)

Yet another problem I've encountered: you might be hit by a performance problem when extracting child nodes of a large Elem using the \\ or \ operators. (The fix seems to be to loop over the child nodes instead.)

Summary: The current Scala XML API may not work flawlessly if you both want to process rather large documents and at the same time keep all the information of the original input XML file... but it works fine if you write your own XML file reader (see link above) and are careful with the use of \\ or \ on large Elems.


Here's an earlier post on Scala XML processing.