Thursday, 18 December 2008

Scala for small throw-away scripting tasks

I've come to use Scala for tiny scripts to be thrown away after doing some small task. Typically this involves processing a few files, comparing some textual data, maybe extracting some fields of tab-separated files, etc. The kind of things that Perl used to be the obvious choice for.

Although lacking Perl's simplified syntax for iterating over all lines in files, Scala works quite nicely for small tasks.

For example, today I had to extract from a file all lines consisting of four or more upper-case characters only, and capitalize the output:

"[A-Z]{4,}"))

Not exactly a thing of beauty, but it only took a minute and it works. And it reminds me a bit of a classic Unix command line pipeline.

A few things on my wish-list to make Scala even better for small scripts:
  • A nicer way of setting the output character encoding (currently you have to do something like Console.setOut(new java.io.PrintStream(System.out,true,"UTF8")))
  • It would be great if Source.getLines could remove the newline character of each line
  • A better name for RichString.stripLineEnd (for some reason, it is totally impossible for me to remember the name of this method)
  • Maybe scripting support in the Scala Netbeans plugin? (Currently, I think the plugin wants you to put your code in a class/object)

Friday, 12 December 2008

Scala: Reading a tab separated file into a Map (first attempt)

Below is my first attempt, in Scala, at reading a tab separated file into a map, where the first and second fields of the input file make up the key-value pairs.

There are probably better ways of doing it, but the following seems to work:

val keyValuePairs =, "UTF8")"\t", -1))
.map(fields => fields(0) -> fields(1)).toList

val map = Map(keyValuePairs : _*)

The keyValuePairs : _* notation is a way to pass a list (keyValuePairs) to a method that takes a variable number of arguments, in this case the constructor of (the immutable) Map.

I'm pretty sure that there are neater ways of doing it. Furthermore, the above snippet does not do any sensible error checking or input validation (such as skipping empty lines, for instance).

Thursday, 11 December 2008

Intelligent Software: Netbeans (or JUnit?) can count to three!

I just noticed a (very) small detail in Netbeans. I was adding some unit tests, when I noticed that Netbeans can count to, at least, three.

When running a JUnit test suite of only one test, you get the message "The test passed". After adding another test, the message is "Both tests passed", then "3 tests passed", etc. (Well, of course, given that the tests pass.)

Now, that's what I call (artificial) intelligence.

Here's an unrelated article on counting to three (and more).

Tuesday, 9 December 2008

Scala: Beware of inadvertently shadowing variables

I've just spent 15 minutes looking for a stupid mistake in some Scala code. The problem was that I had shadowed a variable.

In some situations in Scala, you are allowed to shadow variables. In other words, it is sometimes legal to give a new variable the same name as an existing one. This can lead to mistakes. The following legal code illustrates how you can shadow a method input variable:

def theShadow(list :Array[String]) : Seq[String] = {
// Mistake! Inadvertently
// shadowing the input parameter:
val list = List("Asa", "nisi", "masa")
// From here on, list refers to the local value, not the parameter:
list
}

(The above is a very obvious example. When you make this mistake in real code, it will probably be in a less obvious context.)

Scala: XML serializer adds closing elements to empty elements

When printing Scala XML nodes/elements, closing tags for empty elements are added, even if there weren't any in the input.

For example, if you input <childless/>, the XML processor will add a closing tag like this:

scala> val elem = <childless/>
elem: scala.xml.Elem = <childless></childless>

(The two versions of the XML element are equivalent, but sometimes it is practical to be able to do a simple string comparison of the input and output XML files. The added closing tags may make this harder.)

See this thread.

Friday, 5 December 2008

Scala: Problems using the XML API

I've encountered some problems using the Scala XML API. The first one had to do with scala.xml.XML.loadFile throwing away comment nodes of the input XML file.

A helpful person on the scala-user list suggested instead using scala.xml.parsing.ConstructingParser.fromFile. This worked nicely, keeping the comment elements of the input file intact. However, when processing larger XML files, this approach did not work well, resulting in out of memory exceptions.

Finally, I got yet another helpful answer on the scala-user list, this time in the form of some code, translating Java XML nodes into the Scala equivalents.

If you get into the same trouble as I did, you may want to take a look at this code snippet posted on the scala-user list by David Pollak. (You might have to change the code a bit to suit your needs, though.)

Yet another problem I've encountered: you might be hit by a performance problem when extracting child nodes of a large Elem using the \\ or \ operators. (The fix seems to be to loop over the child nodes instead.)

Summary: The current Scala XML API may not work flawlessly if you both want to process rather large documents and at the same time keep all the information of the original input XML file... but it works fine if you write your own XML file reader (see link above) and are careful with the use of \\ or \ on large Elems.

Here's an earlier post on Scala XML processing.

Wednesday, 19 November 2008

Scala: New Netbeans 6.5 plugin

There is a new version of a Netbeans 6.5 plugin for Scala programming.

The Scala plugin already seems quite useful, and it's getting better and better with each new version.

Check it out here. This is a link to the blog of the author of the plugin.

By the way, Netbeans 6.5 was just released too.

Tuesday, 18 November 2008

Scala: The Map += method expects a Tuple: += ((k, v))

In Scala, you use the += method to add a key-value pair to a Map. The key-value pair should be in the form of a Tuple, or a Pair. You can use different syntax for such pairs: ("year", 2008), "year" -> 2008, Tuple2("year", 2008) or Pair("year", 2008):

scala> ("year",2008) == "year" -> 2008
res0: Boolean = true

scala> "year" -> 2008 == Pair("year", 2008)
res1: Boolean = true

scala> Pair("year", 2008) == Tuple2("year", 2008)
res2: Boolean = true

Thus, a few different but equal ways of adding a key-value pair to a Map:

scala> val map = new scala.collection.mutable.HashMap[String,Int]
map: scala.collection.mutable.HashMap[String,Int] = Map()

scala> map += (("year",2008)) //Notice the parentheses
scala> map += ("year" -> 2008)
scala> map += Pair("year",2008)
scala> map += Tuple2("year", 2008)

However, this one fails, because of missing parentheses:

scala> map += ("year",2008)
<console>:6: error: type mismatch;
found : java.lang.String("year")
required: (String, Int)

You can check out, e.g., this and this thread on the Scala mailing list.

Monday, 10 November 2008

Scala: Converting Java collections into their Scala counterparts

In the scala.collection.jcl library, you'll find Scala wrappers, adding Scala methods to Java collections. This means that a Java collection (e.g., an ArrayList) will be converted to work as a Scala collection, making it possible to call foreach on an ArrayList, etc:

import scala.collection.jcl.Conversions._

val a = new java.util.ArrayList[String]
a.add("Asa")
a.add("nisi")
a.add("masa")

// foreach now works on a Java List:
a.foreach(println)

Similarly, you can now call .mkString on a Java list:

// Let's use mkString to print the
// ArrayList contents as a Prolog spell/3 fact:

println(a.mkString("spell('", "', '", "')."))

// -> spell('Asa', 'nisi', 'masa').

See this Scala mailing list thread.

Scala: You cannot run a companion object as a stand-alone program

Update: In Scala 2.8, the below is no longer true. A companion object can now work as the entry point of an application.


In the Scala programming language, a companion object is an object with the same name as a class in the same source file. (Scala's companion objects can be used similarly to Java's static methods.)

An object definition on its own can function as the entry point for running a Scala program. Compiling and running this object works fine:

object heyYouTheRocksteadyCrew{
def main(args :Array[String]) {
println("Make a break!")
}
}


However, if you try to run the same object when it is a companion object to a class with the same name, this will result in an exception:

class heyYouTheRocksteadyCrew{}

object heyYouTheRocksteadyCrew{
def main(args :Array[String]) {
println("Make a move!")
}
}


The above is true of the current release. (Until this is fixed, these guys will not be too happy about any stupid heyYouTheRocksteadyCrew-exception...!)

There is at least one thread about the above on the Scala mailing list.

Monday, 27 October 2008

New version of FlameRobin, Admin GUI for Firebirdsql

FlameRobin 0.9.0 is released

There is a new version of FlameRobin, a GUI for creating and managing Firebird databases. This version introduces tabbed browsing. In earlier versions of FlameRobin, you would end up with an unmanageably large number of open windows after a short while. This problem is now largely gone. (However, not everyone found this to be a problem --- see one of the comments below.)

I've mostly used FlameRobin for inspecting existing databases, for some minor editing of (existing) stored procedures, and for querying the database. The overall judgment is that it is a fine and very useful piece of software.

Update: There is no pre-compiled version for 64-bit Linux, so if this is what you need, you're on your own... (we gave up on compiling from source after 15 minutes of library dependency hell). This was on an Ubuntu machine.

Update of the update: Apparently, there was already a pre-compiled 64-bit Ubuntu version available! To find the information about the Ubuntu repository, you should not go to the download page that the announcement of the new version points to; you should go here. Thanks, mariuz, for pointing this out (see comment below).

Friday, 5 September 2008

Scala: String vs RichString oddities

Update: In Scala 2.8, the below is no longer true. String.reverse now returns a String rather than a RichString:

scala> "a".reverse == "a"
res0: Boolean = true


In the Scala programming language, there is a class called RichString that adds features to the underlying Java String. In the current version of Scala, this leads to some odd behaviour:
"Im a string" == "Im a string".reverse.reverse
returns false, while
"Im a string" == "Im a string".reverse.reverse.toString
returns true!

Just to make your head spin, the following code does indeed work as expected:
val str :String = "Im a string".reverse.reverse
println(str == "Im a string") // prints "true"
while
val str = "Im a string".reverse.reverse
println(str == "Im a string") // prints "false"
does not.

The explanation is that String.reverse returns a RichString, and that == returns false when comparing a String and a RichString, even though it is the "same" string (as in the example above).

If I understand it correctly, this oddity will be fixed in future releases of Scala.

(And no, Scala's == is not the same as Java's ditto. It means "equal objects" rather than "refers to the same instance of an object".)

Scala mailing list item here.

Case insensitive pattern matching of Unicode strings in Java

To do case-insensitive pattern matching of Unicode strings in Java, you can call Pattern.compile with a second argument, like this:

Pattern p = 
Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

(This is useful when dealing with non-ASCII/non-Latin1 text, such as Cyrillic. However, it may not work flawlessly for the Turkish Unicode characters.)

Update: I just learned that there is a nicer way of doing this: start the patternString above with "(?iu)":
Pattern p = 
Pattern.compile("(?iu)"+ patternString);
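
To see what the two flags buy you, here is a small self-contained sketch. The class name UnicodeCaseDemo, its helper methods and the sample word are made up for illustration; only Pattern.compile, the two flags and the "(?iu)" prefix come from the text above. By default, CASE_INSENSITIVE only folds US-ASCII letters, so the Cyrillic comparison should fail until UNICODE_CASE is added:

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // CASE_INSENSITIVE alone: case folding is limited to US-ASCII letters.
    static boolean asciiOnly(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // CASE_INSENSITIVE | UNICODE_CASE: case folding also covers non-ASCII letters.
    static boolean unicodeAware(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input).matches();
    }

    // The embedded flag expression (?iu) switches on the same two flags.
    static boolean embeddedFlags(String regex, String input) {
        return Pattern.compile("(?iu)" + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample data: a Cyrillic word in lower and upper case,
        // written with Unicode escapes so the source file encoding does not matter.
        String lower = "\u0436\u0443\u043a";
        String upper = "\u0416\u0423\u041a";

        System.out.println(asciiOnly(lower, upper));     // false
        System.out.println(unicodeAware(lower, upper));  // true
        System.out.println(embeddedFlags(lower, upper)); // true
    }
}
```

Plain ASCII input such as "abc" against "ABC" matches with all three variants, which is why the missing flag is easy to overlook when testing with English text only.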

Saturday, 30 August 2008

Scala and implicit conversion: Turning a string into pure Weirdness

In the Scala programming language, you can turn water into wine, or vice versa, using implicit conversion.

Imagine that you have a class called Weird:

class Weird(s :String) {
def imWeird :String = {
"I'm "+ s +" and I'm weird!"
}
}

It consists of merely a string, s, and a method, imWeird, that returns a jolly message containing the very same string. (Thus, the code
val freak = new Weird("a freak")
println(freak.imWeird)
outputs I'm a freak and I'm weird!.)

Now, Scala allows you to create an implicit conversion that adds the method(s) of Weird to any other class. Or rather, turns an object into a Weird whenever one calls Weird's methods (functions) on the given object.

For example, the following implicit conversion
  implicit def string2Weird(s: String) = new Weird(s)
makes it possible to call Weird's method(s) on a String. This code
val happy = "Happy"
println(happy.imWeird)
will now output
  I'm Happy and I'm weird!

The name of the implicit conversion method, string2Weird, is arbitrary.

Friday, 18 July 2008

Turning the lights on (and off) from your computer

I just bought a Telldus Technologies TellStick. It is a wireless USB-device that can be used for, e.g., turning on and off the light (if it is equipped with a suitable receiver).

The TellStick software is free, and it works under Linux. It even comes with a Java API, and a NetBeans project with some sample code. However, to get the Java binding to work, you have to install some strange libraries (rxtx).

You can control any number of receivers using your TellStick.

Next time a program throws an exception, it will be able to turn on (or off) a lamp somewhere in the office...

Wednesday, 21 May 2008

Saldo 1.0: Large, freely available Swedish morphologic and semantic lexicon

Språkbanken has published a large, freely available Swedish lexicon, Saldo, "a Swedish basic language resource". The release appears to include some 68,000 uninflected lemma forms as well as more than 740,000 expanded (full) word forms. There is morphologic and semantic information.

This resource should be valuable for part-of-speech tagging, lemmatizers, spell-checking, (semantic) analysis of Swedish text, etc, etc.

The release includes software for interfacing with the lexicon. (Those of you into functional programming might be interested in the fact that the lexicon software is written in Haskell.)

It is released under the LGPL license.

Friday, 16 May 2008

Scala one-liner for upcasing lines of text

The following is a Scala script that up-cases each line of a UTF8 encoded input file (args(0)) and prints the result to standard output:



Source.fromFile(args(0), "UTF8").getLines.foreach(line => print(line.toUpperCase))

If you're trusting the default character encoding to work for you, you may reduce it to:


Source.fromFile(args(0)).getLines.foreach(line => print(line.toUpperCase))

Another way to do it is to read the lines into an iterator, using the iterator's .map method to upcase each line:


val lines = Source.fromFile(args(0)).getLines
lines.map(line => line.toUpperCase).foreach(print)


A Java programmer may be relieved (or horrified) to learn that Scala does not have any checked exceptions. There are only runtime exceptions, and you don't need to add any try/catch statements if you don't want to.

When you run a Scala script, you can instruct the Scala interpreter to compile the script, and use the compiled version (a jar file) if it's newer than the source file. This gives better performance (shorter start-up, etc). You use the -savecompiled command line argument.

Saturday, 10 May 2008

The Scala programming language and XML

The Scala programming language is a combined scripting and "proper" language, that sits on top of the Java VM. You can either run scripts similar to how you run a Ruby or Perl script, or compile your Scala classes to Java bytecode. You run a Scala application similar to how you run a Java application. You can also run a Scala application using the Java VM (but you have to add the Scala library jar file to your class path). You can mix Java and Scala programs, calling Scala objects from Java, and vice versa.

Scala has a feature that I have never seen in a language suitable for general programming: XML (processing) as a feature of the language. The people behind Scala have added XML to the syntax of the language itself. You do not have to load some library or use some special API for processing XML, since it's already part of the language.

It is not only that XML is valid in Scala code, but XML has its own built-in data types. For instance,

val xml = <vegetable>potato</vegetable>
is a valid Scala statement. In other words, XML elements written in a Scala program are not merely strings. The xml object can now be manipulated in various ways, much like a DOM object in Java (but with less hassle than in Java).

You can refer to variables in your XML:
val veg = "potato"
val col = "white"
val xml = <vegetable colour={col}>{veg}</vegetable>

// The value of xml now corresponds to
// <vegetable colour="white">potato</vegetable>

You can also embed function/method calls into XML elements. Imagine that you have a method that returns a sequence of n XML elements, like this (you'll need to import scala.xml.NodeSeq and scala.xml.NodeBuffer):

def genNumElems(n :Int) :NodeSeq = {
val result = new NodeBuffer
for(i <- 1 to n) {
result &+ (<number value={i.toString}/>)
}
result
}
(The odd-looking &+ operator means "add".)

You can now embed a call to genNumElems in an XML element, e.g., like this:
val numList = <number_list>{genNumElems(4)}</number_list>

Printing numList produces:
<number value="1"></number>
<number value="2"></number>
<number value="3"></number>
<number value="4"></number>

If you want nicer output, you can use a PrettyPrinter (that you import from scala.xml._):
val pp = new PrettyPrinter(100,2) // width and indentation
println(pp.format(numList))

Reading and writing XML data/files to and from Scala is easy. The following is a one-liner that reads an XML file given as a command line argument (args(0)) and returns a list of all elements named "tr" that are child elements to any elements called "table" of the XML file:
val trNodes = scala.xml.XML.loadFile(args(0)) \\ "table" \ "tr"
You may print the <tr> elements (with an empty line between each element) thus:
trNodes.foreach(tr => println(tr + "\n"))

The built-in XML support in Scala's syntax and basic libraries is not the most important or interesting feature of Scala, but it sure seems to be very useful.

(Incidentally, the table and tr elements above are present in Oocalc's XML format for spreadsheets.)

Update: It appears that it is not always good advice to use scala.xml.XML.loadFile to read an XML document. One reason is that comment elements are lost. For more advanced XML processing, one should turn to scala.xml.parsing.ConstructingParser.fromFile.

Update: You may run into trouble when processing larger XML documents using the second approach. See this comment.

Sunday, 27 April 2008

Bye bye, Ubuntu, Hello Debian

Sadly, the new version of Ubuntu, 8.04, didn't accept my laptop (a few-years-old Acer TravelMate 290, without any strange hardware). I couldn't find any information on how to resolve the problems I ran into, so I had to ditch Ubuntu, and replace it with Debian 4.0.

This was a pity, since the new Ubuntu looked quite promising. The install is incredibly easy and rather quick. Apart from the desktop background image, the new system looks and feels good. They appear to have made good choices when it comes to the pre-installed software. But this doesn't help when Ubuntu fails to shut down the computer properly.

Installing Debian is not as straightforward, but still not very hard. It took a little longer, mostly because I used the net installer that grabs the software packages from the internet and not from the installation CD-ROM.

However, compared to Ubuntu, it takes some more fixing after the installation to get a system that you are comfortable with. For instance, the Debian people appear to think that you should prefer a web browser called Epiphany to Firefox... They don't even offer you the standard Firefox browser, but their own version, "Iceweasel". (There seems to be a totally silly reason as to why Firefox is not called Firefox.)

Worse, the default fonts did not look good on my laptop, so I had to install new fonts (by running apt-get install msttcorefonts, I think?).

A bit surprisingly, Debian supports playing mp3 files without installing additional libraries.

After a bit of tweaking, Debian feels nice. Still, I would prefer a working version of Ubuntu. It would be interesting to know what went wrong in the relationship between Ubuntu 8.04 and my laptop.

Friday, 25 April 2008

Ubuntu 8.04 revitalised my laptop, but I'm still not happy...

Ah, I just did a clean install of Ubuntu 8.04 (aka "Gniffly Gnaffly") on my Acer laptop. The net update failed, so I had to burn an install CD. I didn't mind, however, since last time I did an upgrade of the laptop, something strange happened, and it became incredibly slow.

With Ubuntu 8.04 installed, the laptop is back on track. It starts fast, and everything (that I care about) seems to work.

The first things to do after install are to change the desktop background image (the default depicts an oil-drenched dead bird?), turn off the system sounds (including the beep), and turn off all visual effects.

Update: No! Ubuntu still doesn't behave well. It turns out that sometimes when I turn off the laptop, it isn't turned off correctly! Ubuntu goes down, the screen goes black, but still, the laptop is not properly turned off (both the indicator that the computer is on, and the indicator that the hard drive is working keep glowing...). Gah.

Furthermore, I've noticed some instances of ill-boding flickering of the screen.

Maybe it is time to go back to Debian.

Tuesday, 22 April 2008

Keeping empty fields when splitting tab separated lines in Java

Frequently, I process text files containing tab separated data. Sometimes these have empty columns, i.e., two or more tabs without any data between them. More often than not, I want to keep the empty fields. However, Java's String.split defaults to removing trailing empty fields.

This is what you do to keep the empty fields:

String[] fields = string.split("\t", -1);

In the following example, the test string tst will be split into zero parts (result1) and four parts (result2) respectively:
String tst = "\t\t\t";
String[] result1 = tst.split("\t"); //result1.length == 0
String[] result2 = tst.split("\t", -1); //result2.length == 4
result2 will contain four instances of the empty string ("").

The same thing goes when you split a string using a pre-compiled regular expression:
Pattern pattern = Pattern.compile("\t");
String[] result3 = pattern.split(tst); //result3.length == 0
String[] result4 = pattern.split(tst, -1); //result4.length == 4
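
The all-tabs string above is an extreme case. On a more typical row, the default split keeps empty fields in the middle and only drops the ones at the end. Here is a minimal sketch of that difference; the class name SplitDemo, the helper names and the sample row are made up for illustration:

```java
import java.util.Arrays;

public class SplitDemo {
    // Default split: empty fields at the end of the line are dropped.
    static String[] splitDroppingTrailing(String line) {
        return line.split("\t");
    }

    // Negative limit: every field is kept, including trailing empty ones.
    static String[] splitKeepingAll(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Hypothetical row with five columns; the second, fourth and fifth are empty.
        String row = "a\t\tc\t\t";

        System.out.println(Arrays.toString(splitDroppingTrailing(row))); // [a, , c]
        System.out.println(Arrays.toString(splitKeepingAll(row)));       // [a, , c, , ]
    }
}
```

With the default split, rows of the same file can end up with different field counts depending on where the empty columns happen to be, which is easy to miss until fields[3] throws an ArrayIndexOutOfBoundsException.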

By the way, I compared the performance of the two variants above (String's split and a pre-compiled pattern matching a tab). Luckily, the difference in performance was negligible, the compiled pattern winning with a small margin. When the split pattern is more complicated, I would expect bigger performance differences between compiled and uncompiled regular expressions. (Running Sun's java command with and without the -server argument made a big difference, however. The default client was significantly slower.)

Tuesday, 8 April 2008

Book: Dreaming in Code by Scott Rosenberg

Title: Dreaming in Code - Two dozen programmers, three years, 4,732 bugs, and one quest for transcendent software
Author: Scott Rosenberg
ISBN: 978-1-4000-8247-6

This book reports how a well-funded, ambitious software project failed. For several years, the author followed the work of a group of developers employed to produce a revolutionary piece of software. The non-profit organisation set up for the task did not have any customers, but was funded by an idealist with deep pockets. Many of the people involved appear to be quite experienced and knowledgeable. The project is called Chandler, and the open source organisation is called OSAF.

The theme of the book is that producing software is complicated and that it is hard to predict whether a project will succeed or fail, and that there is no obvious solution to this problem. The book is aimed at non-programmers. The first 80 or so pages give a background on software development intertwined with the story of the software project. It takes a while for the book to get going. If you are familiar with programming and different programming concepts, you might find some of these pages less interesting. However, the author manages to explain things such as object oriented programming, open source, scripting languages, the halting problem, etc, for a non-programmer and without simplifying too much.

The author appears to be interested in his subject, and understands it well. He is also a good writer. The author does a good job of explaining and exemplifying how hard software development can be. (Still, I cannot keep from thinking, that with a tenth of the funding of the project he describes, my company could do wonders...) But the text is too long, and the problems of the project he is describing are in an exaggerated way generalised into problems of all software development.

This may have been a hard book to write, since I suspect that at the outset, the author figured that he would be describing what should turn out to be a successful project. Instead he had to describe and explain a failure. This is probably why the book to a large extent discusses different software failures. The author has done quite a lot of research, and describes different methodologies for software development meant to reduce the risk and to ensure the quality of software development.

On the whole the book is interesting, but rather pessimistic concerning the state of software development. While perhaps not an inspiring text, you should be able to pick up a few things to stay clear of, though.

With the help of some more editing, maybe the book could have been a little shorter (and better).

A side note: The book discusses the problem of producing reusable code. This is perhaps not the most central theme of the book, but every time I hear about the failure of software producers to create reusable code, I cannot help but reflect that almost every day, I reuse code. If you program in Java, for instance, there is a huge set of reusable libraries for almost everything: XML processing, GUI building, cryptography, email, sound, sorting, hash tables, databases... the list of components you do not have to implement yourself, but can use as building blocks for new applications, goes on and on. You can even find software for automatically producing code (some machine learning approaches, for instance). There are programming languages and environments that sit on top of other such software: programming languages reusing other programming languages...! (See for example the Scala or JRuby languages.)
From this perspective, there is a breathtaking amount of (freely) available, high quality software about.

Saturday, 5 April 2008

Automatic Simple backups: SBackup

A colleague told me about a simple backup utility called... Simple Backup (or SBackup). With the help of Simple Backup, you can very easily do hard-drive backups (and restore the backups if needed). If you're using Ubuntu, you will find it with the help of apt-get, Synaptic or under Applications>Add/remove....

At a small office, SBackup may be suitable for doing daily, automatic PC-backups. You configure it to do incremental backups with a frequency of your own choice. You can tell SBackup to put the backups on a remote server through ssh. Notice that the connection settings are in clear text, i.e., your password for ssh-ing will be readable for anyone with access to your computer! (Thus, you should be a bit careful with how you use SBackup.)

A nice feature is that the backups are in tar.gz format, i.e., you can use standard tools to read the backed up files. You can also tell SBackup what local directories to include or exclude from the backups. Once configured, the only thing you may need to care about is that you have enough disk space on the machine that holds the backups. If you do incremental backups of a number of computers, the backups may grow quite large. (However, they will not grow infinitely large, since SBackup can take care of removing old/redundant backups.)

More info here.

Update: Due to a reboot of the target server, SBackup silently stopped doing its backups. It had to do with obsolete ssh keys, most likely. On the Ubuntu clients, the problem can be fixed by removing the known_hosts file from the root home directory

sudo rm /root/.ssh/known_hosts
or by removing the same file from the user home
rm ${HOME}/.ssh/known_hosts
(It appears to work differently on different versions of Ubuntu.)

After this, we started the SBackup configuration GUI and tested the destination (you need to answer a question before it works again).

The same goes for moving the backup destination to a different server. SBackup silently stops working. You have to delete the known_hosts file, as above.

Friday, 28 March 2008

No Web Start for 64-bit Sun Java

Sun does not include Java Web Start in its 64-bit version of Java. It appears that Sun thinks that you are not supposed to run Web Start on 64-bit machines, since these mostly are servers (?), and... eh... sorry, I cannot follow their reasoning. Let's hope they change their minds.

I haven't tried it myself, but here is a description of how to run 32-bit Java Web Start on 64-bit Ubuntu.

Update: At the time of writing this, an AMD64 version of Java Web Start is at the top of Sun's Request for Enhancements list.

Update: There will be support for 64-bit Java Web Start in an upcoming release, 1.6.0_12 (I think). Ismael Juma points out that an early access release is available. See his comment below.

Wednesday, 26 March 2008

Frequency list bash function

In addition to command aliases (see an earlier post), you can add your own functions to the bash shell. Here is a simple but useful command line sequence:

function freq() {
sort "$@" | uniq -c | sort -rn;
}

Put it in ~/.bashrc and you will have a freq command for creating frequency lists:
freq <FILES>
will sort and count all identical lines of the input file(s), and present them in descending frequency. Useful in many situations, not the least for checking that files that are supposed to only contain unique lines actually do so.

(I'm not too sure about bash function syntax, but the function above seems to do its work.)

If you're not familiar with the different commands of the pipeline above, there is plenty to read (e.g., egrep for linguists).

Tuesday, 25 March 2008

Favourite bash command line aliases

My favourite bash aliases currently are

alias hist='history|egrep'

alias ös='ls'

The second one for the reason that 'ö' sits next to 'l' on my Swedish keyboard, and when I intend to type 'ls' I type 'ös' more often than not. The one I use the most, however, is alias m='more' (I also have the classic mroe='more' and moer='more' to catch some frequent typos).

The first one, hist, makes it possible to use regular expressions to search the history of earlier shell commands. This is useful when you cannot remember some tricky command line sequence, or are too lazy to type some long command that you know you issued the other day.

For instance

hist 'java|ruby'

will print any previous command (in bash's history) containing either of the two strings.

(Well, I think you can accomplish the same thing using the original history command, but to paraphrase Morrissey, now my head is full, and my brain doesn't have room for more cryptic command line arguments.)

You can put your bash aliases in ~/.bashrc.

(Thanks to Chris for spotting a (now corrected) mistake in the first example. See the comment below.)

Update: Hey, check out the comment by Anonymous below: Ctrl-r seems useful for searching the Bash history!

Thursday, 20 March 2008

Beware of Sun's Java equalsIgnoreCase --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur {
 public static final void main(final String[] args) throws Exception {
  // Make Turkish the default locale, and print UTF-8 to stdout
  Locale.setDefault(new Locale("tr"));
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I";
  String s2 = "ı";
  String s3 = "i";

  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}

Now, what do you think the above code prints? You would expect that

s1.equalsIgnoreCase(s2)

is exactly the same as

s1.toLowerCase().equals(s2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false

I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish 'i', as well as "Latin" 'I' and Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. A little late for that now.
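
One way to keep such comparisons predictable is to pass an explicit locale to toLowerCase instead of relying on the default one. The following is only a sketch of that idea: the class TurkishCase and the helper equalsLower are invented for this example, and whether Locale.ROOT or the Turkish locale is the right choice depends on what your data means.

```java
import java.util.Locale;

// Hypothetical demo class; the names are assumptions for illustration only.
class TurkishCase {

    // Case-insensitive comparison that uses the case rules of one explicit locale,
    // so the result does not depend on the JVM's default locale.
    static boolean equalsLower(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotlessI = "\u0131"; // Turkish lower-case dotless i

        System.out.println(equalsLower("I", "i", Locale.ROOT));    // true
        System.out.println(equalsLower("I", "i", turkish));        // false
        System.out.println(equalsLower("I", dotlessI, turkish));   // true
        System.out.println("I".equalsIgnoreCase("i"));             // true
    }
}
```

With Turkish rules, upper-case 'I' lower-cases to the dotless 'ı', which is why the second comparison fails while the locale-neutral one succeeds.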

Tuesday, 18 March 2008

Firebird vs Postgresql

We have similar databases running on MySql, Postgresql and Firebird. One of the reasons for moving away from MySql was the fact that the UTF8 support didn't work properly. I cannot remember the details, but it had to do with non-Latin-1 data, such as text in Czech or Russian. In some situations MySql refused to correctly identify equal UTF8 strings. You put in some word that you cannot retrieve again, bleh!

Furthermore, we've never understood how the user permissions are supposed to work in MySql (we always end up frantically running all possible variants of the GRANT ALL command).

We moved to Postgresql, which worked a lot better. Now we've started using Firebird, which also seems like a very nice piece of software.

Here is a list of a few things I've noticed when moving from Postgresql to Firebird:

* Firebird lacks built-in support for regular expressions. (We make heavy use of complex string searches of natural language data. If we hadn't got help from an expert, who helped us compile some user defined functions, UDFs, for this purpose, this would have been a show-stopper.)

* Postgres' psql command line tool is better than Firebird's isql(-fb). (If you are a Windows user, see Carlos' comment below)

* Firebird database files grow and grow. This is true even if you delete data. You have to manually back up and restore a database to reclaim disk space. Maybe this is not a great problem in normal usage, but I noticed that the databases I use for running test suites against keep growing, though the test database itself is quite small (and the data are cleared out between test runs). [Update: Please notice that long-time users of Firebird insist that this is not a problem. See Carlos', Sergio Marcelo's and also Michal's comments below.]

* I've never had any luck installing Firebird from a Debian package. I have had to do a manual install to get it to work.

* Firebird has a useful GUI, FlameRobin, that lets you inspect and change your databases. FlameRobin comes with an editor useful for writing/editing stored procedures. The editor has code completion, which helps you with suggestions of table and column names and the like as you type.

* Firebird has a nice way to manage database files: all tables of a database end up in a single file, which you can name whatever you like, and put wherever you like.

* It appears to be easier to find useful documentation for Postgres than for Firebird (but Firebird does have a nice FAQ site)

Answer to Darius Damalakas' comment below: I'm not the right person to comment on the performance of the different DBMSs. However, we haven't noticed any significant difference in performance between MySql, Postgresql and Firebird. Currently, the bottlenecks in our software are to be found outside of the databases, so the performance of the individual DBMSs has not been a big concern. They're all fast enough.

Firebird does seem to be a snappy system, and I would be surprised to find it performing worse than Postgres.

So far, the only difference in features that has mattered to us, is the lack of built-in support for regular expressions in Firebird (see above). In all other respects (of importance to us), the functionality of Postgres and Firebird seems equivalent.

Update: Support for regular expressions is scheduled for the upcoming 2.5.0 release of Firebird.

Update: In response to an anonymous (and rather critical) comment, mariuz has added some useful links in a comment below.

Update: In a comment below, Michal has posted some information on DatabaseGrowthIncrement, taken from the release notes of Firebird 2.1.

Saturday, 15 March 2008

Beware of Firebird 2.0 Debian package

We are migrating an application to the Firebird 2.0 database manager. Our server runs Debian (AMD64), and we used the Firebird 2.0 (superserver) Debian package as suggested in the Firebird site's FAQ section. However, when the package was installed, it appears to have silently overlooked a dependency, missing a library necessary for getting the "user defined functions", UDFs, to work correctly. (Firebird didn't find the UDFs, resulting in runtime errors when calls to the functions were issued from a Firebird database.)

We made sure that Firebird as well as the UDFs were all compiled for AMD64.

When uninstalling the apt-get Firebird package, and manually installing Firebird 2.0.3 from the standard .tar.gz file, the missing dependency was spotted, and the database could be properly installed. Unfortunately, I didn't keep a record, but it might have been the correct version of libstdc++ that was missing.

As far as I can remember, this is the only time a Debian apt-get package has failed me. In addition to the fact that the apt-get install of Firebird might be broken, you have to be careful not to apt-get install "Firebird 2", since this will give you Firebird 1.5! Peculiar. (But see the comment from mariuz below.)

I had a similar experience the first time I tried to install Firebird from a Debian package. This was Firebird 1.5 (the Firebird 2 Debian package), before Firebird 2.0 was released. I never got that one to run either, but had to install the tar.gz version obtained from the official Firebird webserver. I can't remember exactly what went wrong, but it was impossible to get the Debian package we tried to work. The manual install worked perfectly, just as it did this time.

Update: The Debian Firebird2 package (containing Firebird 1.5) appears to be discontinued.

Don't concatenate Java strings using +=

The other day, I ran into a Java performance problem. It was an extremely simple Scanner loop, reading a file of some 20,000 lines of text, concatenating the lines into one single string:

Scanner sc = new Scanner(new File(fName), "UTF8");
String result = "";
while (sc.hasNextLine()) {
    result += sc.nextLine(); //Avoid this!
}

// Do something with result

The above loop took an incredibly long time to finish, and I had no clue what could possibly be wrong. A colleague glanced at the code and said "StringBuilder". I had forgotten about the poor performance of string concatenation using += (or +) in a loop: each += copies the whole string built so far, so the total work grows quadratically with the number of lines. I must have thought that this was a problem of the past.

Replacing the += with a StringBuilder resulted in excellent performance:

Scanner sc = new Scanner(new File(fName), "UTF8");
StringBuilder result = new StringBuilder();
while (sc.hasNextLine()) {
    result.append(sc.nextLine());
}

// Do something with result.toString()

Update: ttaveira points out that you may gain some additional speed by initializing the StringBuilder to a suitable capacity. See the comment below.
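
A rough sketch of the capacity idea, assuming the lines are already in memory so that their total length can be summed up front. The class JoinLines and the method concat are made-up names for this example. When reading straight from a file, the file size in bytes could serve as an estimate instead, since a UTF-8 file never holds more chars than bytes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are assumptions for illustration only.
class JoinLines {

    // Concatenates the lines into one string. The StringBuilder is created
    // with the exact final capacity, so it never has to grow while appending.
    static String concat(List<String> lines) {
        int capacity = 0;
        for (String line : lines) {
            capacity += line.length();
        }

        StringBuilder result = new StringBuilder(capacity);
        for (String line : lines) {
            result.append(line);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat(Arrays.asList("first", "second", "third")));
    }
}
```

Whether the pre-sizing is measurable for a 20,000-line file is something I have not benchmarked; the big win is dropping += in the first place.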

Friday, 15 February 2008

Reading/writing non-default character encoded data in Java

When in an environment where the default (system) character encoding differs from the desired character encoding of the output data, you can use System.setOut and System.setErr. For reading data of a different character encoding than the default encoding, you can tell e.g. the Scanner class what character encoding to expect.

The following could be used for reading and writing UTF8 data on a system where the default character encoding may be different from UTF8:

System.setOut(new PrintStream(System.out,true,"UTF8"));
System.setErr(new PrintStream(System.err,true,"UTF8"));

Scanner scanner = new Scanner(new File(fileName), "UTF8");

// Read input lines,
while (scanner.hasNextLine()) {
    String line = scanner.nextLine();
    line = doSomething(line);
    // Write some output to STDOUT/STDERR
    System.out.println(line);
}

The boolean second argument of the PrintStream constructor activates autoflush. The constructor that wraps an existing stream and takes an encoding name requires this argument, but you can pass false if you do not want autoflush.
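
To see what the encoding argument actually does, here is a small sketch that prints the same text through PrintStreams with different encodings and counts the resulting bytes. The class EncodingDemo and the method printAs are invented for this illustration, and the in-memory ByteArrayOutputStream stands in for System.out.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; the names are assumptions for illustration only.
class EncodingDemo {

    // Prints the text through a PrintStream using the given encoding
    // and returns the bytes that ended up in the underlying stream.
    static byte[] printAs(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00f6s"; // "ös"
        System.out.println(printAs(text, "UTF-8").length);      // 3
        System.out.println(printAs(text, "ISO-8859-1").length); // 2
    }
}
```

The 'ö' takes two bytes in UTF-8 but only one in ISO-8859-1, which is exactly the kind of difference that garbles output when the terminal and the program disagree about the encoding.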