Friday 28 March 2008

No Web Start for 64-bit Sun Java

Sun does not include Java Web Start in its 64-bit version of Java. It appears that Sun thinks that you are not supposed to run Web Start on 64-bit machines, since these are mostly servers (?), and... eh... sorry, I cannot follow their reasoning. Let's hope they change their minds.

I haven't tried it myself, but here is a description of how to run 32-bit Java Web Start on 64-bit Ubuntu.

Update: At the time of writing this, an AMD64 version of Java Web Start is at the top of Sun's Request for Enhancements list.

Update: There will be support for 64-bit Java Web Start in an upcoming release, 1.6.0_12 (I think). Ismael Juma points out that an early access release is available. See his comment below.

Wednesday 26 March 2008

Frequency list bash function

In addition to command aliases (see an earlier post), you can add your own functions to the bash shell. Here is a simple but useful command line sequence:

function freq() {
# "$@" keeps file names containing spaces intact (unquoted $* would split them)
sort "$@" | uniq -c | sort -rn
}

Put it in ~/.bashrc and you will have a freq command for creating frequency lists:
freq <FILES>
will sort and count all identical lines of the input file(s), and present them in descending order of frequency. Useful in many situations, not least for checking that files that are supposed to contain only unique lines actually do so.

(I'm not too sure about bash function syntax, but the function above seems to do its work.)

If you're not familiar with the different commands of the pipeline above, there is plenty to read (e.g., egrep for linguists).
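For readers more at home in Java than in the shell, here is a rough sketch of what the pipeline computes: count identical lines, then list them with the most frequent first. The class name FreqList and the method freq are made up for this illustration, the output format is simplified compared to uniq -c, and ties are broken alphabetically, which is an assumption of this sketch and may differ from what sort -rn does.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FreqList {
    // Count identical lines and return "count line" strings, most frequent first.
    // Ties are ordered alphabetically (an arbitrary choice for this sketch).
    static List<String> freq(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            out.add(e.getValue() + " " + e.getKey());
        }
        return out;
    }

    // Usage: java FreqList FILE [FILE...]
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String name : args) {
            lines.addAll(Files.readAllLines(Paths.get(name))); // reads as UTF-8
        }
        for (String row : freq(lines)) {
            System.out.println(row);
        }
    }
}
```

For everyday use the three-command pipeline is of course much shorter; the Java version is only meant to show what each stage contributes.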

Tuesday 25 March 2008

Favourite bash command line aliases

My favourite bash aliases currently are

alias hist='history|egrep'

and
alias ös='ls'

The second one exists because 'ö' sits next to 'l' on my Swedish keyboard, and when I intend to type 'ls' I type 'ös' more often than not. The one I use the most, however, is alias m='more' (I also have the classic mroe='more' and moer='more' to catch some frequent typos).

The first one, hist, makes it possible to use regular expressions to search the history of earlier shell commands. This is useful when you cannot remember some tricky command line sequence, or are too lazy to type some long command that you know you issued the other day.

For instance

hist 'java|ruby'

will print any previous command (in bash's history) containing either of the two strings.

(Well, I think you can accomplish the same thing using the original history command, but to paraphrase Morrissey, now my head is full, and my brain doesn't have room for more cryptic command line arguments.)

You can put your bash aliases in ~/.bashrc.

(Thanks to Chris for spotting a (now corrected) mistake in the first example. See the comment below.)

Update: Hey, check out the comment by Anonymous below: Ctrl-r seems useful for searching the Bash history!

Thursday 20 March 2008

Beware of Sun's Java equalsIgnoreCase --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
  Locale.setDefault(new Locale("tr"));
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I";
  String s2 = "ı";
  String s3 = "i";

  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}


Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false


I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish 'i', as well as "Latin" 'I' and Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. It's a little late for that now.
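One way to sidestep the difference, sketched below, is to never rely on the default locale when lowercasing. As far as I can tell from the Javadoc, toLowerCase() without arguments uses the default locale (hence the Turkish rules in the test above), while equalsIgnoreCase compares character by character without consulting any locale. The class name CaseInsensitive and the helper equalsLowerRoot are made up for this illustration, and whether locale-neutral rules are right for your data is an assumption you have to check yourself.

```java
import java.util.Locale;

public class CaseInsensitive {
    // Lowercase both strings with locale-neutral rules (Locale.ROOT),
    // so the result does not depend on the JVM's default locale.
    static boolean equalsLowerRoot(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Same default locale as in the test program above.
        Locale.setDefault(Locale.forLanguageTag("tr"));

        // Explicit locale: 'I' lowercases to 'i' regardless of the default.
        System.out.println(equalsLowerRoot("I", "i"));

        // Default (Turkish) locale: 'I' lowercases to dotless 'ı', so this differs.
        System.out.println("I".toLowerCase().equals("i"));
    }
}
```

Note that this helper is not a drop-in replacement for equalsIgnoreCase either: with Locale.ROOT, "I" and dotless "ı" are no longer considered equal, whereas equalsIgnoreCase reports them as equal in the output above.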

Tuesday 18 March 2008

Firebird vs Postgresql

We have similar databases running on MySql, Postgresql and Firebird. One of the reasons for moving away from MySql was the fact that the UTF8 support didn't work properly. I cannot remember the details, but it had to do with non-Latin-1 data, such as text in Czech or Russian. In some situations MySql refused to correctly identify equal UTF8 strings. You put in some word that you cannot retrieve again, bleh!

Furthermore, we've never understood how the user permissions are supposed to work in MySql (we always end up frantically running all possible variants of the GRANT ALL command).

We moved to Postgresql, which worked a lot better. Now we've started using Firebird, which also seems like a very nice piece of software.

Here is a list of a few things I've noticed when moving from Postgresql to Firebird:

* Firebird lacks built-in support for regular expressions. (We make heavy use of complex string searches of natural language data. If we hadn't had help from an expert, who helped us compile some user-defined functions (UDFs) for this purpose, this would have been a show-stopper.)

* Postgres' psql command line tool is better than Firebird's isql(-fb). (If you are a Windows user, see Carlos' comment below)

* Firebird database files grow and grow. This is true even if you delete data. You have to manually back up and restore a database to reclaim disk space. Maybe this is not a great problem in normal usage, but I noticed that the databases I run test suites against keep growing, though the test database itself is quite small (and the data are cleared out between test runs). [Update: Please notice that long-time users of Firebird insist that this is not a problem. See Carlos', Sergio Marcelo's and also Michal's comments below.]

* I've never had any luck installing Firebird from a Debian package. I have had to do a manual install to get it to work.

* Firebird has a useful GUI, FlameRobin, that lets you inspect and change your databases. FlameRobin comes with an editor useful for writing/editing stored procedures. The editor has code completion, which helps you with suggestions of table and column names and the like as you type.

* Firebird has a nice way to manage database files: all tables of a database end up in a single file, which you can name whatever you like and put wherever you like.

* It appears to be easier to find useful documentation for Postgres than for Firebird (but Firebird does have a nice FAQ site).


Answer to Darius Damalakas' comment below: I'm not the right person to comment on the performance of the different DBMSs. However, we haven't noticed any significant difference in performance between MySql, Postgresql and Firebird. Currently, the bottlenecks in our software are to be found outside of the databases, so the performance of the individual DBMSs has not been a big concern. They're all fast enough.

Firebird does seem to be a snappy system, and I would be surprised to find it performing worse than Postgres.

So far, the only difference in features that has mattered to us is the lack of built-in support for regular expressions in Firebird (see above). In all other respects (of importance to us), the functionality of Postgres and Firebird seems equivalent.

Update: Support for regular expressions is scheduled for the upcoming 2.5.0 release of Firebird.

Update: In response to an anonymous (and rather critical) comment, mariuz has added some useful links in a comment below.

Update: In a comment below, Michal has posted some information on DatabaseGrowthIncrement, taken from the release notes of Firebird 2.1.

Saturday 15 March 2008

Beware of Firebird 2.0 Debian package

We are migrating an application to the Firebird 2.0 database manager (firebirdsql.org). Our server runs Debian (AMD64), and we used the Firebird 2.0 (superserver) Debian package as suggested in the Firebird site's FAQ section. However, the package installation appears to have silently overlooked a dependency, leaving out a library necessary for getting the user-defined functions (UDFs) to work correctly. (Firebird didn't find the UDFs, resulting in runtime errors when calls to the functions were issued from a Firebird database.)

We made sure that Firebird as well as the UDFs were all compiled for AMD64.

When we uninstalled the apt-get Firebird package and manually installed Firebird 2.0.3 from the standard .tar.gz file, the missing dependency was spotted, and the database could be properly installed. Unfortunately, I didn't keep a record, but it might have been the correct version of libstdc++ that was missing.

As far as I can remember, this is the only time a Debian apt-get package has failed me. In addition to the fact that the apt-get install of Firebird might be broken, you have to be careful not to apt-get install "Firebird 2", since this will give you Firebird 1.5! Peculiar. (But see the comment from mariuz below.)

I had a similar experience the first time I tried to install Firebird from a Debian package. This was Firebird 1.5 (the Firebird 2 Debian package), before Firebird 2.0 was released. I never got that one to run either, but had to install the tar.gz version obtained from the official Firebird webserver. I can't remember exactly what went wrong, but it was impossible to get the Debian package we tried to work. The manual install worked perfectly, just as it did this time.

Update: The Debian Firebird2 package (containing Firebird 1.5) appears to be discontinued.

Don't concatenate Java strings using +=

The other day, I ran into a Java performance problem. It was an extremely simple Scanner loop, reading a file of some 20,000 lines of text, concatenating the lines into one single string:

Scanner sc = new Scanner(new File(fName), "UTF8");
String result = "";
while(sc.hasNextLine())
{
result += sc.nextLine(); //Avoid this!
}

// Do something with result


The above loop took an incredibly long time to finish, and I had no clue what could possibly be wrong. A colleague glanced at the code and said "StringBuilder". I had forgotten about the poor performance of string concatenation using += (or +). I must have thought that this was a problem of the past.

Replacing the += with a StringBuilder resulted in excellent performance:

Scanner sc = new Scanner(new File(fName), "UTF8");
StringBuilder result = new StringBuilder();
while (sc.hasNextLine())
{
result.append(sc.nextLine());
}

// Do something with result.toString()

Update: ttaveira points out that you may gain some additional speed by initializing the StringBuilder to a suitable capacity. See the comment below.
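A minimal sketch of that idea follows. It uses the file size in bytes as a rough guess for the number of characters; that guess, the class name ReadAll and the method readLines are assumptions made for this illustration, not something taken from the comment. It also assumes the file fits comfortably in memory.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ReadAll {
    // Read all lines of a file into one string, dropping the line breaks
    // just like the loops above do.
    static String readLines(File file) throws FileNotFoundException {
        // Assumption: the byte length is a good-enough upper bound for the
        // number of chars; cap it so the cast to int cannot overflow.
        int capacity = (int) Math.min(file.length(), Integer.MAX_VALUE - 8);
        StringBuilder result = new StringBuilder(capacity);
        // try-with-resources closes the Scanner (and the file) when done.
        try (Scanner sc = new Scanner(file, "UTF-8")) {
            while (sc.hasNextLine()) {
                result.append(sc.nextLine());
            }
        }
        return result.toString();
    }

    // Usage: java ReadAll FILE
    public static void main(String[] args) throws FileNotFoundException {
        if (args.length > 0) {
            System.out.println(readLines(new File(args[0])).length());
        }
    }
}
```

Presizing only saves the StringBuilder from growing its internal buffer repeatedly; the big win still comes from dropping += in the first place.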