Tuesday 22 April 2008

Keeping empty fields when splitting tab separated lines in Java

Frequently, I process text files containing tab separated data. Sometimes these have empty columns, i.e., two or more tabs without any data between them. More often than not, I want to keep the empty fields. However, Java's String.split defaults to removing empty fields.

This is what you do to keep the empty fields:

String[] fields = string.split("\t", -1)

In the following example, the test string tst will be split into zero parts (result1) and four parts (result2) respectively:
String tst = "\t\t\t";
String[] result1 = tst.split("\t"); //result1.length == 0
String[] result2 = tst.split("\t", -1); //result2.length == 4
result2 will contain four instances of the empty string ("").

The same thing goes when you split a string using a pre-compiled regular expression:
Pattern pattern = Pattern.compile("\t");
String[] result3 = pattern.split(tst); //result3.length == 0
String[] result4 = pattern.split(tst, -1); //result4.length == 4

By the way, I compared the performance of the two variants above (String's split and a pre-compiled pattern matching a tab). Luckily, the difference in performance was negligible, the compiled pattern winning with a small margin. When the split pattern is more complicated, I would expect bigger performance differences between compiled and uncompiled regular expressions. (Running Sun's java command with and without the server argument made a big difference, however. The default client was significantly slower.)

3 comments:

Anonymous said...

Thank you very much!

Moss said...

Thank you! I was going crazy trying to make sense of the Javadoc on this, but you answered the question I actually care about.

Anonymous said...

big thx, this is what i needed!