Nikoloogle Lindbloogle: Keeping empty fields when splitting tab separated lines in Java

Frequently, I process text files containing tab-separated data. Sometimes these have empty columns, i.e., two or more tabs without any data between them. More often than not, I want to keep the empty fields. However, Java's String.split by default removes trailing empty fields (empty fields between non-empty ones are kept).

This is what you do to keep the empty fields:

String[] fields = string.split("\t", -1);

In the following example, the test string tst will be split into zero parts (result1) and four parts (result2) respectively:

String tst = "\t\t\t";
String[] result1 = tst.split("\t");       //result1.length == 0
String[] result2 = tst.split("\t", -1);   //result2.length == 4

result2 will contain four instances of the empty string ("").
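A line that mixes filled and empty columns shows the difference more clearly. The following is only a minimal sketch: the class name SplitLimitDemo, its two helper methods and the sample line are made up for illustration.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Default limit (0): trailing empty fields are dropped.
    static String[] splitDefault(String line) {
        return line.split("\t");
    }

    // Negative limit: every field is kept, including trailing empty ones.
    static String[] splitKeepAll(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Hypothetical sample line with five columns; the 2nd, 4th and 5th are empty.
        String line = "id\t\tname\t\t";

        String[] dropped = splitDefault(line); // length 3: "id", "", "name"
        String[] kept = splitKeepAll(line);    // length 5: "id", "", "name", "", ""

        System.out.println(dropped.length + " " + Arrays.toString(dropped));
        System.out.println(kept.length + " " + Arrays.toString(kept));
    }
}
```

Note that the empty field between "id" and "name" survives in both cases. Only the empty fields at the end of the line depend on the limit argument.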

The same applies when you split a string using a pre-compiled regular expression:

// Requires: import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("\t");
String[] result3 = pattern.split(tst);     //result3.length == 0
String[] result4 = pattern.split(tst, -1); //result4.length == 4
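A pre-compiled pattern is most useful when the same split is applied to many lines. Here is a rough sketch of how that could look, assuming the lines have already been read into a list. The class name TsvRows, the method parse and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TsvRows {
    // Compiled once and reused for every line.
    private static final Pattern TAB = Pattern.compile("\t");

    // Splits each line on tabs with limit -1, so empty fields are kept.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(TAB.split(line, -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical three-column input with empty fields in different positions.
        List<String> lines = Arrays.asList("a\tb\tc", "a\t\t", "\t\tc");
        for (String[] row : parse(lines)) {
            System.out.println(row.length); // 3 for each of these lines
        }
    }
}
```

With the limit of -1, every row has the same number of fields as the line has columns, so a field can be looked up by column index without checking the array length first.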

By the way, I compared the performance of the two variants above (String's split and a pre-compiled pattern matching a tab). Luckily, the difference was negligible, with the compiled pattern winning by a small margin. When the split pattern is more complicated, I would expect bigger performance differences between compiled and uncompiled regular expressions. (Running Sun's java command with and without the -server option made a big difference, however. The default client VM was significantly slower.)

3 comments:

Anonymous said...: Thank you very much!; 13 November 2009 at 11:48
Moss said...: Thank you! I was going crazy trying to make sense of the Javadoc on this, but you answered the question I actually care about.; 8 March 2010 at 02:18
Anonymous said...: big thx, this is what i needed!; 10 December 2012 at 11:47

Tuesday, 22 April 2008