Showing posts with label java. Show all posts

Tuesday, 9 June 2009

Printing the Unicode code points of UTF8 characters (Scala)

Sometimes it is useful to be able to print the Unicode code point of a UTF-8 encoded character. (For instance, when you need to check whether you have mistakenly used a similar-looking character instead of the one you're supposed to use.)

Using the format method of Scala's RichString, you can create a zero-padded, four-digit hexadecimal string of a character's Unicode number, for example for the 'ä' character, like this:

scala> "%04X".format('ä'.toInt)
res0: String = 00E4

scala>


Here's a related example, printing a tab-separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for use in Scala/Java strings:
scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C

scala>
(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)

Knowing the codepoints can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:
scala> var v = "\u0278"
v: java.lang.String = ɸ

scala>



In Java, it looks similar, but you have to cast your chars to ints:

String.format("%04X", (int) 'ä'), etc.
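A slightly fuller Java sketch of the same idea follows. One thing the char-based versions above do not cover is characters outside the Basic Multilingual Plane, which take two chars (a surrogate pair) in a Java string. The class and method names here are made up for illustration, and String.codePoints() assumes Java 8 or later.

```java
public class CodePoints {
    // Formats a code point in the conventional U+XXXX notation.
    static String label(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Formats a code point as a Java/Scala string escape. Code points
    // above U+FFFF are written as two escapes (a surrogate pair).
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E4, U+0278 and the supplementary character U+1D538
        String text = "\u00E4\u0278\uD835\uDD38";
        // Iterate over code points rather than chars
        text.codePoints().forEach(cp ->
            System.out.println(label(cp) + "\t" + escape(cp)));
    }
}
```

For characters in the Basic Multilingual Plane, such as the IPA characters above, the two notations carry the same four hex digits.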

Friday, 5 September 2008

Case insensitive pattern matching of Unicode strings in Java

To do case-insensitive pattern matching of Unicode strings in Java, you can call Pattern.compile with a second argument, like this:

Pattern p = 
Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);


(This is useful when dealing with non-ASCII/non-Latin1 text, such as Cyrillic. However, it may not work flawlessly for the Turkish Unicode characters.)

Update: I just learned that there is a nicer way of doing this: start the patternString above with "(?iu)":
Pattern p = 
Pattern.compile("(?iu)"+ patternString);
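To see what the extra flag buys you, here is a small sketch that matches a lower-case Cyrillic pattern against upper-case input, with and without Unicode case folding. The class name, method names and sample word are my own choices for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    // Case-insensitive match for all Unicode letters (embedded flags i and u).
    static boolean matchesIgnoreCase(String patternString, String input) {
        return Pattern.compile("(?iu)" + patternString).matcher(input).matches();
    }

    // CASE_INSENSITIVE alone only folds case for US-ASCII letters.
    static boolean matchesAsciiIgnoreCase(String patternString, String input) {
        return Pattern.compile(patternString, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u043C\u043E\u0441\u043A\u0432\u0430"; // Cyrillic, lower case
        String upper = "\u041C\u041E\u0421\u041A\u0412\u0410"; // same word, upper case
        System.out.println(matchesIgnoreCase(lower, upper));        // true
        System.out.println(matchesAsciiIgnoreCase(lower, upper));   // false
        System.out.println(matchesAsciiIgnoreCase("java", "JAVA")); // true
    }
}
```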

Friday, 16 May 2008

Scala one-liner for upcasing lines of text

The following is a Scala script that up-cases each line of a UTF-8 encoded input file (args(0)) and prints the result to standard output:

import scala.io.Source

Console.setOut(new java.io.PrintStream(Console.out,true,"UTF8"))

Source.fromFile(args(0), "UTF8").getLines.foreach(line => print(line.toUpperCase))


If you're trusting the default character encoding to work for you, you may reduce it to:

import scala.io.Source

Source.fromFile(args(0)).getLines.foreach(line => print(line.toUpperCase))



Another way to do it is to read the lines into an iterator and use the iterator's .map method to upcase each line:

import scala.io.Source

val lines = Source.fromFile(args(0)).getLines.map(_.toUpperCase)

lines.foreach(print)



A Java programmer may be relieved (or horrified) to learn that Scala does not have any checked exceptions. There are only runtime exceptions, and you don't need to add any try/catch statements if you don't want to.

When you run a Scala script, you can instruct the Scala interpreter to compile the script and reuse the compiled version (a jar file) as long as it is newer than the source file. This gives better performance (shorter start-up time, etc.). To do this, use the -savecompiled command line argument.

Saturday, 10 May 2008

The Scala programming language and XML

The Scala programming language is a combined scripting and "proper" language, that sits on top of the Java VM. You can either run scripts similar to how you run a Ruby or Perl script, or compile your Scala classes to Java bytecode. You run a Scala application similar to how you run a Java application. You can also run a Scala application using the Java VM (but you have to add the Scala library jar file to your class path). You can mix Java and Scala programs, calling Scala objects from Java, and vice versa.

Scala has a feature that I have never seen in a language suitable for general programming: XML (processing) as a feature of the language. The people behind Scala have added XML to the syntax of the language itself. You do not have to load some library or use some special API for processing XML, since it's already part of the language.

It is not only that XML is valid in Scala code, but XML has its own built-in data types. For instance,

val xml = <vegetable>potato</vegetable>
is a valid Scala statement. In other words, XML elements written in a Scala program are not merely strings. The xml object can now be manipulated in various ways, much like a DOM object in Java (but with less hassle than in Java).

You can refer to variables in your XML:
val veg = "potato"
val col = "white"
val xml = <vegetable colour={col}>{veg}</vegetable>

// The value of xml now corresponds to
// <vegetable colour="white">potato</vegetable>



You can also embed function/method calls into XML elements. Imagine that you have a method that returns a sequence of n XML elements, like this (you'll need to import scala.xml.NodeSeq and scala.xml.NodeBuffer):

def genNumElems(n: Int): NodeSeq = {
  val result = new NodeBuffer
  for (i <- 1 to n) {
    result &+ (<number value={i.toString}/>)
  }
  result
}
(The odd-looking &+ operator means "add".)

You can now embed a call to genNumElems in an XML element, e.g., like this:
val numList = <number_list>{genNumElems(4)}</number_list>

Printing numList produces:
<number_list>
<number value="1"></number>
<number value="2"></number>
<number value="3"></number>
<number value="4"></number>
</number_list>


If you want nicer output, you can use a PrettyPrinter (that you import from scala.xml._):
val pp = new PrettyPrinter(100,2) // width and indentation
println(pp.format(xml))


Reading and writing XML data/files to and from Scala is easy. The following is a one-liner that reads an XML file given as a command line argument (args(0)) and returns a list of all elements named "tr" that are child elements of any element called "table" in the XML file:
val trNodes = scala.xml.XML.loadFile(args(0)) \\ "table" \ "tr"
You may print the <tr> elements (with an empty line between each element) thus:
trNodes.foreach(tr => println(tr + "\n"))

The built-in XML support in Scala's syntax and basic libraries is not the most important or interesting feature of Scala, but it sure seems to be very useful.

(Incidentally, the table and tr elements above are present in Oocalc's (OpenOffice.org) XML format for spreadsheets.)

Update: It appears that it is not always good advice to use scala.xml.XML.loadFile to read an XML document. One reason is that comments are lost. For more advanced XML processing, one should turn to scala.xml.parsing.ConstructingParser.fromFile.

Update: You may run into trouble when processing larger XML documents using the second approach. See this comment.

Tuesday, 22 April 2008

Keeping empty fields when splitting tab separated lines in Java

Frequently, I process text files containing tab-separated data. Sometimes these have empty columns, i.e., two or more tabs without any data between them. More often than not, I want to keep the empty fields. However, by default Java's String.split drops empty fields at the end of a line (empty fields in the middle are kept).

This is what you do to keep all the empty fields:

String[] fields = string.split("\t", -1);

In the following example, the test string tst will be split into zero parts (result1) and four parts (result2) respectively:
String tst = "\t\t\t";
String[] result1 = tst.split("\t"); //result1.length == 0
String[] result2 = tst.split("\t", -1); //result2.length == 4
result2 will contain four instances of the empty string ("").

The same thing goes when you split a string using a pre-compiled regular expression:
Pattern pattern = Pattern.compile("\t");
String[] result3 = pattern.split(tst); //result3.length == 0
String[] result4 = pattern.split(tst, -1); //result4.length == 4
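A row that mixes data and empty columns shows the difference more clearly than a string of only tabs. This is a small sketch; the class name, the helper method and the sample row are invented for the example.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits a tab-separated line and keeps every empty field,
    // including the ones at the end of the line.
    static String[] fields(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String row = "a\t\tb\t\t";
        // Default limit: the two trailing empty fields are dropped
        System.out.println(Arrays.toString(row.split("\t")));  // [a, , b]
        // Limit -1: all five fields are kept
        System.out.println(Arrays.toString(fields(row)));      // [a, , b, , ]
    }
}
```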

By the way, I compared the performance of the two variants above (String's split and a pre-compiled pattern matching a tab). Luckily, the difference in performance was negligible, with the compiled pattern winning by a small margin. When the split pattern is more complicated, I would expect bigger performance differences between compiled and uncompiled regular expressions. (Running Sun's java command with and without the -server argument made a big difference, however. The default client VM was significantly slower.)

Friday, 28 March 2008

No Web Start for 64-bit Sun Java

Sun does not include Java Web Start in its 64-bit version of Java. It appears that Sun thinks that you are not supposed to run Web Start on 64-bit machines, since these mostly are servers (?), and... eh... sorry, I cannot follow their reasoning. Let's hope they change their minds.

I haven't tried it myself, but here is a description of how to run 32-bit Java Web Start on 64-bit Ubuntu.

Update: At the time of writing this, an AMD64 version of Java Web Start is at the top of Sun's Request for Enhancements list.

Update: There will be support for 64-bit Java Web Start in an upcoming release, 1.6.0_12 (I think). Ismael Juma points out that an early access release is available. See his comment below.

Thursday, 20 March 2008

Beware of Sun's Java equalsIgnoreCase --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
  Locale.setDefault(new Locale("tr"));
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I";
  String s2 = "ı";
  String s3 = "i";

  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}


Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false


I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish 'i', as well as "Latin" 'I' and Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. It is a little late for that now.
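One possible explanation for the differing results is that equalsIgnoreCase compares character by character without consulting the locale, whereas toLowerCase() without arguments uses the default locale. If that is what is going on, passing an explicit locale makes the lower-casing comparison predictable. The following is only a sketch of that idea: the class and method names are made up, and Locale.ROOT assumes Java 6 or later (Locale.forLanguageTag assumes Java 7 or later).

```java
import java.util.Locale;

public class LocaleSafeCompare {
    // Lower-cases both strings with the locale-neutral Locale.ROOT, so
    // the result does not depend on Locale.getDefault().
    static boolean equalsLowerRoot(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // Lower-cases both strings with the locale given by the caller.
    static boolean equalsLower(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(equalsLowerRoot("I", "i"));            // true
        System.out.println(equalsLower("I", "i", turkish));       // false
        System.out.println(equalsLower("I", "\u0131", turkish));  // true (dotless i)
    }
}
```

Which of the two you want depends on the data: Locale.ROOT suits identifiers and keywords, while text that really is Turkish should probably be compared using the Turkish locale.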

Saturday, 15 March 2008

Don't concatenate Java strings using +=

The other day, I ran into a Java performance problem. It was an extremely simple Scanner loop, reading a file of some 20,000 lines of text, concatenating the lines into one single string:

Scanner sc = new Scanner(new File(fName), "UTF8");
String result = "";
while (sc.hasNextLine())
{
    result += sc.nextLine(); // Avoid this!
}

// Do something with result


The above loop took an incredibly long time to finish, and I had no clue what could possibly be wrong. A colleague glanced at the code and said "StringBuilder". I had forgotten about the poor performance of string concatenation using += (or +) in a loop, where each iteration copies the whole string built so far. I must have thought that this was a problem of the past.

Replacing the += concatenation with a StringBuilder resulted in excellent performance:

Scanner sc = new Scanner(new File(fName), "UTF8");
StringBuilder result = new StringBuilder();
while (sc.hasNextLine())
{
    result.append(sc.nextLine());
}

// Do something with result.toString

Update: ttaveira points out that you may gain some additional speed by initializing the StringBuilder to a suitable capacity. See the comment below.
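A sketch of the capacity suggestion might look like this. The class name, the method name and the expectedChars parameter are hypothetical, and the Scanner reads from a string here only to keep the example self-contained. For a file, something like its length in bytes could serve as a rough estimate.

```java
import java.util.Scanner;

public class JoinLines {
    // Concatenates all lines from the scanner. expectedChars is only a
    // hint for the initial buffer size; the builder grows if it is too small.
    static String joinLines(Scanner sc, int expectedChars) {
        StringBuilder result = new StringBuilder(expectedChars);
        while (sc.hasNextLine()) {
            result.append(sc.nextLine()); // line terminators are not included
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "first\nsecond\nthird\n";
        try (Scanner sc = new Scanner(input)) {
            System.out.println(joinLines(sc, input.length())); // firstsecondthird
        }
    }
}
```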

Friday, 15 February 2008

Reading/writing non-default character encoded data in Java

When the default (system) character encoding differs from the desired character encoding of the output data, you can use System.setOut and System.setErr to install output streams that use the desired encoding. To read data in an encoding other than the default, you can tell e.g. the Scanner class which character encoding to expect.

The following could be used for reading and writing UTF8 data on a system where the default character encoding may be different from UTF8:


System.setOut(new PrintStream(System.out,true,"UTF8"));
System.setErr(new PrintStream(System.err,true,"UTF8"));

Scanner scanner = new Scanner(new File(fileName), "UTF8");

while (scanner.hasNextLine())
{
    // Read input lines,
    String line = scanner.nextLine();
    line = doSomething(line);
    // Write some output to STDOUT/STDERR
    System.out.println(line);
    ...
}


The boolean second argument of the PrintStream constructor activates autoflush. You can pass false if you do not want autoflush, but the argument cannot be left out when you wrap an existing stream and name an encoding.
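To check that the encoding argument really changes the bytes that are written, you can point a PrintStream at an in-memory buffer instead of System.out. This is a sketch with invented names (EncodingDemo, encode); it is not part of the program above.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    // Prints text through a PrintStream that uses the named encoding and
    // returns the bytes that ended up in the underlying stream.
    static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00E4"; // U+00E4, a with diaeresis
        System.out.println(encode(text, "UTF-8").length);      // 2
        System.out.println(encode(text, "ISO-8859-1").length); // 1
    }
}
```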

Sunday, 25 November 2007

Correct case with Java's Locale

In Turkish, the uppercase version of 'i' is 'İ' (not 'I'). The problem is that the Turkish and the "ordinary" Latin 'i' are the same character (the same Unicode code point). If you upcase the 'i' in a Turkish context using the default settings, you might get the wrong letter.

In Java, you can use the Locale class to get this right:


Locale tr = new Locale("tr"); //Turkish
String trI = "i".toUpperCase(tr);
System.out.println(trI);

The above code outputs

İ
(and not 'I').
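If your console cannot display the dotted capital, printing code points makes the result easier to verify. Here is a sketch along those lines; the class and helper names are made up, and Locale.forLanguageTag assumes Java 7 or later (new Locale("tr") as above should behave the same).

```java
import java.util.Locale;

public class TurkishCase {
    // Lists the code points of s in U+XXXX notation, so the result can
    // be checked without relying on the console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", s.codePointAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");
        System.out.println(codePoints("i".toUpperCase(tr)));          // U+0130, dotted capital I
        System.out.println(codePoints("i".toUpperCase(Locale.ROOT))); // U+0049, plain capital I
        System.out.println(codePoints("I".toLowerCase(tr)));          // U+0131, dotless small i
    }
}
```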

Be aware that comparing Turkish strings may not work flawlessly. See also this post.

You should also notice that changing the Locale does other things as well. For instance you might end up getting error messages in Turkish...