Tuesday 9 June 2009

Printing the Unicode code points of UTF8 characters (Scala)

Sometimes it is useful to be able to print the Unicode code point of a UTF8 character. (For instance, when you need to check if you mistakenly use a similar looking character instead of the one you're supposed to use.)

Using Scala's RichString's format method, you can create a string of a zero padded, four digit, hexadecimal Unicode number, for example of the 'ä' character, like this:

scala> "%04X".format('ä'.toInt)
res0: String = 00E4

scala>


Here's a related example, printing a tab separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for using in Scala/Java strings:
scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C

scala>
(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)

Knowing the codepoints can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:
scala> var v = "\u0278"
v: java.lang.String = ɸ

scala>



In Java, it looks similar, but you have to cast your chars to ints:

String.format("%04X", (int) 'ä'), etc.

2 comments:

Daniel said...

On a side note, RichInt has the methods toHexString, toOctalString and toBinaryString.

Anonymous said...

Apologies, because this is slightly pedantic, but I still think it's worth pointing out that this may not always show the Unicode code points, nor are they UTF-8 characters. Instead, it shows the code units in Java's UTF-16 representation.

For example, the String "\uD801\uDC00" has two code units, but represents one code point U+010400, and is encoded as four bytes in UTF-8.