Sometimes it is useful to be able to print the Unicode code point of a UTF-8 character. (For instance, when you need to check whether you have mistakenly used a similar-looking character instead of the one you're supposed to use.)
Using the format method of Scala's RichString, you can create a zero-padded, four-digit hexadecimal string of a character's Unicode number, for example for the character 'ä', like this:
scala> "%04X".format('ä'.toInt)
res0: String = 00E4
scala>
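To illustrate the look-alike check mentioned at the start, here is a minimal sketch that compares the Latin letter a with the Cyrillic letter а, which look the same in most fonts. The object name LookAlikes and the helper hex are made up for this example; only format and toInt come from the post.

```scala
object LookAlikes {
  // Hypothetical helper: format one Char as a zero-padded, four-digit hex number.
  def hex(c: Char): String = "%04X".format(c.toInt)

  def main(args: Array[String]): Unit = {
    val latin = 'a'           // LATIN SMALL LETTER A
    val cyrillic = '\u0430'   // CYRILLIC SMALL LETTER A
    println(hex(latin))         // 0061
    println(hex(cyrillic))      // 0430
    println(latin == cyrillic)  // false
  }
}
```

The two characters render alike, but the hex values make the difference obvious.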
Here's a related example, printing a tab-separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for use in Scala/Java strings. (The line-terminating backslashes in the Scala code are added to indicate that it is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)
scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C
scala>
Knowing the code points can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:
scala> var v = "\u0278"
v: java.lang.String = ɸ
scala>
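If you need this for whole strings, the same format call can be wrapped in a small function. The following is only a sketch: AsciiEscape and escape are names invented here, and the cutoff at 128 (keep ASCII, escape everything else) is an assumption about what you want to keep readable.

```scala
object AsciiEscape {
  // Hypothetical helper: keep ASCII chars as they are and replace every
  // other Char (a UTF-16 code unit) with a backslash-u escape of four hex digits.
  def escape(s: String): String =
    s.map(c => if (c < 128) c.toString else "\\u%04X".format(c.toInt)).mkString

  def main(args: Array[String]): Unit = {
    // Prints the letter f followed by the six-character escape for U+0278.
    println(escape("f\u0278"))
  }
}
```

Because it works one Char at a time, a character outside the Basic Multilingual Plane comes out as two escapes (a surrogate pair), which is also how such a character is written in a Scala or Java string literal.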
In Java, it looks similar, but you have to cast your chars to ints, for example:
String.format("%04X", (int) 'ä')
2 comments:
On a side note, RichInt has the methods toHexString, toOctalString and toBinaryString.
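A quick sketch of those three methods, applied to the code of 'ä' from the post (the object name RadixStrings is made up for this example):

```scala
object RadixStrings {
  def main(args: Array[String]): Unit = {
    val n = '\u00E4'.toInt     // 228, the character 'ä'
    println(n.toHexString)     // e4
    println(n.toOctalString)   // 344
    println(n.toBinaryString)  // 11100100
  }
}
```

Unlike "%04X".format, toHexString gives lowercase digits and does no zero padding.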
Apologies, because this is slightly pedantic, but I still think it's worth pointing out that this may not always show the Unicode code points, nor are they UTF-8 characters. Instead, it shows the code units in Java's UTF-16 representation.
For example, the String "\uD801\uDC00" has two code units, but represents one code point U+010400, and is encoded as four bytes in UTF-8.
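The difference can be seen with a few methods of java.lang.String and java.lang.Character. This is a minimal sketch, not the post author's code; the object CodePoints and its helper codePoints are names invented for the example.

```scala
object CodePoints {
  // Hypothetical helper: format each Unicode code point of s
  // (not each UTF-16 code unit) as U+XXXX.
  def codePoints(s: String): List[String] = {
    val out = List.newBuilder[String]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)      // combines a surrogate pair into one Int
      out += "U+%04X".format(cp)
      i += Character.charCount(cp)   // advance by 1 or 2 code units
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val s = "\uD801\uDC00"
    println(s.length)                       // 2 (UTF-16 code units)
    println(s.codePointCount(0, s.length))  // 1 (code point)
    println(s.getBytes("UTF-8").length)     // 4 (bytes in UTF-8)
    println(codePoints(s).mkString(" "))    // U+10400
  }
}
```

For characters in the Basic Multilingual Plane, such as 'ä', both approaches give the same number, so the post's examples are unaffected.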