Thursday 6 May 2010

Using the Scala REPL to tell the difference between ЕКАТEРИНБУРГ and ЕКАТЕРИНБУРГ

Sometimes one runs into Unicode strings that mix characters from different Unicode blocks. This is a problem when the glyphs look the same but the underlying characters are different. The Scala REPL is handy for finding out which Unicode block each character in a string belongs to. Let's use "ЕКАТEРИНБУРГ" and "ЕКАТЕРИНБУРГ" as examples:

scala> "ЕКАТEРИНБУРГ" == "ЕКАТЕРИНБУРГ"
res0: Boolean = false

scala> import java.lang.Character.UnicodeBlock
import java.lang.Character.UnicodeBlock

scala> "ЕКАТEРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
E BASIC_LATIN
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

scala> "ЕКАТЕРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
Е CYRILLIC
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

The REPL exposed one of the seemingly identical strings to be an unhealthy mix of Latin and Cyrillic characters. Thanks, REPL.
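
For longer strings, scanning the per-character output by eye gets tedious. Here is a minimal sketch of how the same check could be automated. The helper names `blocksOf` and `oddOnesOut` are made up for this illustration, and the approach assumes that the most common block in the string is the intended one, which will not hold for text that legitimately mixes scripts.

```scala
import java.lang.Character.UnicodeBlock

// Hypothetical helper: pairs each character with its index and Unicode block.
def blocksOf(s: String): Seq[(Int, Char, UnicodeBlock)] =
  s.zipWithIndex.map { case (c, i) => (i, c, UnicodeBlock.of(c)) }

// Hypothetical helper: returns the characters whose block differs from the
// most common block in the string (assumed here to be the intended one).
def oddOnesOut(s: String): Seq[(Int, Char, UnicodeBlock)] = {
  val tagged = blocksOf(s)
  if (tagged.isEmpty) Seq.empty
  else {
    val majority = tagged.groupBy(_._3).maxBy(_._2.size)._1
    tagged.filter(_._3 != majority)
  }
}

// \u0045 is the Latin capital E, written as an escape so that it cannot be
// confused with the Cyrillic Е (U+0415) in the source code itself.
oddOnesOut("ЕКАТ\u0045РИНБУРГ").foreach { case (i, c, b) =>
  println(f"index $i: $c (U+${c.toInt}%04X) $b")
}
// prints: index 4: E (U+0045) BASIC_LATIN
```

Like the REPL session above, this works on `Char` values, so characters outside the Basic Multilingual Plane (which occupy two `Char`s each) would need a code-point-based variant.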