Sometimes, one runs into UTF-8 strings with characters from different code blocks. This is problematic in cases where the fonts look the same, but the characters are different. The Scala REPL is handy for finding out what Unicode block each character in a string belongs to. Let's use "ЕКАТEРИНБУРГ" and "ЕКАТЕРИНБУРГ" as examples:
scala> "ЕКАТEРИНБУРГ" == "ЕКАТЕРИНБУРГ"The REPL exposed one of the seemingly identical strings to be an unhealthy mix of Latin and Cyrillic characters. Thanks, REPL.
res0: Boolean = false
scala> import java.lang.Character.UnicodeBlock
import java.lang.Character.UnicodeBlock
scala> "ЕКАТEРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
E BASIC_LATIN
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC
scala> "ЕКАТЕРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
Е CYRILLIC
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC
scala>
No comments:
Post a Comment