Tuesday, 6 November 2007

Unicode mystery: Identical typeface, different characters

Do the following two strings look identical to you?


Sorry, but they aren't. Not one of the corresponding characters in those strings is equal, except to the eye. For example, 'y' and 'у' have different Unicode code points (U+0079 and U+0443).

The fact is that both strings contain a mix of Latin and Cyrillic characters, but not the same mix. A number of Cyrillic and Latin characters look the same but are different characters. Comparing these strings using, for example, Java's String.equals method will return false. Sorting strings that mix scripts (but use identical-looking glyphs) will produce odd results. Latin P (U+0050) is quite different from the Russian Р (U+0420), etc.
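To make the comparison and sorting behaviour concrete, here is a minimal Java sketch. The class name LookalikeDemo, its helper methods, and the sample words are made up for illustration; only the code points come from the post. The lookalike characters are written as \u escapes so the file's own encoding cannot interfere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not from the original post.
class LookalikeDemo {
    static final String LATIN_Y = "\u0079";     // Latin small letter y
    static final String CYRILLIC_U = "\u0443";  // Cyrillic small letter u, looks like y

    /** Formats every code point of s as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Returns the words sorted by String's natural (UTF-16 code unit) order. */
    static List<String> sorted(String... words) {
        List<String> list = new ArrayList<>(Arrays.asList(words));
        Collections.sort(list);
        return list;
    }

    public static void main(String[] args) {
        // The two letters render alike but are different characters.
        System.out.println(LATIN_Y.equals(CYRILLIC_U));          // prints false
        System.out.println(codePoints(LATIN_Y + CYRILLIC_U));    // prints U+0079 U+0443

        // "\u0420aris" starts with Cyrillic Er (U+0420), not Latin P (U+0050),
        // so it sorts after "Zurich" instead of next to "Paris".
        System.out.println(sorted("Zurich", "\u0420aris", "Paris"));
    }
}
```

Dumping the code points like this is usually the quickest way to find out why two "equal" strings refuse to match.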

Once you give it a bit of thought, it is not strange at all, but the first time you run into the problem, it can be quite tough to figure out. It's like the first time you accidentally activated the "Insert" key on your keyboard, and thought that your computer was broken. (By the way: "Insert", "Scroll Lock", "Pause/Break"... What on earth are these keys doing on my keyboard...?!)

If you are dealing with (language) data in different scripts, it's wise to validate your strings to ensure that, e.g., a Russian string contains only Cyrillic characters, and that the Western European ones contain only Latin characters (see this related post).
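One possible way to do such a validation in Java is sketched below, using the standard Character.UnicodeScript enum. The class ScriptValidator and the method isOnlyScript are hypothetical names, and letting COMMON and INHERITED characters through (spaces, digits, punctuation, combining marks) is an assumption that may be too lenient for some data.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper; a sketch, not a complete validator.
class ScriptValidator {

    /**
     * Returns true if every code point in text belongs to the expected script.
     * Assumption: COMMON and INHERITED code points (spaces, digits,
     * punctuation, combining marks) are accepted in any script.
     */
    static boolean isOnlyScript(String text, UnicodeScript expected) {
        return text.codePoints().allMatch(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            return script == expected
                    || script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
        });
    }

    public static void main(String[] args) {
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440!";
        // Same greeting, but the fifth letter is Latin e (U+0065) instead of Cyrillic ie (U+0435).
        String mixed = "\u041F\u0440\u0438\u0432e\u0442";

        System.out.println(isOnlyScript(russian, UnicodeScript.CYRILLIC));      // prints true
        System.out.println(isOnlyScript(mixed, UnicodeScript.CYRILLIC));        // prints false
        System.out.println(isOnlyScript("Hello, world!", UnicodeScript.LATIN)); // prints true
    }
}
```

A check like this flags the stray lookalike at the point where the data comes in, rather than later when a comparison or a sort misbehaves.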

And don't miss the 'A' example in the comment below!


Hanna Lindgren said...

How do you like these three fellows, then?

Russian: А (U+0410)
Greek: Α (U+0391)
Latin: A (U+0041)

(Unicode code point in brackets)

Nikolaj Lindberg said...

I hte them, ll three of them. I would never use ny of those bsurd chrcters.