Tuesday, 24 March 2009

The perils of changing the case of UTF8 strings

Below are a few examples of what happens to some slightly exotic UTF8 strings when they are up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters (SS) when up-cased. The Greek Sigma has one uppercase variant (Σ), but two different lowercase versions: one word-final (ς) and one for other positions (σ), which explains my not-so-very-amusing joke in an earlier post.
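A minimal sketch of the ß and Sigma round trips, written in Java since Scala strings are Java strings. The class and method names (`CaseRoundTrip`, `upDown`) are made up for illustration, and `Locale.ROOT` is an assumption chosen to keep the result independent of the machine's default locale.

```java
import java.util.Locale;

class CaseRoundTrip {
    // Upper-case a string and then lower-case the result.
    // Locale.ROOT is an assumption: it avoids depending on the default locale.
    static String upDown(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String eszett = "stra\u00DFe";   // German word containing U+00DF (Eszett)
        String finalSigma = "\u03C2";    // Greek word-final sigma, U+03C2, on its own

        // The Eszett is expanded to "SS" on the way up and never restored.
        System.out.println(eszett + " -> " + upDown(eszett));

        // A lone capital Sigma has no word context, so it comes back as U+03C3.
        System.out.println(finalSigma + " -> " + upDown(finalSigma));
    }
}
```

With these inputs the Eszett string grows from six to seven characters, while the lone word-final sigma keeps its length but comes back as the non-final form.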

In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).
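To see where the extra characters come from, it can help to print the individual UTF-16 code units. The following is only a sketch: `GreekExpansion` and `codeUnits` are hypothetical names, and `Locale.ROOT` is again an assumption.

```java
import java.util.Locale;

class GreekExpansion {
    // Render each UTF-16 code unit as U+XXXX so that combining marks become visible.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String orig = "\u03B0\u0390"; // the two accented Greek lowercase letters
        String up = orig.toUpperCase(Locale.ROOT);
        String upDown = up.toLowerCase(Locale.ROOT);

        System.out.println(orig.length() + ": " + codeUnits(orig));
        System.out.println(up.length() + ": " + codeUnits(up));
        System.out.println(upDown.length() + ": " + codeUnits(upDown));
    }
}
```

Each letter is up-cased to a base letter followed by two combining marks (U+0308 and U+0301). Down-casing only lowers the base letter, so the marks stay separate and the string remains six code units long.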

Last, the Turkish variants of <i>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.
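The effect of the locale can be sketched as follows. The sample string (a dotless ı followed by a dotted i) is my own choice for illustration and not necessarily the one used in the table; `TurkishCase` and `upDown` are hypothetical names.

```java
import java.util.Locale;

class TurkishCase {
    // Upper-case and then lower-case a string under the given locale's rules.
    static String upDown(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String orig = "\u0131i"; // dotless i (U+0131) followed by dotted i

        // Without Turkish rules both letters are up-cased to a plain "I",
        // so the distinction between them is lost on the way back down.
        System.out.println("root: " + upDown(orig, Locale.ROOT));

        // With Turkish rules the dotted i is up-cased to U+0130 (dotted capital I),
        // and both letters survive the round trip.
        System.out.println("tr:   " + upDown(orig, turkish));
    }
}
```

Note that the Turkish rules are not a cure-all: under "tr", a plain capital I is lower-cased to the dotless ı, which is right for Turkish text and wrong for most other languages.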

The table was generated using Scala (thus Java) strings. The column EqIgnoreCase reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's equalsIgnoreCase. The two rightmost columns present the length of the string before and after changing the case up and down again.

Orig | UpCase ↑ | UpDown ⇅ | EqIgnoreCase | OrigLen | NewLen
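The columns described above could be computed along these lines. This is a rough Java sketch, not the code that actually generated the table: `CaseTableRow`, `row`, the explicit locale parameter, and the " | " separator are all assumptions.

```java
import java.util.Locale;

class CaseTableRow {
    // Build one row: original, up-cased, up-then-down-cased,
    // equalsIgnoreCase(original, round-tripped), and the two lengths.
    static String row(String orig, Locale locale) {
        String up = orig.toUpperCase(locale);
        String upDown = up.toLowerCase(locale);
        boolean eqIgnoreCase = orig.equalsIgnoreCase(upDown);
        return String.join(" | ",
                orig,
                up,
                upDown,
                String.valueOf(eqIgnoreCase),
                String.valueOf(orig.length()),
                String.valueOf(upDown.length()));
    }

    public static void main(String[] args) {
        // U+00DF is the German Eszett, U+03C2 the Greek word-final sigma.
        System.out.println(row("stra\u00DFe", Locale.ROOT));
        System.out.println(row("\u03C2", Locale.ROOT));
    }
}
```

For the Eszett string, `equalsIgnoreCase` reports false because the lengths differ (6 versus 7). For the lone final sigma it reports true, since both sigma forms share the same uppercase letter, even though the round-tripped string is not identical to the original.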

The lesson? Nothing special. That you can do terrible things to strings. That changing the case of a string may be an irreversible operation. That if you are to normalize some text into either lowercase or uppercase, you might need to decide which is most suitable for a given language. That it might be a good idea to keep the original strings after normalization. That using the correct locale might help. That I'm not a graphic designer (the table is hideous).


Landei said...

Your remarks about ß are not entirely true:


But I have to admit that the capital ß isn't widely used yet.

Nikolaj Lindberg said...


thanks for your comment --- I didn't know about ẞ!

That's great to know, since it might help when you need to handle up-casing ß. (Using ẞ, you don't have to introduce scary symbols of your own.)