Below are a few examples of what happens to some just slightly exotic UTF-8 strings when up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters. The Greek Sigma has one uppercase variant, but two different lowercase versions: one word-final (ς) and one for other positions (σ), which explains my not-so-very-amusing joke in an earlier post.
In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).
Last, the Turkish variants of <i>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.
The table was generated using Scala (thus Java) strings. The column EqIgnoreCase reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's equalsIgnoreCase. The two rightmost columns present the length of the string before and after changing the case up and down again.
Orig | UpCase ↑ | UpDown ⇅ | EqIgnoreCase | OrigLen | NewLen |
---|---|---|---|---|---|
ß | SS | ss | false | 1 | 2 |
ςσ | ΣΣ | σς | true | 2 | 2 |
ΰΐ | Ϋ́Ϊ́ | ΰΐ | false | 2 | 6 |
iİıI | IİII | iiii | true | 4 | 4 |
iİıI | İİII | iiıı | true | 4 | 4 |
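
A rough sketch of how rows like these could be produced, written in plain Java since Scala strings are java.lang.String underneath. The class name CaseRoundTrip and the helpers upDown and row are made up for illustration, and Locale.ROOT stands in for whatever default locale the original run used, so this is not necessarily the code behind the table. The strings are written with Unicode escapes to avoid source-encoding surprises.

```java
import java.util.Locale;

public class CaseRoundTrip {

    // Up-case and then down-case a string using the given locale's rules.
    static String upDown(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    // Build one row: Orig | UpCase | UpDown | EqIgnoreCase | OrigLen | NewLen
    static String row(String orig, Locale locale) {
        String up = orig.toUpperCase(locale);
        String down = up.toLowerCase(locale);
        return String.join(" | ",
                orig, up, down,
                String.valueOf(orig.equalsIgnoreCase(down)),
                String.valueOf(orig.length()),
                String.valueOf(down.length()));
    }

    public static void main(String[] args) {
        // Assumption: Locale.ROOT replaces the (unknown) default locale of the original run.
        Locale neutral = Locale.ROOT;
        Locale turkish = Locale.forLanguageTag("tr");

        System.out.println(row("\u00DF", neutral));        // ß
        System.out.println(row("\u03C2\u03C3", neutral));  // ςσ
        System.out.println(row("\u03B0\u0390", neutral));  // ΰΐ
        System.out.println(row("i\u0130\u0131I", neutral)); // iİıI, no Turkish rules
        System.out.println(row("i\u0130\u0131I", turkish)); // iİıI, Turkish rules
    }
}
```

Passing the locale explicitly keeps the result independent of the machine's default locale, which is the difference between the last two rows. On newer JVMs the İ in the last-but-one row may come back as i followed by a combining dot above, so NewLen for that row can differ from the 4 shown in the table.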
2 comments:
Your remarks about ß are not entirely true:
http://en.wikipedia.org/wiki/Capital_%C3%9F
But I have to admit that the capital ß isn't widely used yet.
Landei,
thanks for your comment --- I didn't know about ẞ!
That's great to know, since it might help when you need to handle up-casing ß. (Using ẞ, you don't have to introduce scary symbols of your own.)
/nikolaj