Nikoloogle Lindbloogle: touppercase

Showing posts with label touppercase. Show all posts

Tuesday, 24 March 2009

The perils of changing the case of UTF8 strings

Below are a few examples of what happens to some just slightly exotic UTF8 strings when up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters. The Greek Sigma has one uppercase variant, but two different lowercase versions: one word final (ς); one for other positions (σ) (explaining my not-so-very-amusing joke in an earlier post).

In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).

Last, the Turkish variants of <i>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.

The table was generated using Scala (thus Java) strings. The column EqIgnoreCase reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's equalsIgnoreCase. The two rightmost columns present the length of the string before and after changing the case up and down again.

Orig	UpCase ↑	UpDown ⇅	EqIgnoreCase	OrigLen	NewLen
ß	SS	ss	false	1	2
ςσ	ΣΣ	σς	true	2	2
ΰΐ	Ϋ́Ϊ́	ΰΐ	false	2	6
iİıI	IİII	iiii	true	4	4
iİıI	İİII	iiıı	true	4	4

The lesson? Nothing special. That you can do terrible things to strings. That changing the case of strings may be an irreversible operation. That if you are to normalize some text into either lower or uppercase, you might need to decide what's most suitable for a given language. That it might be a good idea to keep the original strings after normalization. That using the correct locale might help. That I'm not a graphical designer (the table is hideous).

Sunday, 8 March 2009

Scala: Reversing a string by up- and then downcasing it

Did you know that you can reverse a string by merely upcasing it and then downcasing it again? Here's an example:

scala> val s = "ςσσ"
s: java.lang.String = ςσσ

scala> s.toUpperCase.toLowerCase == s.reverse.toString
res0: Boolean = true

scala>

If you don't believe me, just copy and paste the two lines of code above into the Scala interpreter, and see it for yourself.

Friday, 16 May 2008

Scala one-liner for upcasing lines of text

The following is a Scala script that up-cases each line of an UTF8 encoded input file (args(0)) and prints the result to standard output:

import scala.io.Source

Console.setOut(new java.io.PrintStream(Console.out,true,"UTF8"))

Source.fromFile(args(0), "UTF8").getLines.foreach(line => print(line.toUpperCase))

If you're trusting the default character encoding to work for you, you may reduce it to:


import scala.io.Source

Source.fromFile(args(0)).getLines.foreach(line => print(line.toUpperCase))

Another way to do it, is to read the lines into an iterator, using the iterator's .map method to upcase each line:


import scala.io.Source

val lines = Source.fromFile(args(0)).getLines.map(_.toUpperCase)

lines.foreach(print)

A Java programmer may be relieved (or horrified) to learn that Scala does not have any checked exceptions. There are only runtime exceptions, and you don't need to add any try/catch statements if you don't want to.

When you run a Scala script, you can instruct the Scala interpreter to compile the script, and use the compiled version (a jar file) if it's younger than the source-file. This gives better performance (shorter start-up, etc). You use the savecompiled command line argument.

Sunday, 25 November 2007

Correct case with Java's Locale

In Turkish, the uppercase version of 'i' is 'İ' (not 'I'). The problem is that the Turkish and the "ordinary" Latin 'i' is the same character (the same Unicode code point). If you upcase the 'i' in a Turkish context using the default settings, you might get the wrong letter.

In Java, you can use the Locale class to get this right:


Locale tr = new Locale("tr"); //Turkish
String trI = "i".toUpperCase(tr);
System.out.println(trI);

The above code outputs

İ

(and not 'I').

Be aware that comparing Turkish strings may not work flawlessly. See also this post.

You should also notice that changing the Locale does other things as well. For instance you might end up getting error messages in Turkish...

Nikoloogle Lindbloogle