Thursday, 20 March 2008

Beware of Sun's Java equalsIgnoreCase --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
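  // Make Turkish the default locale, so that toLowerCase() follows Turkish rules
  Locale.setDefault(new Locale("tr"));
  // Print as UTF-8 so that the dotless ı survives on the console
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I"; // U+0049, capital I
  String s2 = "ı"; // U+0131, small dotless i
  String s3 = "i"; // U+0069, small i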

  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}


Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false
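
For what it's worth, the fourth line can be traced to two different sets of case mappings. The sketch below assumes that equalsIgnoreCase compares character by character, roughly as its Javadoc describes, using the Character methods that take no locale, while the no-argument String.toLowerCase() uses the default locale. The class name CaseMappingDemo and the helper sameIgnoringCase are made up for illustration; this is not the JDK source.

```java
import java.util.Locale;

public class CaseMappingDemo
{
 // Hypothetical helper: a per-character, locale-free comparison,
 // modelled on the rule given in the equalsIgnoreCase Javadoc.
 public static boolean sameIgnoringCase(char c1, char c2)
 {
  return c1 == c2
   || Character.toUpperCase(c1) == Character.toUpperCase(c2)
   || Character.toLowerCase(c1) == Character.toLowerCase(c2);
 }

 // Lower-cases under Turkish rules, whatever the default locale is.
 public static String lowerTurkish(String s)
 {
  return s.toLowerCase(Locale.forLanguageTag("tr"));
 }

 public static void main(String[] args)
 {
  // The locale-free mappings pair I with i, and also I with dotless i.
  System.out.println(sameIgnoringCase('I', 'i'));      // true
  System.out.println(sameIgnoringCase('I', '\u0131')); // true

  // The Turkish mapping sends I to dotless i, so it no longer meets i.
  System.out.println(lowerTurkish("I").equals("\u0131")); // true
  System.out.println(lowerTurkish("I").equals("i"));      // false
 }
}
```

Under these assumptions the two expressions from the test program are simply answering different questions: one ignores the locale, the other obeys it.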


I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish 'i', as well as "Latin" 'I' and Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. It is a little late for that now.
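
Since the code points are shared, the characters cannot say which language they belong to, so the caller has to. Here is a minimal sketch of a comparison that takes the locale explicitly instead of relying on Locale.setDefault. The class LocaleAwareCompare and its equalsIgnoreCase(a, b, locale) helper are hypothetical names, and lower-casing both sides is only an approximation of proper case folding.

```java
import java.util.Locale;

public class LocaleAwareCompare
{
 // Hypothetical helper: lower-case both strings under the given locale,
 // then compare. The locale is chosen by the caller, not the JVM default.
 public static boolean equalsIgnoreCase(String a, String b, Locale locale)
 {
  return a.toLowerCase(locale).equals(b.toLowerCase(locale));
 }

 public static void main(String[] args)
 {
  Locale turkish = Locale.forLanguageTag("tr");

  // Locale-neutral rules: I and i are the same letter.
  System.out.println(equalsIgnoreCase("I", "i", Locale.ROOT)); // true

  // Turkish rules: I pairs with dotless i (U+0131), not with i.
  System.out.println(equalsIgnoreCase("I", "i", turkish));      // false
  System.out.println(equalsIgnoreCase("I", "\u0131", turkish)); // true

  // Turkish rules: dotted capital I (U+0130) pairs with i.
  System.out.println(equalsIgnoreCase("\u0130", "i", turkish)); // true
 }
}
```

Passing Locale.ROOT for identifiers, file names and protocol keywords, and the user's locale only for text the user actually wrote, would keep the two cases apart.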

3 comments:

Gökhan Ersümer said...

This is not Atatürk's fault. I think it is the fault of our short-sighted politicians, who don't pay attention to the international standardization of Turkish.

lowercase( I ) = ı
lowercase( İ ) = i
uppercase( i ) = İ
uppercase( ı ) = I
in Turkish. There is no confusion.

But on Windows, Microsoft uses
lowercase( İ ) = İ (no change!)
lowercase( I ) = i (wrong !)
uppercase( i ) = İ (Right, It's Incredible! )
uppercase( ı ) = ı (no change!)

Sarah said...

Just in case somebody stumbles across this, it's worth adding a little context. This is not actually a bug in Java: it's simply not possible to do completely accurate case folding across all languages without selecting a specific locale. The W3C has a good treatment that includes useful recommendations.

Anonymous said...

Well, the problem is that equalsIgnoreCase() just ignores the Locale that you set. You should go with toLowerCase().equals() if you're not using English, that's all. There is no BUG in Sun's Java or in Turkish ...