Nikoloogle Lindbloogle: Beware of Sun's Java equalsIgnoreCase --- Turkish example

Thursday, 20 March 2008

Beware of Sun's Java `equalsIgnoreCase` --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
  // Make Turkish the default locale; the no-argument toLowerCase() uses it.
  Locale.setDefault(new Locale("tr"));
  // Print as UTF-8 so the dotless 'ı' survives on the console.
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I"; // Latin capital I (U+0049)
  String s2 = "ı"; // dotless small i (U+0131)
  String s3 = "i"; // Latin small i (U+0069)

  // Same pair compared twice: equalsIgnoreCase, then lower-case-and-equals.
  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}

Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false

I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.
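To make the difference easier to reproduce, here is a minimal sketch that passes the locale explicitly instead of relying on the default. The class name LocaleAwareCompare and the helper equalsLowerCased are made up for this illustration, and Locale.ROOT is used here as a stand-in for "no language-specific rules"; treat it as one possible way to pin the behaviour down, not as a fix for every case.

```java
import java.util.Locale;

public class LocaleAwareCompare
{
 // Hypothetical helper: lower-cases both strings under an explicit locale, then compares.
 static boolean equalsLowerCased(String a, String b, Locale locale)
 {
  return a.toLowerCase(locale).equals(b.toLowerCase(locale));
 }

 public static void main(String[] args)
 {
  Locale turkish = Locale.forLanguageTag("tr");

  // Locale-neutral rules: "I" lower-cases to "i", so the strings match.
  System.out.println(equalsLowerCased("I", "i", Locale.ROOT));   // true
  // Turkish rules: "I" lower-cases to dotless "ı", so "i" no longer matches.
  System.out.println(equalsLowerCased("I", "i", turkish));       // false
  System.out.println(equalsLowerCased("I", "\u0131", turkish));  // true
  // equalsIgnoreCase takes no locale argument at all.
  System.out.println("I".equalsIgnoreCase("i"));                 // true
 }
}
```

With the locale spelled out at the call site, the result no longer changes when someone calls Locale.setDefault elsewhere in the program.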

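Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish dotted 'i' share one Unicode code point (U+0069), and "Latin" 'I' and Turkish dotless 'I' share another (U+0049). Only the Turkish-specific letters, dotless 'ı' (U+0131) and dotted 'İ' (U+0130), have code points of their own. Maybe they should all have been different characters. A little late for that now.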
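The shared code points are easy to see for yourself. The following is a small sketch, with the class name DottedAndDotlessI and the helper codePoint invented for the purpose, that prints each of the four letters together with what the locale-independent Character.toUpperCase and Character.toLowerCase make of it.

```java
public class DottedAndDotlessI
{
 // Hypothetical helper: formats a char as a code point label such as U+0049.
 static String codePoint(char c)
 {
  return String.format("U+%04X", (int) c);
 }

 public static void main(String[] args)
 {
  // Latin/Turkish 'I', Latin/Turkish 'i', dotted capital 'İ', dotless small 'ı'.
  char[] letters = { 'I', 'i', '\u0130', '\u0131' };
  for (char c : letters)
  {
   // The Character methods take no locale, so these mappings are the same everywhere.
   System.out.println(codePoint(c)
     + " upper=" + codePoint(Character.toUpperCase(c))
     + " lower=" + codePoint(Character.toLowerCase(c)));
  }
 }
}
```

On the char level, dotless 'ı' (U+0131) upper-cases to plain 'I' (U+0049) and dotted 'İ' (U+0130) lower-cases to plain 'i' (U+0069), which is why a comparison built on these per-character mappings treats all four letters as the same letter.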

3 comments:

Mad Grunt said...: This is not Atatürk's fault. I think it is the fault of our short-sighted politicians, who don't pay attention to the international standardization of Turkish.

lowercase( I ) = ı
lowercase( İ ) = i
uppercase( i ) = İ
uppercase( ı ) = I
in Turkish. There is no confusion.

But on Windows, Microsoft uses
lowercase( İ ) = İ (no change!)
lowercase( I ) = i (wrong !)
uppercase( i ) = İ (Right, It's Incredible! )
uppercase( ı ) = ı (no change!); 20 March 2008 at 22:35
Sarah said...: Just in case somebody stumbles across this, it's worth adding a little context. This is not actually a bug in Java: it's simply not possible to do completely accurate case folding across all languages without selecting a specific locale. The W3C has a good treatment that includes useful recommendations.; 5 May 2013 at 03:51
Anonymous said...: Well, the problem is that equalsIgnoreCase() simply ignores the Locale that you set. You should go with toLowerCase().equals() if you're not using English, that's all. There is no BUG in Sun's Java or in Turkish ...; 11 December 2014 at 14:21
