There appears to be a mistake in the implementation of String.equalsIgnoreCase
in Sun's Java.
Look what a colleague sent me (and see an earlier post on Turkish characters below):
import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
    public static final void main(final String[] args) throws Exception
    {
        // Make Turkish the default locale, so toLowerCase() uses Turkish case rules
        Locale.setDefault(new Locale("tr"));
        // Print as UTF-8 so the dotless 'ı' survives on the console
        System.setOut(new PrintStream(System.out, true, "UTF8"));

        String s1 = "I"; // capital I
        String s2 = "ı"; // dotless lowercase i (U+0131)
        String s3 = "i"; // dotted lowercase i

        System.out.println(s1 + "==" + s2 + "? " + s1.equalsIgnoreCase(s2));
        System.out.println(s1 + "==" + s2 + "? " + s1.toLowerCase().equals(s2.toLowerCase()));
        System.out.println();
        System.out.println(s1 + "==" + s3 + "? " + s1.equalsIgnoreCase(s3));
        System.out.println(s1 + "==" + s3 + "? " + s1.toLowerCase().equals(s3.toLowerCase()));
    }
}
Now, what do you think the above code prints? You would expect that
string1.equalsIgnoreCase(string2)
is exactly the same as
string1.toLowerCase().equals(string2.toLowerCase())
wouldn't you...?
Surprise, surprise. This is what the above code prints:
I==ı? true
I==ı? true

I==i? true
I==i? false
I bet Mustafa Kemal Atatürk didn't see that one coming!
The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.
Part of the problem when dealing with Turkish text (apart from the mistake in how Java's equalsIgnoreCase works) is that the "Latin" 'i' and the Turkish 'i', as well as the "Latin" 'I' and the Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. It's a little late for that now.
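For what it's worth, one way to stop the result from depending on the JVM's default locale is to pass the locale explicitly to toLowerCase(Locale). The sketch below is only an illustration: the class name CaseCompareSketch and the helpers equalsRoot and equalsInLocale are made up for this example, and it assumes you know up front whether the text should follow Turkish rules or not.

```java
import java.util.Locale;

// Hypothetical helper class, not part of the JDK or of the test program above.
class CaseCompareSketch {

    // Lower-cases both strings with the locale-neutral Locale.ROOT rules,
    // so the result does not change when the default locale is Turkish.
    static boolean equalsRoot(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // Lower-cases both strings with the rules of the given locale.
    static boolean equalsInLocale(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotlessI = "\u0131"; // dotless lowercase i, written as an escape

        System.out.println(equalsRoot("I", "i"));                   // true
        System.out.println(equalsInLocale("I", "i", turkish));      // false
        System.out.println(equalsInLocale("I", dotlessI, turkish)); // true
    }
}
```

Note that equalsRoot is not a drop-in replacement for equalsIgnoreCase: the two only agree for some inputs, so which comparison is right depends on whether the text is meant to be Turkish.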
3 comments:
This is not Atatürk's fault. I think it is the fault of our short-sighted politicians, who don't pay attention to the international standardization of Turkish.
lowercase( I ) = ı
lowercase( İ ) = i
uppercase( i ) = İ
uppercase( ı ) = I
in Turkish. There is no confusion.
But on Windows, Microsoft uses
lowercase( İ ) = İ (no change!)
lowercase( I ) = i (wrong !)
uppercase( i ) = İ (Right, It's Incredible! )
uppercase( ı ) = ı (no change!)
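As a rough check of the first list in this comment (the Turkish rules, not the Windows behaviour), the four mappings can be tried in Java by passing a Turkish locale explicitly. This is just a sketch: the class name TurkishCaseSketch and its lower and upper helpers are invented for illustration, and the special characters are written as Unicode escapes so the source file encoding doesn't matter.

```java
import java.util.Locale;

// Hypothetical demo class; only toLowerCase(Locale) and toUpperCase(Locale) are JDK calls.
class TurkishCaseSketch {

    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Lower-case using Turkish rules, regardless of the default locale.
    static String lower(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Upper-case using Turkish rules, regardless of the default locale.
    static String upper(String s) {
        return s.toUpperCase(TURKISH);
    }

    public static void main(String[] args) {
        String dottedCapitalI = "\u0130"; // capital I with dot above
        String dotlessSmallI = "\u0131";  // small dotless i

        // Each line prints true if Java agrees with the Turkish mapping listed above.
        System.out.println(lower("I").equals(dotlessSmallI));
        System.out.println(lower(dottedCapitalI).equals("i"));
        System.out.println(upper("i").equals(dottedCapitalI));
        System.out.println(upper(dotlessSmallI).equals("I"));
    }
}
```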
Just in case somebody stumbles across this, it's worth adding a little context. This is not actually a bug in Java: it's simply not possible to do completely accurate case folding across all languages without selecting a specific locale. The W3C has a good treatment that includes useful recommendations.
Well, the problem with equalsIgnoreCase() is that it simply ignores the Locale you set. If you're not working in English, you should go with toLowerCase().equals() instead, that's all. There is no BUG in Sun's Java, or in Turkish ...