Nikoloogle Lindbloogle: unicode

Showing posts with label unicode. Show all posts

Tuesday, 9 June 2009

Printing the Unicode code points of UTF8 characters (Scala)

Sometimes it is useful to be able to print the Unicode code point of a UTF8 character. (For instance, when you need to check if you mistakenly use a similar looking character instead of the one you're supposed to use.)

Using Scala's RichString's format method, you can create a string of a zero padded, four digit, hexadecimal Unicode number, for example of the 'ä' character, like this:

scala> "%04X".format('ä'.toInt)
res0: String = 00E4

scala>

Here's a related example, printing a tab separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for using in Scala/Java strings:

scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C

scala>

(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)

Knowing the codepoints can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:

scala> var v = "\u0278"
v: java.lang.String = ɸ

scala>

In Java, it looks similar, but you have to cast your chars to ints:

String.format("%04X", (int) 'ä'), etc.

Thursday, 5 March 2009

The Firebird database: Problem handling UTF8 characters

The 'Latin capital letter I with dot above', İ (Unicode 0130), strikes again! This innocent looking Turkish character seems to be reliable when it comes to breaking software that should be able to handle UTF8. (See also this post for a Java example.)

This time it breaks the Firebird database (in my case, v2.1.1 on a 64-bit Debian system). Downcasing some random characters in a database configured to handle UTF8 works fine:

SELECT LOWER('AӴЁΪΣƓ') FROM RDB$DATABASE

returns the expected string, aӵёϊσɠ.

However, when you throw in the trouble-making İ, everything blows up:

SELECT LOWER('AӴЁΪΣƓİ') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -104
Invalid token

Engine Code    : 335544849
Engine Message :
Malformed string

Slightly different input, generates a different error message:

SELECT LOWER('İA') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -802
Arithmetic overflow or division by zero has occurred.

Engine Code    : 335544321
Engine Message :
arithmetic exception, numeric overflow, or string truncation

There is an item on the Firebird user list, but without any answers so far.

Update: As mariuz points out in a comment below, this defect now seems to be fixed in an upcoming version. See this bug tracker item.

Friday, 5 September 2008

Case insensitive pattern matching of Unicode strings in Java

To make case insensitive pattern matching of Unicode strings in Java, you can call Pattern.compile with a second argument, like this:

Pattern p = 
Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

(This is useful when dealing with non-ASCII/non-Latin1 text, such as Cyrillic. However, it may not work flawlessly for the Turkish Unicode characters.)

Update: I just learned that there is a nicer way of doing this: start the patternString above with "(?iu)":

Pattern p = 
Pattern.compile("(?iu)"+ patternString);

Thursday, 20 March 2008

Beware of Sun's Java `equalsIgnoreCase` --- Turkish example

There appears to be a mistake in the implementation of String.equalsIgnoreCase in Sun's Java.

Look what a colleague sent me (and see an earlier post on Turkish characters below):

import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
  Locale.setDefault(new Locale("tr"));
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I";
  String s2 = "ı";
  String s3 = "i";

  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}

Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false

I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (except for the mistake in how Java's equalsIgnoreCase works), is that "Latin" 'i' and Turkish 'i' as well as "Latin" 'I' and Turkish 'I' share the same Unicode codepoints. Maybe they should have been different characters. A little late for that now.

Friday, 15 February 2008

Reading/writing non-default character encoded data in Java

When in an environment where the default (system) character encoding differs from the desired character encoding of the output data, you can use System.setOut and System.setErr. For reading data of a different character encoding than the default encoding, you can tell e.g. the Scanner class what character encoding to expect.

The following could be used for reading and writing UTF8 data on a system where the default character encoding may be different from UTF8:


System.setOut(new PrintStream(System.out,true,"UTF8"));
System.setErr(new PrintStream(System.err,true,"UTF8"));

Scanner scanner = new Scanner(new File(fileName), "UTF8");

while(scanner.hasNextLine())
   {
    // Read input lines,
    String line = scanner.nextLine();
    line = doSomething(line);
    // Write some output to STDOUT/STDERR
    System.out.println(line);
    ...
   }

The boolean flag of the second constructor argument of PrintStream activates autoflush, but one does not need to use this argument.

Sunday, 25 November 2007

Correct case with Java's Locale

In Turkish, the uppercase version of 'i' is 'İ' (not 'I'). The problem is that the Turkish and the "ordinary" Latin 'i' is the same character (the same Unicode code point). If you upcase the 'i' in a Turkish context using the default settings, you might get the wrong letter.

In Java, you can use the Locale class to get this right:


Locale tr = new Locale("tr"); //Turkish
String trI = "i".toUpperCase(tr);
System.out.println(trI);

The above code outputs

İ

(and not 'I').

Be aware that comparing Turkish strings may not work flawlessly. See also this post.

You should also notice that changing the Locale does other things as well. For instance you might end up getting error messages in Turkish...

Tuesday, 6 November 2007

Unicode mystery: Identical typeface, different characters

Do the following two strings look similar to you?

уeЕoОxХaАM
yеEоOхXаAМ

Sorry, but they aren't. Not one of the corresponding characters of those strings are equal, but for the eye. For example, 'y' and 'у' have different Unicode codes (0079 and 0443).

The fact is that both strings contain a mix of Latin and Cyrillic characters, but not the same mix. There are a number of Cyrillic and Latin characters that look the same, but aren't. Comparing these strings using, for example, Java's String.equals method will return false. Sorting strings of mixed character encodings (but with identically looking type faces) will produce odd results. Latin P is quite different from the Russian Р, etc.

Giving it a bit of thought, it is not strange at all, but the first time you run into the problem, it can be quite tough to figure it out. It's like the first time you accidentally activated the "Insert" key on you keyboard, and thought that your computer was broken. (By the way: "Insert", "Scroll Lock", "Pause/Break"... What on earth are these keys doing on my keyboard...?!)

If you are dealing with (language) data of different character encodings, it's wise to validate your strings, to ensure that, e.g., a Russian string contains only Cyrillic characters, and that the Western European ones contain only Latinos (see this related post).

And don't miss the 'A' example in the comment below!

Nikoloogle Lindbloogle