Nikoloogle Lindbloogle: utf8

Tuesday, 9 June 2009

Printing the Unicode code points of UTF8 characters (Scala)

Sometimes it is useful to be able to print the Unicode code point of a UTF8 character. (For instance, when you need to check if you mistakenly use a similar looking character instead of the one you're supposed to use.)

Using the format method of Scala's RichString, you can turn a character into a zero-padded, four-digit hexadecimal Unicode number. For the 'ä' character, it looks like this:

scala> "%04X".format('ä'.toInt)
res0: String = 00E4

scala>

Here's a related example, printing a tab-separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for use in Scala/Java strings:

scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C

scala>

(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)

Knowing the codepoints can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:

scala> var v = "\u0278"
v: java.lang.String = ɸ

scala>

In Java, it looks similar, but you have to cast your chars to ints:

String.format("%04X", (int) 'ä'), etc.
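A slightly fuller Java version of the IPA one-liner might look like the sketch below. The class and method names (CodePointPrinter, escape, table) are made up for this example, and the input is written with escapes so that the source file stays pure ASCII. Unlike the Scala snippet, it walks over code points rather than chars, so a character outside the Basic Multilingual Plane comes out as a surrogate pair of two escapes.

```java
import java.util.stream.Collectors;

class CodePointPrinter {
    // Formats one code point as a Java-style escape. BMP characters become a
    // single four-digit escape, supplementary ones a surrogate pair of two.
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    // Builds one "character<TAB>escape" line per code point in the input.
    static String table(String text) {
        return text.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)) + "\t" + escape(cp))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // The first two characters are the IPA symbols from the Scala example.
        System.out.println(table("\u0278\u03B2fv"));
    }
}
```

Whether the characters themselves show up correctly on the console depends on the output encoding; the escapes in the second column are plain ASCII either way.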

Tuesday, 24 March 2009

The perils of changing the case of UTF8 strings

Below are a few examples of what happens to some just slightly exotic UTF8 strings when up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters. The Greek Sigma has one uppercase variant, but two different lowercase versions: one word final (ς); one for other positions (σ) (explaining my not-so-very-amusing joke in an earlier post).

In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).

Last, the Turkish variants of <i>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.

The table was generated using Scala (thus Java) strings. The column EqIgnoreCase reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's equalsIgnoreCase. The two rightmost columns present the length of the string before and after changing the case up and down again.

Orig	UpCase ↑	UpDown ⇅	EqIgnoreCase	OrigLen	NewLen
ß	SS	ss	false	1	2
ςσ	ΣΣ	σς	true	2	2
ΰΐ	Ϋ́Ϊ́	ΰΐ	false	2	6
iİıI	IİII	iiii	true	4	4
iİıI	İİII	iiıı	true	4	4

The lesson? Nothing special. That you can do terrible things to strings. That changing the case of strings may be an irreversible operation. That if you are to normalize some text into either lower or uppercase, you might need to decide what's most suitable for a given language. That it might be a good idea to keep the original strings after normalization. That using the correct locale might help. That I'm not a graphic designer (the table is hideous).
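For the curious, the up-and-down round trip behind the table can be sketched in Java roughly as follows. This is not the code that generated the table; CaseRoundTrip, upDown and survives are names invented for the example, and Locale.forLanguageTag("tr") is used as one way of getting hold of a Turkish locale.

```java
import java.util.Locale;

class CaseRoundTrip {
    // Up-cases and then down-cases the text, using the given locale for both steps.
    static String upDown(String text, Locale locale) {
        return text.toUpperCase(locale).toLowerCase(locale);
    }

    // True if the text comes back unchanged from the round trip.
    static boolean survives(String text, Locale locale) {
        return text.equals(upDown(text, locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String eszett = "\u00DF";   // German sharp s
        String dotless = "\u0131";  // Turkish dotless i

        System.out.println(upDown(eszett, Locale.ROOT).length()); // 2: it has become "ss"
        System.out.println(survives(dotless, Locale.ROOT));       // false: it comes back as a dotted i
        System.out.println(survives(dotless, turkish));           // true
    }
}
```

Passing the locale explicitly, instead of relying on the default one, at least makes it visible in the code which set of case rules is being applied.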

Thursday, 5 March 2009

The Firebird database: Problem handling UTF8 characters

The 'Latin capital letter I with dot above', İ (Unicode 0130), strikes again! This innocent looking Turkish character seems to be reliable when it comes to breaking software that should be able to handle UTF8. (See also this post for a Java example.)

This time it breaks the Firebird database (in my case, v2.1.1 on a 64-bit Debian system). Downcasing some random characters in a database configured to handle UTF8 works fine:

SELECT LOWER('AӴЁΪΣƓ') FROM RDB$DATABASE

returns the expected string, aӵёϊσɠ.

However, when you throw in the trouble-making İ, everything blows up:

SELECT LOWER('AӴЁΪΣƓİ') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -104
Invalid token

Engine Code    : 335544849
Engine Message :
Malformed string

Slightly different input generates a different error message:

SELECT LOWER('İA') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -802
Arithmetic overflow or division by zero has occurred.

Engine Code    : 335544321
Engine Message :
arithmetic exception, numeric overflow, or string truncation

There is an item on the Firebird user list, but without any answers so far.

Update: As mariuz points out in a comment below, this defect now seems to be fixed in an upcoming version. See this bug tracker item.

Thursday, 20 March 2008

Beware of Sun's Java equalsIgnoreCase: Turkish example

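There appears to be a mistake, or at least a nasty surprise, in how String.equalsIgnoreCase behaves in Sun's Java: it compares character by character without consulting any locale, while the no-argument toLowerCase and toUpperCase use the default locale.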

Look what a colleague sent me (and see an earlier post on Turkish characters below):

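import java.io.PrintStream;
import java.util.Locale;

public class TestTur
{
 public static final void main(final String[] args) throws Exception
 {
  // Make Turkish the default locale; the no-argument toLowerCase() calls below depend on it.
  Locale.setDefault(new Locale("tr"));
  // Print as UTF8, whatever the default encoding of the system is.
  System.setOut(new PrintStream(System.out,true,"UTF8"));

  String s1 = "I";
  String s2 = "ı"; // Turkish dotless i, Unicode 0131
  String s3 = "i";

  // I versus dotless i: first equalsIgnoreCase, then down-casing both sides.
  System.out.println(s1+"=="+s2+"? "+s1.equalsIgnoreCase(s2));
  System.out.println(s1+"=="+s2+"? "+s1.toLowerCase().equals(s2.toLowerCase()));
  System.out.println();

  // I versus ordinary i: the same two comparisons.
  System.out.println(s1+"=="+s3+"? "+s1.equalsIgnoreCase(s3));
  System.out.println(s1+"=="+s3+"? "+s1.toLowerCase().equals(s3.toLowerCase()));
 }
}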

Now, what do you think the above code prints? You would expect that

string1.equalsIgnoreCase(string2)

is exactly the same as

string1.toLowerCase().equals(string2.toLowerCase())

wouldn't you...?

Surprise, surprise. This is what the above code prints:

I==ı? true
I==ı? true

I==i? true
I==i? false

I bet Mustafa Kemal Atatürk didn't see that one coming!

The above peculiarity did actually lead to some problems for us, so this is a practical problem rather than an academic one.

Part of the problem when dealing with Turkish text (apart from the surprising way Java's equalsIgnoreCase works) is that "Latin" 'i' and Turkish 'i', as well as "Latin" 'I' and Turkish 'I', share the same Unicode code points. Maybe they should have been different characters. A little late for that now.
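One way to sidestep the discrepancy is to never rely on the default locale, and instead down-case both strings with a locale you choose yourself. The sketch below assumes that plain down-casing is good enough for the comparison (it is not full Unicode case folding), and the class and helper names are invented for the example.

```java
import java.util.Locale;

class LocaleAwareCompare {
    // Compares two strings after down-casing both with the same, explicit locale.
    static boolean equalsIgnoreCase(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotlessI = "\u0131";

        System.out.println("I".equalsIgnoreCase("i"));                 // true, whatever the default locale is
        System.out.println(equalsIgnoreCase("I", "i", turkish));       // false: Turkish I down-cases to dotless i
        System.out.println(equalsIgnoreCase("I", dotlessI, turkish));  // true
        System.out.println(equalsIgnoreCase("I", "i", Locale.ROOT));   // true
    }
}
```

With this approach the result no longer changes when someone calls Locale.setDefault somewhere else in the program, which was the root of the surprise above.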

Tuesday, 18 March 2008

Firebird vs PostgreSQL

We have similar databases running on MySQL, PostgreSQL and Firebird. One of the reasons for moving away from MySQL was the fact that the UTF8 support didn't work properly. I cannot remember the details, but it had to do with non-Latin-1 data, such as text in Czech or Russian. In some situations MySQL failed to identify equal UTF8 strings as equal. You put in some word that you cannot retrieve again, bleh!

Furthermore, we've never understood how the user permissions are supposed to work in MySQL (we always end up frantically running all possible variants of the GRANT ALL command).

We moved to PostgreSQL, which worked a lot better. Now we've started using Firebird, which also seems like a very nice piece of software.

Here is a list of a few things I've noticed when moving from PostgreSQL to Firebird:

* Firebird lacks built-in support for regular expressions. (We make heavy use of complex string searches of natural language data. If we hadn't got help from an expert, who helped us compile some user defined functions, UDFs, for this purpose, this would have been a show-stopper.)

* Postgres' psql command line tool is better than Firebird's isql(-fb). (If you are a Windows user, see Carlos' comment below.)

* Firebird database files grow and grow. This is true even if you delete data. You have to manually back up and restore a database to reclaim disk space. Maybe this is not a great problem in normal usage, but I noticed that the databases I use for running test suites against keep growing, though the test database itself is quite small (and the data are cleared out between test runs). [Update: Please notice that long-time users of Firebird insist that this is not a problem. See Carlos', Sergio Marcelo's and also Michal's comments below.]

* I've never had any luck installing Firebird from a Debian package. I have had to do a manual install to get it to work.

* Firebird has a useful GUI, FlameRobin, that lets you inspect and change your databases. FlameRobin comes with an editor useful for writing/editing stored procedures. The editor has code completion, which helps you with suggestions of table and column names and the like as you type.

* Firebird has a nice way to manage database files: all tables of a database end up in a single file that you can name whatever you like, and put wherever you like.

* It appears to be easier to find useful documentation for Postgres than for Firebird (but Firebird does have a nice FAQ site).

Answer to Darius Damalakas' comment below: I'm not the right person to comment on the performance of the different DBMSs. However, we haven't noticed any significant difference in performance between MySQL, PostgreSQL and Firebird. Currently, the bottlenecks in our software are to be found outside of the databases, so the performance of the individual DBMSs has not been a big concern. They're all fast enough.

Firebird does seem to be a snappy system, and I would be surprised to find it performing worse than Postgres.

So far, the only difference in features that has mattered to us, is the lack of built-in support for regular expressions in Firebird (see above). In all other respects (of importance to us), the functionality of Postgres and Firebird seems equivalent.

Update: Support for regular expressions is scheduled for the upcoming 2.5.0 release of Firebird.

Update: In response to an anonymous (and rather critical) comment, mariuz has added some useful links in a comment below.

Update: In a comment below, Michal has posted some information on DatabaseGrowthIncrement, taken from the release notes of Firebird 2.1.

Friday, 15 February 2008

Reading/writing non-default character encoded data in Java

When in an environment where the default (system) character encoding differs from the desired character encoding of the output data, you can use System.setOut and System.setErr. For reading data of a different character encoding than the default encoding, you can tell e.g. the Scanner class what character encoding to expect.

The following could be used for reading and writing UTF8 data on a system where the default character encoding may be different from UTF8:


// Make STDOUT and STDERR write UTF8, whatever the default encoding is.
System.setOut(new PrintStream(System.out,true,"UTF8"));
System.setErr(new PrintStream(System.err,true,"UTF8"));

// Tell the Scanner that the input file is UTF8 encoded.
Scanner scanner = new Scanner(new File(fileName), "UTF8");

while(scanner.hasNextLine())
   {
    // Read an input line,
    String line = scanner.nextLine();
    line = doSomething(line);
    // write some output to STDOUT/STDERR
    System.out.println(line);
    // ...
   }

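The second constructor argument of PrintStream is a boolean flag that activates autoflush. The constructor that takes an output stream together with an encoding name always requires this argument, so pass false if you do not want autoflush.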
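To see what the encoding argument actually does, without involving files or the console, one can write to and read from byte arrays. The following is only a sketch: EncodingRoundTrip, write and readLine are names made up for the example, and the unsupported-encoding case is simply turned into an unchecked exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Scanner;

class EncodingRoundTrip {
    // Writes the text through a PrintStream using the named encoding
    // and returns the raw bytes that came out.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    // Reads the first line back, telling the Scanner which encoding to expect.
    static String readLine(byte[] bytes, String encoding) {
        try (Scanner scanner = new Scanner(new ByteArrayInputStream(bytes), encoding)) {
            return scanner.hasNextLine() ? scanner.nextLine() : "";
        }
    }

    public static void main(String[] args) {
        String text = "\u00E4"; // a with diaeresis
        System.out.println(write(text, "UTF-8").length);      // 2
        System.out.println(write(text, "ISO-8859-1").length); // 1
        System.out.println(text.equals(readLine(write(text, "UTF-8"), "UTF-8"))); // true
    }
}
```

The same character takes two bytes in UTF8 and one in Latin-1, which is exactly why reading with the wrong encoding produces garbage.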

Sunday, 25 November 2007

Correct case with Java's Locale

In Turkish, the uppercase version of 'i' is 'İ' (not 'I'). The problem is that the Turkish and the "ordinary" Latin 'i' are the same character (the same Unicode code point). If you upcase the 'i' in a Turkish context using the default settings, you might get the wrong letter.

In Java, you can use the Locale class to get this right:


Locale tr = new Locale("tr"); //Turkish
String trI = "i".toUpperCase(tr);
System.out.println(trI);

The above code outputs

İ

(and not 'I').

Be aware that comparing Turkish strings may not work flawlessly. See also this post.

You should also notice that changing the Locale does other things as well. For instance you might end up getting error messages in Turkish...

Tuesday, 6 November 2007

Unicode mystery: Identical typeface, different characters

Do the following two strings look similar to you?

уeЕoОxХaАM
yеEоOхXаAМ

Sorry, but they aren't. Not one pair of corresponding characters in those strings is equal, except to the eye. For example, 'y' and 'у' have different Unicode code points (0079 and 0443).

The fact is that both strings contain a mix of Latin and Cyrillic characters, but not the same mix. There are a number of Cyrillic and Latin characters that look the same, but aren't. Comparing these strings using, for example, Java's String.equals method will return false. Sorting strings that mix the two scripts (but look identical on screen) will produce odd results. Latin P is quite different from the Russian Р, etc.

Giving it a bit of thought, it is not strange at all, but the first time you run into the problem, it can be quite tough to figure it out. It's like the first time you accidentally activated the "Insert" key on your keyboard, and thought that your computer was broken. (By the way: "Insert", "Scroll Lock", "Pause/Break"... What on earth are these keys doing on my keyboard...?!)

If you are dealing with (language) data in different scripts, it's wise to validate your strings, to ensure that, e.g., a Russian string contains only Cyrillic characters, and that the Western European ones contain only Latin characters (see this related post).
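Such a validation could be sketched in Java with Character.UnicodeScript (available since Java 7). The class and method names below are invented for the example, and it assumes that script-neutral characters such as spaces, digits and punctuation should be let through.

```java
class ScriptCheck {
    // True if every code point belongs to the wanted script. Script-neutral
    // code points (spaces, digits, punctuation, combining marks) are allowed.
    static boolean isOnly(String text, Character.UnicodeScript wanted) {
        return text.codePoints().allMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == wanted
                    || script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
        });
    }

    public static void main(String[] args) {
        String latinY = "y";
        String cyrillicU = "\u0443"; // looks like a Latin y

        System.out.println(isOnly(latinY, Character.UnicodeScript.LATIN));              // true
        System.out.println(isOnly(cyrillicU, Character.UnicodeScript.LATIN));           // false
        System.out.println(isOnly(cyrillicU + " 1", Character.UnicodeScript.CYRILLIC)); // true
    }
}
```

A check like this would flag both of the look-alike strings above, since each of them mixes Latin and Cyrillic letters.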

And don't miss the 'A' example in the comment below!
