Sunday 25 November 2007

Correct case with Java's Locale

In Turkish, the uppercase version of 'i' is 'İ' (not 'I'). The problem is that the Turkish and the "ordinary" Latin 'i' are the same character (the same Unicode code point). If you upcase an 'i' in a Turkish context using the default settings, you might get the wrong letter.

In Java, you can use the Locale class to get this right:


Locale tr = new Locale("tr"); //Turkish
String trI = "i".toUpperCase(tr);
System.out.println(trI);

The above code outputs

İ
(and not 'I').

Be aware that comparing Turkish strings may not work flawlessly. See also this post.
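The comparison caveat can be illustrated with the reverse operation, lowercasing. This is only a minimal sketch: the class and method names are made up for the example, and Locale.forLanguageTag("tr") is used here in place of new Locale("tr"), assuming a Java 7 or later runtime. Under Turkish rules an uppercase 'I' becomes the dotless 'ı' (U+0131), so a keyword lowercased in a Turkish context no longer equals its ASCII form.

```java
import java.util.Locale;

public class TurkishCaseDemo {
    // Lowercases using Turkish rules: 'I' becomes dotless i (U+0131).
    public static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Lowercases using the locale-neutral root locale: 'I' becomes 'i'.
    public static String lowerNeutral(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String keyword = "TITLE";
        System.out.println(lowerTurkish(keyword).equals("title")); // false
        System.out.println(lowerNeutral(keyword).equals("title")); // true
    }
}
```

For strings that are identifiers or keywords rather than Turkish text, passing Locale.ROOT explicitly is one way to keep the result independent of the default locale.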

You should also notice that changing the Locale does other things as well. For instance you might end up getting error messages in Turkish...

Friday 23 November 2007

Sun's javac and source file character encoding

Sometimes you may need to tell the Java compiler about the character encoding of the source files. For example, if an ISO-8859-1 encoded source file is compiled in a UTF-8 environment, any funny (non-ASCII) characters may become a problem.

This is an example of how you can tell Sun's javac about the source file encoding:


javac -encoding iso8859-1 <FILE PATH(s)>

(Without the encoding switch, javac uses your system's default encoding.)


This way, you can tell Sun's JVM to expect UTF-8 IO:
java -Dfile.encoding=utf8
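To see why the encoding matters, here is a small sketch (the class and method names are invented for the example, and it assumes Java 7 or later for StandardCharsets). The single byte 0xE9 is 'é' in ISO-8859-1, but it is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 produces a replacement character instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Decodes bytes with an explicitly named charset instead of the platform default.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] latin1Bytes = { (byte) 0xE9 }; // 'e' with acute accent in ISO-8859-1

        // Decoded with the encoding it was written in: the expected letter.
        System.out.println(decode(latin1Bytes, StandardCharsets.ISO_8859_1).equals("\u00e9")); // true

        // Decoded as UTF-8: the lone byte is malformed and gets replaced.
        System.out.println(decode(latin1Bytes, StandardCharsets.UTF_8).equals("\u00e9")); // false
    }
}
```

Naming the charset in code like this is an alternative to relying on the default that the file.encoding property controls.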

Thursday 22 November 2007

Identifying Unicode code blocks in Java

With the help of Java's Character class, one can identify which code block a Unicode character belongs to. This may be useful when, for example, validating a string in order to find peculiar mixtures of character code blocks (see an example in a previous post).

The following code

Character.UnicodeBlock ub = null;

ub = Character.UnicodeBlock.of('\u042F');
System.out.println(ub);

ub = Character.UnicodeBlock.of('۲');
System.out.println(ub);

outputs

CYRILLIC
ARABIC


This is a method returning all code blocks for the characters of a string:

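// Needs: java.lang.Character.UnicodeBlock, java.util.HashSet, java.util.Set
// Note: works char by char, so characters outside the BMP show up as surrogate blocks.
Set<UnicodeBlock> getUnicodeCodeBlocks(final String s)
{
    Set<UnicodeBlock> result = new HashSet<UnicodeBlock>();
    for(char c : s.toCharArray())
    {
        result.add(Character.UnicodeBlock.of(c));
    }
    return result;
}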
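One possible use of such a method is flagging strings that mix Latin and Cyrillic letters. The following is only a sketch: the class and method names are invented for the example, and, unlike the method above, it walks the string by code point rather than by char.

```java
import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

public class BlockMixCheck {
    // Collects the Unicode block of every code point in the string.
    public static Set<UnicodeBlock> blocksOf(String s) {
        Set<UnicodeBlock> result = new HashSet<UnicodeBlock>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(UnicodeBlock.of(cp));
            i += Character.charCount(cp);
        }
        return result;
    }

    // True if the string contains both Basic Latin and Cyrillic characters.
    public static boolean mixesLatinAndCyrillic(String s) {
        Set<UnicodeBlock> blocks = blocksOf(s);
        return blocks.contains(UnicodeBlock.BASIC_LATIN)
            && blocks.contains(UnicodeBlock.CYRILLIC);
    }

    public static void main(String[] args) {
        System.out.println(mixesLatinAndCyrillic("paypal"));      // false
        System.out.println(mixesLatinAndCyrillic("p\u0430ypal")); // true (U+0430 is Cyrillic)
    }
}
```

Note that spaces, digits and ASCII punctuation also belong to the Basic Latin block, so a real check would probably want to look at letters only.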

Tuesday 20 November 2007

The ancient art of bashing the Danish language

This is what a Swede (Dr Hemming Gadh) reportedly said about the Danish language in 1510 (translation below):

‘Der till medh: så wærdas de icke heller att talla som annat folck, utan tryckia ordhen fram lika som the willia hosta, och synas endeles medh flitt forwendhe ordhen i strupan, for æn de komma fram, sammaledes wanskapa the munnen, då the talla, wridhan och wrengan, så att the draga then offwra leppen till then wenstra sidon och den nedra till then högra sidon, menandes dett wara sig en besynnerlighe prydning och wellståndh.’

The Swedish of the section above is quite weird in itself. The following English translation is lifted from Syllabic and morphological structure: what can be learnt from their interaction in Danish? (Hans Basbøll, 2006):

‘Also this: nor do they [the Danes] stoop (‘worthy themselves’) to speak like other people, but press the words forward as if they will cough, and appear partly to deliberately turn the words around in the throat, before they come forward (i.e. out of the mouth), partly they misshape the mouth when they speak, twist it and sneer it, so that they pull the upper lip to the left side and the lower to the right side, thinking this to be a particular ornament and well-standing.’

The first section of the paper referred to above is "Can spoken Danish be understood?". The paper cites studies pointing to the fact "[...] that Danish is found (more or less) difficult by everybody" (‘everybody’ meaning other Scandinavians).

The old Danish-bashing quotation is also found in The Phonology of Danish, Hans Basbøll, Oxford University Press, 2005, p. 83.

The byte-order mark

Things you didn't know that you had to know about #42:

The BOM, or the byte-order mark.

The BOM is a hateful creature that sits at the beginning of some of your files. It's there only to mess things up.

You cannot see the BOM, but when you do, it hurts your eye: .

The most rational action when running into the BOM in a UTF-8 file is to blame the nearest Windows user, and then delete it (the BOM, not the user).
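For the deleting part, a minimal Java sketch (class and method names are made up for the example; StandardCharsets assumes Java 7 or later). Java's UTF-8 decoder leaves the BOM in place, so after decoding it shows up as the character U+FEFF at the start of the string, where it can be cut off.

```java
import java.nio.charset.StandardCharsets;

public class BomStripper {
    private static final String BOM = "\uFEFF";

    // Returns the text without a leading byte-order mark, if there is one.
    public static String stripBom(String text) {
        return text.startsWith(BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // "hi" preceded by the UTF-8 encoded BOM (EF BB BF).
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        String decoded = new String(withBom, StandardCharsets.UTF_8);

        System.out.println(decoded.length());           // 3
        System.out.println(stripBom(decoded).length()); // 2
    }
}
```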

Don't ask me why the BOM made me remember this old tune:

Wednesday 7 November 2007

The file and iconv commands

file and iconv are two simple but useful commands that come in handy when dealing with files in different character encodings.

file <file path>


The file command guesses what kind of file a file is. If you are lucky, it may, for instance, help you find out that a text file is encoded as UTF-16, ISO-8859-1, etc.

iconv -f <current encoding> -t <target encoding> <file path>

The iconv command is useful for converting between different character encodings. For instance, if you have noticed (with the help of the file command above) that a Unicode file is in UTF-16, but you want it to be UTF-8, you may use the iconv command:

iconv -f utf16 -t utf8 <file path>

(There are other, similar, commands, but for some reason iconv appears to be the only one that I can remember the name of.)
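If no such command is at hand, the same kind of conversion can be sketched in Java. The class and method names below are invented for the example, which works on byte arrays in memory and assumes Java 7 or later; for real files one would read and write the bytes with something like java.nio.file.Files. Java's UTF-16 decoder honours a leading BOM and assumes big-endian without one, which may differ from what iconv does on a given system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcoder {
    // Decodes the input with one charset and re-encodes it with another.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // Cyrillic small letter u (U+0443) in UTF-16 with a big-endian BOM.
        byte[] utf16 = { (byte) 0xFE, (byte) 0xFF, 0x04, 0x43 };
        byte[] utf8 = transcode(utf16, StandardCharsets.UTF_16, StandardCharsets.UTF_8);

        for (byte b : utf8) {
            System.out.printf("%02x ", b); // prints: d1 83
        }
        System.out.println();
    }
}
```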

Tuesday 6 November 2007

The unicode command

I was just told about the unicode command.

On a Debian-based system, try sudo apt-get install unicode.

Running the unicode command on the different characters 'y' and 'у' (see earlier post) looks like this:


nikolaj@fon:~$ unicode yу
U+0079 LATIN SMALL LETTER Y
UTF-8: 79 UTF-16BE: 0079 Decimal: &#121;
y (Y)
Uppercase: U+0059
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)

U+0443 CYRILLIC SMALL LETTER U
UTF-8: d1 83 UTF-16BE: 0443 Decimal: &#1091;
у (У)
Uppercase: U+0423
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
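A rough Java counterpart for the first line of each entry, as a sketch only: the class and method names are made up, and Character.getName assumes Java 7 or later. It prints the code point and its Unicode name for each character of a string.

```java
public class CodePointInfo {
    // Formats a code point as "U+XXXX NAME", similar to the first line of the unicode output.
    public static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        String s = "y\u0443"; // Latin y followed by Cyrillic u
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println(describe(cp));
            i += Character.charCount(cp);
        }
        // Prints:
        // U+0079 LATIN SMALL LETTER Y
        // U+0443 CYRILLIC SMALL LETTER U
    }
}
```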

Unicode mystery: Identical typeface, different characters

Do the following two strings look similar to you?

уeЕoОxХaАM
yеEоOхXаAМ

Sorry, but they aren't. Not one pair of corresponding characters in those strings is equal, except to the eye. For example, 'y' and 'у' have different Unicode code points (U+0079 and U+0443).

The fact is that both strings contain a mix of Latin and Cyrillic characters, but not the same mix. There are a number of Cyrillic and Latin characters that look the same, but aren't. Comparing these strings using, for example, Java's String.equals method will return false. Sorting strings of mixed scripts (but with identical-looking glyphs) will produce odd results. Latin P is quite different from the Russian Р, etc.

Giving it a bit of thought, it is not strange at all, but the first time you run into the problem, it can be quite tough to figure out. It's like the first time you accidentally activated the "Insert" key on your keyboard, and thought that your computer was broken. (By the way: "Insert", "Scroll Lock", "Pause/Break"... What on earth are these keys doing on my keyboard...?!)

If you are dealing with (language) data of different character encodings, it's wise to validate your strings, to ensure that, e.g., a Russian string contains only Cyrillic characters, and that the Western European ones contain only Latinos (see this related post).
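Such a validation could be sketched in Java roughly as follows. The class and method names are invented for the example, and Character.UnicodeScript assumes Java 7 or later. Characters of the COMMON script (spaces, digits, most punctuation) are let through; anything else must belong to the expected script.

```java
import java.lang.Character.UnicodeScript;

public class ScriptValidator {
    // True if every code point is either in the expected script or script-neutral (COMMON).
    public static boolean isOnlyScript(String s, UnicodeScript expected) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != expected && script != UnicodeScript.COMMON) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String russian = "\u043f\u0440\u0438\u0432\u0435\u0442";  // all Cyrillic
        String sneaky = "\u043f\u0440\u0438\u0432e\u0442";        // Latin 'e' slipped in

        System.out.println(isOnlyScript(russian, UnicodeScript.CYRILLIC)); // true
        System.out.println(isOnlyScript(sneaky, UnicodeScript.CYRILLIC));  // false
    }
}
```

Combining marks belong to the INHERITED script and would be rejected by this sketch, so it may need loosening for real data.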

And don't miss the 'A' example in the comment below!

Monday 5 November 2007

Ubuntu 7.10 sluggified my laptop

Upgrading my Acer laptop from Ubuntu 7.04 (aka "Bibbly Bobbly") to 7.10 (aka "Sniggly Snoggly") significantly downgraded its GUI performance. It now takes twice as long to boot the laptop. It takes five times as long to bring up the desktop after log-in. Starting Firefox takes a millennium. I didn't even get the promised 3D Desktop jingle-jangle. All I got were some lousy desktop icons.

Furthermore, something happened to the networking, and it now takes quite a while for the poor thing to find the Internets. Altogether, this means that I have to get up no less than 2.5 minutes earlier each day, to meet my busy schedule.

Unsnappiness 2.0, and I want my money back!

However, when I nikoloogle for people with a similar experience, all I find is people going "Wooah, I got better performance with Ubuntu 7.10!". One wonders, am I alone in this?

Well, maybe not everyone praised its superior performance:


Welcome to the least visited page of the Internets!