Thursday 22 November 2007

Identifying Unicode code blocks in Java

Java's Character class, through its nested class Character.UnicodeBlock, can identify the Unicode block (code block) to which a character belongs. This is useful when, for example, validating a string to detect peculiar mixtures of characters from different code blocks (see an example in a previous post).

The following code

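Character.UnicodeBlock ub = null;

// Cyrillic capital letter Ya, written as a Unicode escape
ub = Character.UnicodeBlock.of('\u042F');
System.out.println(ub);

// An Arabic-script digit typed literally; the source file encoding must preserve it
ub = Character.UnicodeBlock.of('۲');
System.out.println(ub);

outputs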
CYRILLIC
ARABIC
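
Since every block is exposed as a constant on Character.UnicodeBlock, the value returned by of() can also be compared directly instead of being printed. The following is only a minimal sketch of that idea; the class name BlockCheckSketch and the helper isCyrillic are hypothetical names chosen for illustration.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical illustration class; the names are not from the original post.
class BlockCheckSketch
{
    // Returns true if the given char lies in the basic Cyrillic block.
    static boolean isCyrillic(char c)
    {
        // of() returns one of the shared block constants (or null),
        // so an identity comparison is sufficient here.
        return UnicodeBlock.of(c) == UnicodeBlock.CYRILLIC;
    }

    public static void main(String[] args)
    {
        System.out.println(isCyrillic('\u042F')); // true: Cyrillic capital letter Ya
        System.out.println(isCyrillic('A'));      // false: 'A' is in BASIC_LATIN
    }
}
```

Note that this checks only the basic Cyrillic block; related blocks are separate constants and would need their own comparison.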


The following method returns the set of code blocks to which the characters of a string belong:

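import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

// Walks the string by code point so that surrogate pairs map to their real block.
Set<UnicodeBlock> getUnicodeCodeBlocks(final String s)
{
    Set<UnicodeBlock> result = new HashSet<UnicodeBlock>();
    int i = 0;
    while (i < s.length())
    {
        int cp = s.codePointAt(i);
        result.add(UnicodeBlock.of(cp));
        i += Character.charCount(cp);
    }
    return result;
}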
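
As a rough illustration of the validation use case mentioned at the start, the resulting set can be checked for more than one block. This is only a sketch under stated assumptions: the class name MixedBlockSketch, the helper hasMixedBlocks, and the rule "more than one block means mixed" are all hypothetical and chosen for illustration, and the block-collecting method is repeated here (iterating by code point) so that the example is self-contained.

```java
import java.lang.Character.UnicodeBlock;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; the names are not from the original post.
class MixedBlockSketch
{
    // Collects the block of every code point in the string.
    static Set<UnicodeBlock> getUnicodeCodeBlocks(final String s)
    {
        Set<UnicodeBlock> result = new HashSet<UnicodeBlock>();
        int i = 0;
        while (i < s.length())
        {
            int cp = s.codePointAt(i);
            result.add(UnicodeBlock.of(cp));
            i += Character.charCount(cp);
        }
        return result;
    }

    // Assumed rule for this sketch: more than one block counts as "mixed".
    static boolean hasMixedBlocks(final String s)
    {
        return getUnicodeCodeBlocks(s).size() > 1;
    }

    public static void main(String[] args)
    {
        // "paypal" with U+0430 (Cyrillic small letter a) replacing the first Latin a
        String spoofed = "p\u0430ypal";
        System.out.println(getUnicodeCodeBlocks(spoofed));
        System.out.println(hasMixedBlocks(spoofed));  // true
        System.out.println(hasMixedBlocks("paypal")); // false
    }
}
```

A rule this simple is too strict for real text: spaces, digits and ASCII punctuation all belong to BASIC_LATIN, so an ordinary Russian sentence would also be reported as mixed. A practical check would probably ignore such common characters or compare against a list of allowed blocks.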
