Friday, 5 September 2008

Case-insensitive pattern matching of Unicode strings in Java

To do case-insensitive pattern matching of Unicode strings in Java, pass a combination of flags as the second argument to Pattern.compile, like this:

Pattern p = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

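(This is useful when dealing with non-ASCII/non-Latin1 text, such as Cyrillic, because CASE_INSENSITIVE on its own only covers US-ASCII letters. However, it may not work flawlessly for Turkish, where the dotted and dotless i follow locale-specific case rules.)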
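
To make the difference between the two flags concrete, here is a minimal sketch. The class name, the helper method and the sample word are made up for illustration; only Pattern.compile and the two flag constants come from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class UnicodeCaseFlagsDemo {
    // Compiles the regex with the given flags and tests the whole input against it.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "привет";   // lowercase Cyrillic
        String input = "ПРИВЕТ";   // the same word in uppercase

        // CASE_INSENSITIVE alone only folds US-ASCII letters.
        System.out.println(matches(regex, Pattern.CASE_INSENSITIVE, input));   // prints false

        // Adding UNICODE_CASE extends the folding to non-ASCII letters.
        System.out.println(matches(regex,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, input));      // prints true
    }
}
```

The sketch assumes the source file is compiled with the encoding it was saved in (for example javac -encoding UTF-8), so that the Cyrillic literals reach the compiler intact.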

Update: I just learned that there is a nicer way of doing this: prefix the patternString above with the embedded flag expression "(?iu)", where i stands for CASE_INSENSITIVE and u for UNICODE_CASE:
Pattern p = Pattern.compile("(?iu)" + patternString);
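
One place where the embedded form is handy is with methods that accept only a regex string and no flags argument, such as String.matches. A small sketch follows; the class name, method names and sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class EmbeddedFlagsDemo {
    // Prefixes the regex with (?iu) instead of passing flag constants.
    static Pattern compileIgnoringCase(String patternString) {
        return Pattern.compile("(?iu)" + patternString);
    }

    // String.matches takes no flags, so the embedded form is the only way in.
    static boolean stringMatchesIgnoringCase(String input, String patternString) {
        return input.matches("(?iu)" + patternString);
    }

    public static void main(String[] args) {
        Pattern p = compileIgnoringCase("страна.*");
        System.out.println(p.matcher("СТРАНА чудес").matches());            // prints true
        System.out.println(stringMatchesIgnoringCase("ПРИВЕТ", "привет"));  // prints true
    }
}
```

Note that patternString is still interpreted as a regex here. If it comes from user input and should be matched literally, wrapping it in Pattern.quote before adding the prefix is one option.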


Anonymous said...

I use such code:

Pattern p = Pattern.compile("(?iu)^политика.*");
Matcher m = p.matcher("Политика");

And the matches method returns false. Why?
It does not matter whether I use "Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE" or "(?iu)". If I use "(?iu)^Политика.*" or "(?iu)^(П|п)олитика.*", it returns true.

Nikolaj Lindberg said...

Hi there Anonymous,

when I try your code, m.matches() returns true, as expected...
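
For anyone who wants to repeat the check, here is a sketch that writes the same pattern and input with Unicode escapes, so the result does not depend on how the source file is encoded. That encoding could explain the different results is only a guess; nothing in this thread confirms it. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class CommentCaseCheck {
    // Same pattern and input as in the comment above, spelled with
    // Unicode escapes so the compiler sees the intended characters
    // whatever encoding the file is saved in.
    static boolean check() {
        // "(?iu)^политика.*"
        Pattern p = Pattern.compile(
                "(?iu)^\u043f\u043e\u043b\u0438\u0442\u0438\u043a\u0430.*");
        // "Политика"
        Matcher m = p.matcher("\u041f\u043e\u043b\u0438\u0442\u0438\u043a\u0430");
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(check());   // prints true
    }
}
```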