Tuesday, 30 May 2017

Go: Checking what Unicode range a character belongs to

If you want to know which Unicode script a character (rune) belongs to in Go, you can look it up in the Scripts map found in the unicode package:


       r := 'ن' // The Arabic letter noon ('n')

       for s, t := range unicode.Scripts {
           if unicode.In(r, t) {
               fmt.Println(s) // Arabic
           }
       }

https://play.golang.org/


The map unicode.Scripts is keyed by the names of the Unicode scripts, such as Latin, Greek, Arabic, Cyrillic, etc. Each name is associated with a *RangeTable, representing the subset of the Unicode character set that belongs to that script. The unicode.In function in the snippet above reports whether the rune r is a member of the RangeTable t.
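The lookup can be wrapped in a small helper. The following is only a sketch: the function name scriptOf is made up for this example and is not part of the unicode package. It uses unicode.Is, which tests a rune against a single RangeTable, and returns an empty string for runes that are not found in any of the script tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf returns the name of the script in unicode.Scripts that
// contains r, or "" if no table contains it.
// Hypothetical helper, not part of the standard library.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	// Print each rune, its code point and the script it was found in.
	for _, r := range "nνнن1" {
		fmt.Printf("%c %U %s\n", r, r, scriptOf(r))
	}
}
```

Note that characters shared between scripts, such as digits and most punctuation, are reported as the script Common rather than as Latin.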

Checking which script a character belongs to can be useful for validating that all characters of a string belong to the same script. For example, the Latin and Cyrillic scripts contain characters that look identical but are different characters. Examples are c-с, p-р and a-а. Each pair is represented by two different Unicode code points. If you mix Latin and Cyrillic characters in a string, you might, for instance, not find an expected match in a database search.


       c1 := 'c' // Latin
       c2 := 'с' // Cyrillic

       fmt.Println(c1 == c2) // false

       fmt.Printf("%U\n", c1) // U+0063
       fmt.Printf("%U\n", c2) // U+0441

https://play.golang.org/
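
The validation described above could be sketched roughly as follows. The helper names scriptsIn and isSingleScript are invented for this example. The sketch assumes that only letters matter, so digits, spaces and punctuation are skipped, since most of them belong to the shared Common script. A real validator would probably need more rules than this.

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptsIn returns the set of script names used by the letters of s.
// Non-letters are skipped. Hypothetical helper for illustration only.
func scriptsIn(s string) map[string]bool {
	found := make(map[string]bool)
	for _, r := range s {
		if !unicode.IsLetter(r) {
			continue
		}
		for name, table := range unicode.Scripts {
			if unicode.Is(table, r) {
				found[name] = true
				break
			}
		}
	}
	return found
}

// isSingleScript reports whether the letters of s come from at most one script.
func isSingleScript(s string) bool {
	return len(scriptsIn(s)) <= 1
}

func main() {
	fmt.Println(isSingleScript("paper 42"))      // Latin letters only
	fmt.Println(isSingleScript("p\u0430per 42")) // second letter is Cyrillic U+0430
}
```

The second string is written with an escape sequence so that the Cyrillic letter is visible in the source code. When printed, the two strings look the same.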