Tuesday 30 May 2017

Go: Checking what Unicode range a character belongs to

Update: If what you are after is the Unicode name of the character itself, rather than the script it belongs to, there is a more direct way than the one described below:


       import ("golang.org/x/text/unicode/runenames")

       ...

           name := runenames.Name('م') //ARABIC LETTER MEEM

       ...

https://play.golang.org



Below is an alternative way:

If you want to know what part of the Unicode table a character (rune) belongs to in Go, you can use the Scripts map found in the unicode package:


       r := 'ن' // Arabic letter noon (U+0646)

       for s, t := range unicode.Scripts {
           if unicode.In(r, t) {
               fmt.Println(s) // Arabic
           }
       }

https://play.golang.org/


The map unicode.Scripts contains the names of the different parts of the Unicode table, such as Latin, Greek, Arabic, Cyrillic, etc. Each such name is associated with a RangeTable, representing a subset of the Unicode character set. The unicode.In function in the snippet above checks whether a rune r is found in the RangeTable t.
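If you already know which script you are interested in, there is no need to loop over the whole map. Here is a minimal sketch of that idea. The helper name inScript is my own invention for illustration and not part of the unicode package; unicode.Is and the exported tables such as unicode.Arabic are.

```go
package main

import (
	"fmt"
	"unicode"
)

// inScript is a hypothetical helper: it reports whether r belongs to the
// script with the given name (a key of unicode.Scripts, such as "Arabic").
// Unknown script names yield false.
func inScript(r rune, name string) bool {
	table, ok := unicode.Scripts[name]
	if !ok {
		return false
	}
	return unicode.Is(table, r)
}

func main() {
	r := '\u0646' // Arabic letter noon

	fmt.Println(inScript(r, "Arabic")) // true
	fmt.Println(inScript(r, "Latin"))  // false

	// The same tables are also exported as package-level variables.
	fmt.Println(unicode.Is(unicode.Arabic, r))                 // true
	fmt.Println(unicode.In(r, unicode.Latin, unicode.Cyrillic)) // false
}
```

The map lookup is handy when the script name comes from configuration or user input; the exported variables are simpler when the script is known at compile time.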

Checking which script a character belongs to can be useful for validating that all characters of a string belong to the same script. For example, the Latin and Cyrillic scripts have characters that look identical, but are different characters. Examples are c-с, p-р and a-а. They may look identical, but they are represented by different Unicode code points. If you mix Latin and Cyrillic characters in a string, you might, for instance, fail to find an expected match in a database search.


       c1 := 'c' // Latin
       c2 := 'с' // Cyrillic

       fmt.Println(c1 == c2) // false

       fmt.Printf("%U\n", c1) // U+0063
       fmt.Printf("%U\n", c2) // U+0441

https://play.golang.org/
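The validation idea above could be sketched roughly like this. The function names scriptOf and scriptsIn are hypothetical, and skipping the "Common" and "Inherited" entries (spaces, digits, punctuation, combining marks) is my own assumption about what a sensible check would ignore.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// scriptOf returns the name of the script that r belongs to according to
// unicode.Scripts, or "" if no table contains it.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

// scriptsIn returns the sorted names of all scripts used in s.
// Assumption: "Common" and "Inherited" are ignored, since they cover
// characters shared between scripts.
func scriptsIn(s string) []string {
	seen := map[string]bool{}
	for _, r := range s {
		name := scriptOf(r)
		if name == "" || name == "Common" || name == "Inherited" {
			continue
		}
		seen[name] = true
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(scriptsIn("paypal"))      // [Latin]
	fmt.Println(scriptsIn("p\u0430ypal")) // [Cyrillic Latin]
}
```

A string for which scriptsIn returns more than one name mixes scripts, which may or may not be a problem depending on what kind of text you expect.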

Friday 19 May 2017

ᚠᚢᚦᚬᚱᚴ Go strings and runes: Watch out for len(str)!

In the Go programming language, a string is made up of bytes, not characters. Sort of.

Beware of calling len() on a string

Consider the string "kääntäjä". It has eight characters --- and it means 'translator' in Finnish --- but when I put this string into a Go program, and check its length using the built-in len() function, I get 12, not 8:


        s := "kääntäjä"
        l := len(s)     // 12

https://play.golang.org/

Calling len() on "jaa" yields the expected 3. But len("jää") returns 5!
(I'm told that "jää" means 'ice' in Finnish.)

Indexing into a string is a similarly unrewarding exercise:


        s := "ä"
        l := len(s)
        fmt.Println(l)    // 2
        fmt.Println(s[0]) // 195
        fmt.Println(s[1]) // 164


https://play.golang.org/

The single-character string "ä" seems to be made up of two different integers...?!

If you are mostly interested in strings as a representation of text --- as a sequence of (alphabetic) characters --- you should not use len() this way, or index into a string as above. The reason is that what may look like a string of characters is in fact a sequence of bytes, in which each byte may or may not correspond to an actual character in your string.
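If you do need the first character of a string, the unicode/utf8 package can decode it for you. A small sketch, where firstRune is a hypothetical wrapper of my own around utf8.DecodeRuneInString:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper that returns the first character of s
// together with the number of bytes that character occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "äx"

	fmt.Println(s[0]) // 195, which is only the first byte of "ä"

	r, size := firstRune(s)
	fmt.Printf("%c %d\n", r, size) // ä 2

	// The size tells you where the next character starts.
	fmt.Println(s[size:]) // x
}
```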

UTF-8 uses a variable number of bytes to encode different parts of the Unicode character tables. The ASCII characters (a-z, A-Z, 0-9 and a few others) take only one byte each to encode, but other characters take two, three or four bytes.

(UTF-8 is designed so that only the leading bits of the first byte have to be inspected to figure out how many bytes a character is made up of. The remaining bytes of the character all start with the bits 10.)
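How the first byte reveals the length can be sketched in a few lines. This is an illustration only: seqLen is a hypothetical function of my own, it looks at nothing but the bit pattern, and it does not validate the input the way the real decoder in unicode/utf8 does.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// seqLen guesses how many bytes a UTF-8 encoded character occupies by
// looking only at the leading bits of its first byte. It returns 0 for a
// byte that cannot start a character (for example a continuation byte).
func seqLen(first byte) int {
	switch {
	case first&0x80 == 0x00: // 0xxxxxxx: one byte (ASCII)
		return 1
	case first&0xE0 == 0xC0: // 110xxxxx: two bytes
		return 2
	case first&0xF0 == 0xE0: // 1110xxxx: three bytes
		return 3
	case first&0xF8 == 0xF0: // 11110xxx: four bytes
		return 4
	}
	return 0 // 10xxxxxx (continuation byte) or otherwise not a lead byte
}

func main() {
	for _, s := range []string{"j", "ä", "ᚠ", "\U0001F600"} {
		r, _ := utf8.DecodeRuneInString(s)
		// Compare the bit-pattern guess with what the standard library says.
		fmt.Printf("%U first byte %08b: seqLen=%d utf8.RuneLen=%d\n",
			r, s[0], seqLen(s[0]), utf8.RuneLen(r))
	}
}
```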

Strings as runes (no, not the old Norse kind)

However, if you loop over a string using Go's built-in range clause, you will get the characters of the string, one by one. Or rather, the unique Unicode code point for each character. The snippet below loops over a string, and prints the indices and characters one by one. You can use the %c Printf verb to turn a Unicode code point into an actual character:

        s := "jää"
        for i, r := range s {
            fmt.Printf("%d %c\n", i, r)
        }

        // Prints:
        // 0 j
        // 1 ä
        // 3 ä

https://play.golang.org/

Notice how the indices of the range loop above skip a number (from 1 to 3), since the "ä" character (rune) is made up of two bytes.

The range loop turns the string into a sequence of runes. A Go "rune" should not be confused with old Scandinavian runes (ᚠᚢᚦᚬᚱᚴ, ...), but that could have been fun. A rune in Go is merely a data type that holds an integer. This integer represents a character, a Unicode code point.

        var r rune
        r = 78
        fmt.Printf("%c\n", r)  // Prints N

https://play.golang.org/

Notice that since a rune is just an integer, you can assign it an illegal value that does not represent an actual Unicode character, for example r = -765.
When converted to a string, an invalid code point turns into the Unicode replacement character '�' ('\ufffd').
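A quick way to see this for yourself is sketched below. The helper runeToString is hypothetical and only wraps the built-in conversion; utf8.ValidRune and utf8.RuneError come from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToString is a hypothetical helper that wraps the built-in
// rune-to-string conversion.
func runeToString(r rune) string {
	return string(r)
}

func main() {
	var r rune = -765

	fmt.Println(utf8.ValidRune(r)) // false

	s := runeToString(r)
	fmt.Printf("%+q\n", s)                     // "\ufffd"
	fmt.Println(s == string(utf8.RuneError)) // true
}
```

If you build strings from integers that come from outside your program, checking utf8.ValidRune first lets you reject bad values instead of silently getting the replacement character.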

Counting characters in strings

There are different ways to count the characters (runes) of strings. One way is to convert a string into a sequence of runes:


        s := "Motörhead play Björk"
        r := []rune(s)

        fmt.Println(len(s)) // 22 (Bleh!)
        fmt.Println(len(r)) // 20 (Yay!)

https://play.golang.org/


Another way to count characters is to import "unicode/utf8" and call utf8.RuneCountInString:

        utf8.RuneCountInString("Motörhead play Björk") // 20


(You can also loop over a string using "range", as above, and count the characters one by one.)
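The range-based counting could look like the sketch below. The function name countRunes is my own; for valid UTF-8 input it should agree with utf8.RuneCountInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes counts the characters (runes) of s by ranging over it.
// Each iteration of the loop consumes one rune, however many bytes it has.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "Motörhead play Björk"

	fmt.Println(len(s))                    // 22
	fmt.Println(countRunes(s))             // 20
	fmt.Println(utf8.RuneCountInString(s)) // 20
}
```

Unlike the []rune conversion, neither the loop nor utf8.RuneCountInString needs to allocate a new slice just to count.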


Runes to string

You can convert a sequence of runes back into a string using string(runes):

       string([]rune{66, 106, 246, 114, 107}) // "Björk"
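Going from string to runes and back again is handy whenever you want to work on characters rather than bytes. As one example, here is a sketch that reverses a string; reverseRunes is a hypothetical name, and the approach treats every code point as one character.

```go
package main

import "fmt"

// reverseRunes returns s with its characters (runes) in reverse order.
// It converts the string to a []rune, swaps the runes in place, and
// converts the result back into a string.
func reverseRunes(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func main() {
	fmt.Println(reverseRunes("Björk")) // kröjB
}
```

Reversing the bytes of the string instead would tear the two bytes of "ö" apart and produce invalid UTF-8.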