Friday, 19 May 2017

ᚠᚢᚦᚬᚱᚴ Go strings and runes: Watch out for len(str)!

In the Go programming language, a string is made up of bytes, not characters. Sort of.

Beware of calling len() on a string

Consider the string "kääntäjä". It has eight characters --- and it means 'translator' in Finnish --- but when I put this string into a Go program, and check its length using the built-in len() function, I get 12, not 8:


        s := "kääntäjä"
        l := len(s)     // 12

https://play.golang.org/

Calling len("jaa") yields the expected 3. But len("jää") returns 5!
(I'm told that "jää" means 'ice' in Finnish.)

Indexing into a string is a similarly unrewarding exercise:


        s := "ä"
        l := len(s)
        fmt.Println(l)    // 2
        fmt.Println(s[0]) // 195
        fmt.Println(s[1]) // 164


https://play.golang.org/

The single-character string "ä" seems to be made up of two different integers...?!

If you are mostly interested in strings as a representation of text, that is, as a sequence of (alphabetic) characters, you should not use len() this way, or index into a string as above. The reason is that what may look like a string of characters is really a read-only sequence of bytes, in which each byte may or may not correspond to an actual character in your string. len() counts those bytes, and s[i] returns the byte at position i.
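To see those bytes for yourself, here is a small sketch that prints what is actually stored behind "jää". The helper name hexBytes is made up for this example; it only wraps fmt's "% x" verb, which prints each byte of a string in hexadecimal.

```go
package main

import "fmt"

// hexBytes (a hypothetical helper for this example) formats every byte
// of s as two hex digits, separated by spaces.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := "jää"
	fmt.Println(len(s))      // 5
	fmt.Println(hexBytes(s)) // 6a c3 a4 c3 a4
	fmt.Println([]byte(s))   // [106 195 164 195 164]
}
```

The pair c3 a4 (195 164 in decimal) shows up twice, once for each "ä".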

Go source files are UTF-8, so a string literal such as "jää" is stored in its UTF-8 encoding. UTF-8 is a variable-width encoding: each Unicode code point takes between one and four bytes. The ASCII characters (a-z, A-Z, 0-9 and a few others) only take one byte to encode, but all other characters take more than one.

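(UTF-8 handles this in a clever way, so that only the leading bits of the first byte have to be inspected to figure out how many bytes a character is made up of. A first byte starting with 0 is a complete one-byte character; a first byte starting with 110, 1110 or 11110 announces a sequence of two, three or four bytes; and each of the following bytes in the sequence starts with 10.)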
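The leading-bit scheme can be sketched roughly like this. Note that seqLen is a made-up helper, not something from the standard library, and it only reads the length marker in the first byte; it does not check that the rest of the sequence is valid. In real code, utf8.DecodeRuneInString does the full job, and main below prints its answer next to the hand-rolled one.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// seqLen (hypothetical helper) reports how many bytes a UTF-8 sequence
// claims to occupy, judging only by the high bits of its first byte.
// It returns 0 for a continuation byte (10xxxxxx) or any other byte
// that does not match one of the four length markers.
func seqLen(first byte) int {
	switch {
	case first&0x80 == 0x00: // 0xxxxxxx: one byte (ASCII)
		return 1
	case first&0xE0 == 0xC0: // 110xxxxx: two bytes
		return 2
	case first&0xF0 == 0xE0: // 1110xxxx: three bytes
		return 3
	case first&0xF8 == 0xF0: // 11110xxx: four bytes
		return 4
	}
	return 0
}

func main() {
	for _, s := range []string{"j", "ä", "ᚠ", "😀"} {
		// size is what the standard library decodes for the first rune.
		_, size := utf8.DecodeRuneInString(s)
		fmt.Printf("%s: first byte %08b, seqLen %d, utf8 package says %d\n",
			s, s[0], seqLen(s[0]), size)
	}
}
```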

Strings as runes (no, not the old Norse kind)

However, if you loop over a string using Go's built-in range clause, you will get the characters of the string, one by one. Or rather, the Unicode code point for each character, together with the byte index at which it starts. The snippet below loops over a string, and prints the indices and characters one by one. You can use the %c Printf formatting to turn a Unicode code point into an actual character:

        s := "jää"
        for i, r := range s {
          fmt.Printf("%d %c\n", i, r)
        }

        // Prints:
        // 0 j
        // 1 ä
        // 3 ä

https://play.golang.org/

Notice how the indices of the range loop above skip a number (from 1 to 3), since the "ä" character (rune) is made up of two bytes.

The range loop turns the string into a sequence of runes. A Go "rune" should not be confused with old Scandinavian runes (ᚠᚢᚦᚬᚱᚴ, ...), but that could have been fun. A rune in Go is merely an alias for int32, a data type that holds an integer. This integer represents a character, a Unicode code point.

        var r rune
        r = 78
        fmt.Printf("%c\n", r)  // Prints N

https://play.golang.org/

Notice that since a rune is just an integer, you can assign an illegal value, not representing an actual Unicode character, to it: for example r = -765.
Once in a string, an invalid code point will somehow turn into the  character ('\ufffd').

Counting characters in strings

There are different ways to count the characters (runes) of strings. One way is to convert a string into a slice of runes:


        s := "Motörhead play Björk"
        r := []rune(s)

        fmt.Println(len(s)) // 22 (Bleh!)
        fmt.Println(len(r)) // 20 (Yay!)

https://play.golang.org/


Another way to count characters is to import "unicode/utf8" and call utf8.RuneCountInString:

        utf8.RuneCountInString("Motörhead play Björk") // 20


(You can also loop over a string using "range", as above, and count the characters one by one.)
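The loop-and-count approach might look something like the sketch below. countRunes is a made-up name; for real code, utf8.RuneCountInString does the same thing without you having to write the loop.

```go
package main

import "fmt"

// countRunes (hypothetical helper) counts the code points in s by
// ranging over it; each iteration of the loop consumes one rune.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	fmt.Println(countRunes("Motörhead play Björk")) // 20
	fmt.Println(countRunes("kääntäjä"))             // 8
}
```

Unlike the []rune conversion, this does not allocate a new slice just to throw it away after taking its length.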


Runes to string

You can convert a sequence of runes back into a string using string(runes):

        string([]rune{66, 106, 246, 114, 107}) // "Björk"
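Going from string to runes and back again is handy whenever you want to shuffle characters around. As an illustration, here is a sketch of reversing a string rune by rune; reverse is a name made up for this example. It assumes that one rune equals one visible character, which does not hold for text that uses combining marks.

```go
package main

import "fmt"

// reverse (hypothetical helper) returns s with its runes in reverse
// order. Reversing the raw bytes instead would tear the two-byte
// "ö" apart and produce invalid UTF-8.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	fmt.Println(reverse("Björk")) // kröjB
}
```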