Nikoloogle Lindbloogle

Go: Checking what Unicode range a character belongs to

2017-05-30T20:30:00.000+02:00

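Update: There is another, often better way than the one described below. The runenames package returns the full Unicode name of a rune, and that name usually starts with the name of the script the rune belongs to: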


       import ("golang.org/x/text/unicode/runenames")

       ...

           name := runenames.Name('م') //ARABIC LETTER MEEM

       ...

https://play.golang.org

Below is an alternative way:

If you want to know what part of the Unicode table a character (rune) belongs to in Go, you can use the Scripts map found in the unicode package:


       r := 'ن' // The isolated form of Arabic 'n'


       for s, t := range unicode.Scripts {
           if unicode.In(r, t) {
               fmt.Println(s) // Arabic
           }
       }

https://play.golang.org/

The map unicode.Scripts contains the names of the different parts of the Unicode table, such as Latin, Greek, Arabic, Cyrillic, etc. Each such name is associated with a RangeTable, representing a subset of the Unicode character set. The unicode.In function in the snippet above checks whether a rune r is found in the RangeTable t.
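The loop over unicode.Scripts can be wrapped in a small helper. The following is only a sketch: the function name scriptsOf is made up for this example, and the result is sorted because Go map iteration order is random.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// scriptsOf (a hypothetical helper) returns the names of all tables in
// unicode.Scripts that contain the rune r, in sorted order.
func scriptsOf(r rune) []string {
	var names []string
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Latin n, Cyrillic en, Greek nu and a digit.
	for _, r := range "n\u043d\u03bd7" {
		fmt.Printf("%c %U %v\n", r, r, scriptsOf(r))
	}
}
```

Digits and punctuation end up in the table named Common, since they are shared between scripts.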

Checking what part of the Unicode table a character belongs to can be useful for validating that all characters of a string belong to the same script. For example, the Latin and Cyrillic scripts have characters that look identical but are different characters, such as c-с, p-р and a-а. They are represented by different Unicode code points, so if you mix Latin and Cyrillic characters in a string, you might, for instance, not find an expected match in a database search.


       c1 := 'c' // Latin
       c2 := 'с' // Cyrillic

       fmt.Println(c1 == c2) // false

       fmt.Printf("%U\n", c1) // U+0063
       fmt.Printf("%U\n", c2) // U+0441

https://play.golang.org/
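The same-script validation mentioned above could be sketched like this. The helper name scriptsIn is an assumption of this example, and so is the decision to skip runes in the Common and Inherited tables (digits, punctuation, combining marks), which would otherwise make almost every string look mixed.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// scriptsIn (a hypothetical helper) returns the sorted names of the
// scripts used in s. Runes in unicode.Common and unicode.Inherited are
// skipped, since they are shared between scripts.
func scriptsIn(s string) []string {
	seen := map[string]bool{}
	for _, r := range s {
		if unicode.In(r, unicode.Common, unicode.Inherited) {
			continue
		}
		for name, table := range unicode.Scripts {
			if unicode.Is(table, r) {
				seen[name] = true
			}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	fmt.Println(scriptsIn("pace"))      // all Latin
	fmt.Println(scriptsIn("p\u0430ce")) // the second letter is Cyrillic U+0430
}
```

A string that yields more than one script name is a candidate for closer inspection. Whether it is actually an error depends on the data, since some texts mix scripts on purpose.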

ᚠᚢᚦᚬᚱᚴ Go strings and runes: Watch out for len(str)!

2017-05-19T10:32:00.000+02:00

In the Go programming language, a string is made up of bytes, not characters. Sort of.

Beware of calling len() on a string

Consider the string "kääntäjä". It has eight characters --- and it means 'translator' in Finnish --- but when I put this string into a Go program, and check its length using the built-in len() function, I get 12, not 8:


        s := "kääntäjä"
        l := len(s)     // 12

https://play.golang.org/

The length of "jaa", using len(), yields the expected 3. But len("jää") returns 5!
(I'm told that "jää" means 'ice' in Finnish.)

Indexing into a string is a similarly unrewarding exercise:


        s := "ä"
        l := len(s)
        fmt.Println(l)    // 2
        fmt.Println(s[0]) // 195
        fmt.Println(s[1]) // 164

https://play.golang.org/

The single-character string "ä" seems to be made up of two different integers...?!

If you are mostly interested in strings as a representation of text --- as a sequence of (alphabetic) characters --- you should not use len() this way, or index into a string as above. The reason is that what may look like a string of characters is an array of bytes, in which each byte may or may not correspond to an actual character in your string.

UTF-8 uses a variable number of bytes to encode each Unicode code point. The ASCII characters (a-z, 0-9 and a few others) take only one byte each, while other characters take two, three or four bytes.

(UTF-8 is designed so that only the leading bits of the first byte have to be inspected to figure out how many bytes a character is made up of.)
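To make the remark about the first few bits concrete, here is a rough sketch. The function seqLen is invented for this example and only looks at the bit pattern of the first byte; it does not check that the rest of the sequence is valid. It is compared against the size reported by utf8.DecodeRuneInString from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// seqLen (a hypothetical helper) reports how many bytes a UTF-8 sequence
// has, judging only by the bit pattern of its first byte b. It returns 0
// if b is a continuation byte (10xxxxxx) or matches no lead byte pattern.
func seqLen(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx: ASCII, one byte
		return 1
	case b&0xE0 == 0xC0: // 110xxxxx: two bytes
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx: three bytes
		return 3
	case b&0xF8 == 0xF0: // 11110xxx: four bytes
		return 4
	}
	return 0
}

func main() {
	// j, a with diaeresis, euro sign and an emoji: 1, 2, 3 and 4 bytes.
	s := "j\u00e4\u20ac\U0001F600"
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("%U first byte %08b seqLen=%d size=%d\n", r, s[i], seqLen(s[i]), size)
		i += size
	}
}
```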

Strings as runes (no, not the old Norse kind)

However, if you loop over a string using Go's built-in range clause, you will get the characters of the string, one by one. Or rather, the Unicode code point for each character. The snippet below loops over a string, and prints the indices and characters one by one. You can use the %c Printf verb to turn a Unicode code point into an actual character:


        s := "jää"
        for i, r := range s {
          fmt.Printf("%d %c\n", i, r)
        }

// Prints:
// 0 j
// 1 ä
// 3 ä

https://play.golang.org/

Notice how the indices of the range loop above skip a number (from 1 to 3), since the "ä" character (rune) is made up of two bytes.

The range loop turns the string into a sequence of runes. A Go "rune" should not be confused with old Scandinavian runes (ᚠᚢᚦᚬᚱᚴ, ...), but that could have been fun. A rune in Go is merely an integer type (an alias for int32). The integer represents a character, a Unicode code point.

        var r rune
        r = 78
        fmt.Printf("%c\n", r)  // Prints N

https://play.golang.org/

Notice that since a rune is just an integer, you can assign it an illegal value that does not represent an actual Unicode character, for example r = -765.
When converted to a string, an invalid code point turns into the � character ('\ufffd'), the Unicode replacement character.
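A small sketch of what happens to such a value, assuming nothing beyond the standard library. The helper name asString is made up here; it just wraps the ordinary string(r) conversion, and utf8.ValidRune reports whether a rune can be legally encoded.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asString (a hypothetical helper) converts a single rune to a string,
// exactly as the built-in conversion string(r) does.
func asString(r rune) string {
	return string(r)
}

func main() {
	// 78 is 'N'; -765 and 0x110000 are outside the Unicode range.
	for _, r := range []rune{78, -765, 0x110000} {
		fmt.Printf("%d valid=%t string=%q\n", r, utf8.ValidRune(r), asString(r))
	}
}
```

Checking with utf8.ValidRune before converting is one way to notice bad values instead of silently getting replacement characters.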

Counting characters in strings

There are different ways to count the characters (runes) of strings. One way is to convert a string into a sequence of runes:


        s := "Motörhead play Björk"
        r := []rune(s)

        fmt.Println(len(s)) // 22 (Bleh!)
        fmt.Println(len(r)) // 20 (Yay!)

https://play.golang.org/

Another way to count characters is to import "unicode/utf8" and call utf8.RuneCountInString:

        utf8.RuneCountInString("Motörhead play Björk") // 20

(You can also loop over a string using "range", as above, and count the characters one by one.)
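For completeness, here is a sketch of the range-based counting next to the other two approaches. The function name countRunes is an invention of this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countRunes (a hypothetical helper) counts the runes in s by ranging
// over the string, which advances one code point per iteration.
func countRunes(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	s := "Motörhead play Björk"
	// Bytes, then three ways of counting runes.
	fmt.Println(len(s), len([]rune(s)), utf8.RuneCountInString(s), countRunes(s))
}
```

Unlike the []rune conversion, the range loop and utf8.RuneCountInString do not allocate a new slice, which may matter for long strings.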

Runes to string

You can convert a sequence of runes back into a string using string(runes):

       string([]rune{66, 106, 246, 114, 107}) // "Björk"

Go 1.8 sort.Slice

2017-02-17T09:57:00.000+01:00

Go version 1.8 was released on February 16, 2017.

For me, the most noteworthy update was that a sort.Slice function has been added to the standard library.

Now you can sort a Go slice (list), without losing the will to live.
Go 1.8 will save lives.
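A minimal sketch of what sort.Slice looks like in use. The wordFreq type and the sortByFreq function are invented for this example. The only thing sort.Slice needs is the slice and a "less" function over two indices.

```go
package main

import (
	"fmt"
	"sort"
)

// wordFreq is a hypothetical type pairing a word with its frequency.
type wordFreq struct {
	word string
	n    int
}

// sortByFreq sorts ws in place: highest frequency first, and
// alphabetically by word when frequencies are equal.
func sortByFreq(ws []wordFreq) {
	sort.Slice(ws, func(i, j int) bool {
		if ws[i].n != ws[j].n {
			return ws[i].n > ws[j].n
		}
		return ws[i].word < ws[j].word
	})
}

func main() {
	ws := []wordFreq{{"b", 1}, {"a", 1}, {"c", 3}}
	sortByFreq(ws)
	fmt.Println(ws)
}
```

Before Go 1.8, the same thing required a named slice type with Len, Less and Swap methods to satisfy sort.Interface.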

Scala blunder: appending to a Seq that is a List

2011-10-21T15:50:00.000+02:00

I recently made a mistake in a loop reading lines from a file, doing some string manipulation and adding the result to a collection. A seemingly trivial Scala script just refused to halt. My mistake is illustrated by the following two toy examples, adding integers to a Vector and a Seq, respectively:

var x1 = Vector[Int]()
for(i <- 0 to 100000) { x1 = x1 :+ i }

var x2 = Seq[Int]()
for(i <- 0 to 100000) { x2 = x2 :+ i }

One of the above for loops runs about 38,648 times slower than the other one (according to a single, somewhat sloppy benchmark using Scala 2.9.1). The explanation, I believe, is that the Seq turned out to be backed by a List. Lists hate being appended to (:+), since every append has to copy the whole list, and this hatred manifests itself in bad performance. Good to know if you want a program to be impressively slow.

By the way, this made me think of another one:

var s1 = ""
for(i <- 0 to 100000) s1 = s1 + i

var s2 = ""
for(i <- 0 to 100000) s2 = s2.concat(i.toString)

I don't know why you'd want to create a string like the above, but the version using + is about four times slower than the one using concat (Scala 2.9.1).

Programming Scala without... anything

2011-09-28T13:51:00.000+02:00

You don't need all that fancy, modern stuff. A keyboard, a terminal window and the scala command are all you need:

$ scala -e 'println(io.Source.fromFile("freq_list.txt").getLines().map(_.split("\t")(0).toInt).sum)'
71213401

(Prints the result of summing the frequency numbers found in the first tab separated field of file freq_list.txt. The result turned out to be 71213401.)

When the programs get longer, you'd better stay focused.

Testing Scala 2.9.0 (RC2) parallel collections: four extra key strokes, double speed

2011-04-29T19:20:00.003+02:00

We have just tried the new parallel collections that you can find in Scala 2.9.0.RC2.

By adding .par at a few places, the software we tested ran almost twice (1.9 x) as fast on a two core processor. Running the same code on a four core processor was, as expected, quicker (2.7 x), but not four times as fast. That's quite a performance boost, with close to zero programming effort.

The software we've tested validates (electronic) pronunciation dictionaries, where each entry has an orthography, a phonetic transcription and some other stuff. The program runs a large number of quality checks to find problems (faulty transcriptions, inconsistencies, etc) that are hard or impossible for a human lexicographer to find. It runs hundreds or even thousands of validation rules, using regular expressions and other string processing, on a hundred thousand or more dictionary entries.

The software runs a sequence of validation rules on each input entry. The validation rules are independent of each other, suitable for running in parallel. The rules, living in a Seq, are applied in sequence in a call to map(...). By calling .par.map(...) on the Seqs holding the validation rules, a multi-core processor is now able to perform the validation in parallel (par returns a parallel version of a collection).

Apart from using parallel collections at the point where the validation rules are run, we also run the main loop, reading the input lexicon data, using a parallel collection. Adding parallel collections at different places (the outermost loop and inside the validation) seems to add to the performance gain.

An initial problem we had was that the Scala 2.9.0.RC2 API documentation fooled us into believing that foldLeft would, just like map, run in parallel. That appears to be incorrect. We had to change calls to foldLeft into calls to map (followed by an additional foldLeft to aggregate the result). I don't know if I've misunderstood the documentation, or if parallel foldLeft is pending.

Anyway, double speed, or more, with zero effort. It sounds too good to be true, but this quick test suggests that it works like a charm.

And now I want more cores.

Interview with Maxime Lévesque, author of Squeryl

2010-09-15T10:20:00.000+02:00

Squeryl is a great Scala database API. On its website, it is described like this: "A Scala ORM and DSL for talking with Databases with minimum verbosity and maximum type safety".

While preparing an introduction to Squeryl for a Swedish computer magazine, I sent a number of questions to Maxime Lévesque, the man behind Squeryl. The answers were so interesting that I asked his permission to post them here:

Could you describe yourself in a few words?

I'm a dad, a programmer, a hobbyist bass player and percussionist.

I'm the kind of programmer who prefers to write libraries and frameworks to writing applications. If I was in the construction industry I'd probably be making bricks, mortar and nails rather than houses.

Do you develop Squeryl as part of your work, or is it a hobby?

Squeryl started as a hobby, only later did I start using it in a commercial project.

What are the most important features of Squeryl? Why should you use it?

The main reason to use Squeryl in an application, in my opinion, is to have the data access code validated by the compiler. I've seen many projects where the database schema stops evolving after a lot of code has been written against it. Ugly workarounds are sometimes chosen because there isn't enough time to investigate the repercussions of a schema change or conduct all the testing required.

Strongly typed languages are good for "deterministic refactoring". A data access layer needs to be refactorable, as any part of a system does. Perhaps to an even greater extent, because in a sense, bad design decisions get persisted with the data.

A developer needs all the help he can get from tools such as compilers and IDEs. Hard work and discipline don't scale. Why rely on it when you can have automated validation?

Reusability is another big one. Squeryl queries are composable, reusable pieces of code. A query that encodes a particular piece of application logic needs only be written once, and reused anywhere it is needed. I'm a big believer in the DRY principle (Don't Repeat Yourself).

Low verbosity would be another strength. I dislike APIs or frameworks that require you to write more than you should.

What's the story behind Squeryl?

In 2005 I wrote an ORM for dotNet. I was in need of one at the time and I couldn't find a decent one that exploited generics and annotations, so I wrote my own. By the time I considered publishing it, LINQ came out, and instantly obsoleted my ORM (and all other ORMs except HaskellDB in my opinion).

A few years later I started to write a query DSL in Java, and at every step, I got bitten by language limitations. Every time I worked around them, the solution became a bit more ugly and verbose. I then discovered Scala, and started experimenting with writing a statically typed query DSL. I was amazed by the expressivity of the language.

The fact that it was possible to write Squeryl as a library (i.e., without a compiler plug-in) speaks a lot about the potency of the language. The first two attempts were abandoned when they reached a critical level of inelegance. They were Squeryl's pre-history.

Squeryl is in fact my third attempt at a Scala ORM. When I became confident that a fourth rewrite wouldn't be necessary, I published it on GitHub.

If Squeryl didn't exist, what would you use?

If Squeryl didn't exist, I'd have a look at ScalaQuery or Circumflex. I only have a superficial knowledge of them, but I would surely try them out before going to any of the Java based ORMs.

If you are to demo Squeryl (e.g., to a Java programmer), do you have a favourite example?

Here's a one liner that says a lot :

val avgHeight: Option[Float] = 
  from(people)(p => compute(avg(p.heightInCentimeters)))

Apart from the shortness of the code, we can see a few implicit conversions at work. The compiler "knows" that the avg query can translate into a 32 bit floating point value, but it also "knows" that it is an Option[], because the avg aggregate function is not guaranteed to return something (the table can be empty). In fact it won't compile if you try to refer to it as a (non Option[]) Float.

Where has Squeryl turned up? Who uses it?

I haven't made any survey, it's on my todo list, but I've exchanged emails with developers that are building systems with Squeryl in fields ranging from finance to bioinformatics.

I read something about Lift...?

Ross Mellgren from the Lift team has written an integration module that is part of Lift 2.1 (release candidate).

What's on the roadmap?

High on my priority list is free text search (backed by Lucene). Longer term I'd like to add things like support for sharding and extending the DSL to exploit the geospatial capabilities of databases like Postgres, Oracle and H2.

Is it of any importance that Squeryl was written in Scala? Or was this merely a coincidence?

Without Scala there wouldn't be a way to have strongly typed queries on the JVM without having verbosity that reaches a caricatural level. Not only wouldn't there be Squeryl, but there wouldn't be anything like it.

When Java came out I was impressed with all the features it had built in: serialization, RMI, garbage collection, portability. It was in its time a game changing technology. Today I have the same impression of Scala: the level of static validation that it gives you, all this with minimal verbosity. If I could say just one thing to qualify it, I'd have to say: game changing.

So the answer is yes, Scala made Squeryl possible. I expect a lot of interesting Scala DSLs will get written in many domains in the coming years. I have a few other DSLs I'd like to write myself.

Any particular advice for someone beginning with Squeryl?

I would just copy an example from the Squeryl site, and modify it gradually so that it becomes your own schema. And most importantly, don't hesitate to ask questions in the discussion groups. I'm often impressed by the quality of the answers given by the community.

Thanks a lot for the great answers!

Using the Scala REPL to tell the difference between ЕКАТEРИНБУРГ and ЕКАТЕРИНБУРГ

2010-05-06T15:46:00.013+02:00

Sometimes, one runs into UTF-8 strings with characters from different Unicode blocks. This is problematic in cases where the characters look the same, but are in fact different. The Scala REPL is handy for finding out what Unicode block each character in a string belongs to. Let's use "ЕКАТEРИНБУРГ" and "ЕКАТЕРИНБУРГ" as examples:

scala> "ЕКАТEРИНБУРГ" == "ЕКАТЕРИНБУРГ"
res0: Boolean = false

scala> import java.lang.Character.UnicodeBlock
import java.lang.Character.UnicodeBlock

scala> "ЕКАТEРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
E BASIC_LATIN
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

scala> "ЕКАТЕРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
Е CYRILLIC
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

scala>

The REPL exposed one of the seemingly identical strings to be an unhealthy mix of Latin and Cyrillic characters. Thanks, REPL.

A tiny Scala case class to clean up user input

2010-04-11T00:31:00.053+02:00

We needed some cleaning up of user input entered into a text field. We ended up with a Scala case class that cleans up its constructor string argument a bit, by removing multiple whitespace characters and trimming it. It behaves like this:

scala> Text("    a       a      ") == Text("a a")                                                                
res0: Boolean = true
scala> Text("   a      a       ").text == Text("a a").text                                                      
res1: Boolean = true
scala> Text("  a     a    ").text          
res2: java.lang.String = a a

The code looks like this:

case class Text(private var _text: String) {
 val text = _text.trim.replaceAll(" +", " ")
 _text = text
}

Since the input string, var _text, is private, we can manipulate it a bit, without making it possible for others to tamper with. I'm not sure if this is the obvious way to do it, but it seems to work as intended.

We tried a similar version that did not work:

// Doesn't work
case class BrokenText(private var _text: String) {
  _text = _text.trim.replaceAll(" +", " ")
  val text = _text
}

This version does not work, since BrokenText.text will return the original string, not the cleaned up one:

scala> BrokenText("    a      b      ")
res0: BrokenText = BrokenText(a b)
scala> res0.text
res1: String =     a      b    
scala>

Why doesn't the second version work? Beats me. (But I'm sure the answer will turn out to be obvious.)

Update: See the two anonymous comments below: one answering my question above, the other one suggesting a neater way of handling it. Thanks.

Scala: Getting into performance trouble, calling head and tail on an ArrayBuffer

2010-02-04T09:50:00.003+01:00

Update: The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.

====================================

Recently, I wrote the following two different versions for doing the same thing (compute frequencies):

// Version 1  --- Don't do this, lousy performance
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = { 
 import annotation._
 @tailrec
 def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
   seq match {
     case s if s.isEmpty => map
     case s => {
       val elem = s.head
       val n = map.getOrElse(elem, 0) + 1
       freq(s.tail, map + (elem -> n ))
     }
   }
 }
 freq(seq, Map())
}

// Version 2 --- 260 times faster than Version 1 on some input
def freq[T](seq: Seq[T]): Map[T, Int] = {
 val freqs = collection.mutable.HashMap[T, Int]()
 for(elem <- seq) {         
   val n = freqs.getOrElseUpdate(elem, 0)         
   freqs.update(elem, n + 1)     
 }     
 // Return immutable copy of freqs     
 Map() ++ freqs
}

When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:

val linesList = io.Source.fromPath("testfile.txt").getLines().toList
val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq

Version 1, called with linesSeq as input, performs horribly compared to when called with linesList. On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) the default implementation of Seq is an ArrayBuffer, and 2) calling head and tail on an ArrayBuffer is costly. The same operations are cheap on a List. That's why Version 1 above is a performance trap.

A possible way of getting better performance is to change the inner, two-argument freq method to use List instead of Seq:

// Version 1.b  --- Somewhat better
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = { 
 import annotation._
 @tailrec
 def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {
   seq match {
     case s if s.isEmpty => map
     case s => {
       val elem = s.head
       val n = map.getOrElse(elem, 0) + 1
       freq(s.tail, map + (elem -> n ))
     }
   }
 }
 freq(seq.toList, Map())
}

Better yet --- in Scala 2.8 --- is to scrap the entire method, and call groupBy(identity).mapValues(_.length) directly on the Seq...

Counting Strings and Things in Scala (2.8)

2010-02-01T22:26:00.041+01:00

I often need to count the frequencies of strings ("words", typically). Below are a few Scala snippets for counting strings and things. (Don't miss the last one.)

First try

Let's start with a method for counting string frequencies in a list:

// Scala 2.8
def freq(wds: List[String]): Map[String, Int] = { 
  import annotation._
  @tailrec
  def freq(wds: List[String], map: Map[String, Int]): Map[String, Int] = {
    wds match {
      case l if l.isEmpty => map
      case l => {
        val elem = l.head
        val n = map.getOrElse(elem, 0) + 1
        freq( l.tail, map + (elem -> n ) )
      }
    }
  }
  freq(wds, Map())
}

It takes a list of strings, and returns a map (hash table) with a frequency count for each unique string. The one argument freq method contains an embedded two argument freq method. The second method recursively consumes elements of the list, incrementing the frequency count in its second argument, the accumulator map. The two argument method is initialised with an empty map (at the end of the one argument method, freq(wds, Map())).

In each recursion, a new, immutable word frequency map is produced, with the incremented frequency count. The

import annotation._
@tailrec

part tells the compiler to check whether it can optimize the tail recursive call or not. (The Scala compiler can optimize a special case of tail recursion.)

If the recursion makes you dizzy, you can use a mutable HashMap instead:

def freq(wds: List[String]): Map[String, Int] = {
  val freqs = collection.mutable.HashMap[String, Int]()
  for(w <- wds) {     
    val n = freqs.getOrElseUpdate(w, 0)     
    freqs.update(w, n + 1)   
  }   
  // Return immutable copy of freqs   
  Map() ++ freqs 
}

Second try

You'll soon find out that the above code is limited, since it only accepts List input. There is a more general concept, Seq, that will make it possible to call freq with different kinds of sequences (lists, listbuffers, arrays):

// Scala 2.8
def freq(wds: Seq[String]): Map[String, Int] = {  
  import annotation._
  @tailrec
  def freq(wds: Seq[String], map: Map[String, Int]): Map[String, Int] = {
    wds match {
      case l if l.isEmpty => map
      case l => {
        val elem = l.head
        val n = map.getOrElse(elem, 0) + 1
        freq( l.tail, map + (elem -> n ) )
      }
    }
  }
  freq(wds, Map())
}

Third try

One day you find yourself relocated from the word counting department to the character counting department. A string is a sequence, but of Chars, not Strings. The code above will not help you count character frequencies. Here is an attempt at generalising the code further, to make it able to count the frequencies of any thing, T, not just String:

// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {  
  import annotation._
  @tailrec
  def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
    seq match {
      case s if s.isEmpty => map
      case s => {
        val elem = s.head
        val n = map.getOrElse(elem, 0) + 1
        freq(s.tail, map + (elem -> n ))
      }
    }
  }
  freq(seq, Map())
}

Here's the more general non-tail-recursive version:

def freq[T](seq: Seq[T]): Map[T, Int] = {
  val freqs = collection.mutable.HashMap[T, Int]()
  for(elem <- seq) {     
    val n = freqs.getOrElseUpdate(elem, 0)     
    freqs.update(elem, n + 1)   
  }
  // Return immutable copy of freqs   
  Map() ++ freqs 
}

Hooray.

Last try (shamelessly lifted from someone at #scala)

But... you can still do better than this. A while ago, someone on the #scala irc channel (unfortunately, I don't remember this person's name) answered a question on how to associate each integer in a sequence with the number of times it occurred (or something like that). It turns out that, in Scala 2.8, it is possible to write a frequency counting thing even more compactly:

def freq[T](seq: Seq[T]) = seq.groupBy(x => x).mapValues(_.length)

It's so short that it's almost not worth defining a method/function for it. You can simply call .groupBy(x => x).mapValues(_.length) directly on your Seq. (Or groupBy(identity).mapValues(_.length), which is the same thing.)

Double hooray.

Benchmarking is tricky, but a small test indicates that the last, most beautiful, version is also the quickest, and that the recursive ones using only immutable maps (in some situations) are quite slow.

Beware! scala.swing.TextField proclaims EditDone when it isn't

2009-11-22T16:33:00.023+01:00

Update: Forget about EditDone. See Update below!

scala.swing.TextField is a basic GUI component that can be used for
letting the user input a line of text. When listening to this component, one can react to an EditDone event:

// Inside some GUI component ...
val textField = new TextField(20)
contents += textField
listenTo(textField)

reactions += {case EditDone(`textField`) =>
  println("Ok, searching DB for input "+ textField.text)
}
//...

Fine. Whenever the user (me) hits the Enter key, the message, "Ok,
searching DB for input ...", simulating a database search, is printed.

However, what happens when some unrelated software product suddenly
pops up a window while the user (me) is still inputting text into the
TextField? I tell you what: The evil, non-sentient contraption prints
the simulated search message --- just as if I had hit Enter.

When the TextField loses focus, it emits an EditDone event. But I'm
not done editing. I've only typed "a". I was about to type
"abecedarian". Now the silly thing will search the database for all
words containing the letter "a". I never told it to do that. This
happened just because some other, unrelated, ill-behaving program
grabbed the focus.

Of course, the focus may also be lost because the user voluntarily
changes windows (for instance, in order to Google for "abecedarian").

As far as I can tell, there is no sane way to tell an EditDone event
produced by the user (me) hitting Enter from an EditDone event
produced because the TextField component lost focus. This cannot be
right.

(A while ago, I asked about this on the Scala-user list. Not one single
answer from one single soul in the entire Universe. It feels lonely.)

(I'm using Scala 2.8.)

Update: Forget about EditDone.

What you should do is listen not to the TextField, but to TextField.keys. This way, you'll be able to catch a KeyPressed event, and check if the key pressed was Enter. Simple.

It's a bit tricky to figure out, however, since it's not in the TextField Scala docs (you'll have to find your way to scala.swing.Component). This is how it could look:

import swing._
import event._

//...

// Inside some GUI component ...
val textField = new TextField(20)
contents += textField

listenTo(textField.keys)

import Key._
reactions += {case KeyPressed(`textField`, Enter, _, _) =>
  println("Ok, searching DB for input "+ textField.text)
}
//...

Thanks to Ingo Maier for explaining this.

Source's getLines in Scala 2.8 now strips line end

2009-11-16T15:10:00.018+01:00

In Scala 2.8 (not yet officially released), scala.io.Source has been updated.

When reading lines from a file, you no longer need to trim the lines, since newlines are removed by default. The code to read lines from a file using Source may now look something like this (where fName is a file name, a string):

val lines = io.Source.fromPath(fName) getLines()

If you want to specify the input file encoding to be UTF-8, you could try this:

val lines = io.Source.fromPath(fName)("UTF8") getLines()

When you look at the API documentation, you'll find that fromPath takes a Codec as a second implicit parameter. Through some mysterious conversion (or "implicit conversion"), you can call it with a string ("UTF8") instead, as in the example above.

Anyway, no more Source.fromFile(fName).getLines.map(_.stripLineEnd). Someone is improving Scala!

Cracker: "Tired of coding Perl"

2009-08-17T11:05:00.004+02:00

Perl is doomed.

Even Cracker is tired of it. The evidence is found in a song on their latest album, where they sing "I'm tired of coding Perl, tired of V.B.A."

Take a look some 30 seconds into the video. The guy browsing the mod_perl Developer's cookbook is not happy. The "Turn on, tune in, drop out with me" video is here.

Perl is doomed.

Scala case classes don't have auxiliary constructors?

2009-07-29T16:00:00.008+02:00

The lesson of today is that Scala case classes don't appear to have auxiliary constructors.

In Scala, auxiliary constructors may be added to a class by defining a "this" method:


scala> class AClass(s1: String, s2: String) {
  def this(s: String) = this(s, "default")
}
defined class AClass

scala> new AClass("hey")
res0: AClass = AClass@187b5ff

Look what happens when you try the same trick on a case class:


scala> case class ACaseClass(s1: String, s2: String) {
  def this(s: String) = this(s, "default")
}
defined class ACaseClass

scala> ACaseClass("hey")
:7: error: wrong number of arguments for method apply: (String,String)ACaseClass in object ACaseClass
ACaseClass("hey")
^

The attempt at adding an auxiliary constructor compiles, but the call ACaseClass("hey") is rejected with an error.

Update: Oops, yes, they can have auxiliary constructors --- see the comment below, by jkriesten, straightening things out!

Update: Paul (see comment below) points to the following discussion on this topic http://www.scala-lang.org/node/976.

Printing the Unicode code points of UTF8 characters (Scala)

2009-06-09T21:53:00.024+02:00

Sometimes it is useful to be able to print the Unicode code point of a UTF8 character. (For instance, when you need to check if you mistakenly use a similar looking character instead of the one you're supposed to use.)

Using Scala's RichString's format method, you can create a string of a zero padded, four digit, hexadecimal Unicode number, for example of the 'ä' character, like this:

scala> "%04X".format('ä'.toInt)
res0: String = 00E4

scala>

Here's a related example, printing a tab separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for using in Scala/Java strings:

scala> "ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"\
.map(c => "%s\t\\u%04X".format(c, c.toInt))\
.foreach(println)
ɸ \u0278
β \u03B2
f \u0066
v \u0076
θ \u03B8
ð \u00F0
s \u0073
z \u007A
ʃ \u0283
ʒ \u0292
ʂ \u0282
ʐ \u0290
ç \u00E7
ʝ \u029D
x \u0078
ɣ \u0263
χ \u03C7
ʁ \u0281
ħ \u0127
ʕ \u0295
ʜ \u029C

scala>

(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)

Knowing the codepoints can be useful, e.g. when you don't want to or can't input non-ASCII characters into your code:

scala> var v = "\u0278"
v: java.lang.String = ɸ

scala>

In Java, it looks similar, but you have to cast your chars to ints:

String.format("%04X", (int) 'ä'), etc.
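One caveat: toInt on a Char gives a UTF-16 code unit, which is the same as the code point only for characters in the Basic Multilingual Plane. Below is a hedged sketch for strings that may contain characters outside it. It assumes Java 8 or later (for String.codePoints) and a current Scala version; CodePoints and hex are names made up for this example:

```scala
object CodePoints {
  // Formats every Unicode code point of s as U+XXXX, zero padded to at
  // least four hexadecimal digits. Iterating over code points (not Chars)
  // keeps surrogate pairs together.
  def hex(s: String): List[String] =
    s.codePoints().toArray().toList.map(cp => "U+%04X".format(cp))
}
```

For example, CodePoints.hex("ä") gives List(U+00E4), and a character stored as a surrogate pair, such as "\uD83D\uDE00", comes out as the single code point U+1F600.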

The perils of changing the case of UTF8 strings

2009-03-24T09:26:00.033+01:00

Below are a few examples of what happens to some just slightly exotic UTF8 strings when up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters. The Greek Sigma has one uppercase variant, but two different lowercase versions: one word final (ς); one for other positions (σ) (explaining my not-so-very-amusing joke in an earlier post).

In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).

Last, the Turkish variants of <i>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.

The table was generated using Scala (thus Java) strings. The column EqIgnoreCase reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's equalsIgnoreCase. The two rightmost columns present the length of the string before and after changing the case up and down again.

Orig	UpCase ↑	UpDown ⇅	EqIgnoreCase	OrigLen	NewLen
ß	SS	ss	false	1	2
ςσ	ΣΣ	σς	true	2	2
ΰΐ	Ϋ́Ϊ́	ΰΐ	false	2	6
iİıI	IİII	iiii	true	4	4
iİıI	İİII	iiıı	true	4	4

The lesson? Nothing special. That you can do terrible things to strings. That changing the case of strings may be an irreversible operation. That if you are to normalize some text into either lower or uppercase, you might need to decide what's most suitable for a given language. That it might be a good idea to keep the original strings after normalization. That using the correct locale might help. That I'm not a graphic designer (the table is hideous).
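The round trip behind the table can be sketched roughly as follows. This is not the code that generated the table, just a minimal illustration with an explicit locale; CaseRoundTrip, upDown and survives are names invented here:

```scala
import java.util.Locale

object CaseRoundTrip {
  // Upper-cases and then lower-cases s, using the given locale for both steps.
  def upDown(s: String, locale: Locale): String =
    s.toUpperCase(locale).toLowerCase(locale)

  // True if the string comes back unchanged from the round trip.
  def survives(s: String, locale: Locale): Boolean =
    upDown(s, locale) == s
}
```

For instance, CaseRoundTrip.upDown("ß", Locale.ROOT) gives ss, while the string "iı" survives the round trip when Locale.forLanguageTag("tr") is used, but ends up as ii with Locale.ROOT.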

Scala: Reversing a string by up- and then downcasing it

2009-03-08T00:14:00.005+01:00

Did you know that you can reverse a string by merely upcasing it and then downcasing it again? Here's an example:

scala> val s = "ςσσ"
s: java.lang.String = ςσσ

scala> s.toUpperCase.toLowerCase == s.reverse.toString
res0: Boolean = true

scala>

If you don't believe me, just copy and paste the two lines of code above into the Scala interpreter, and see it for yourself.

The Firebird database: Problem handling UTF8 characters

2009-03-05T11:26:00.014+01:00

The 'Latin capital letter I with dot above', İ (Unicode 0130), strikes again! This innocent looking Turkish character seems to be reliable when it comes to breaking software that should be able to handle UTF8. (See also this post for a Java example.)

This time it breaks the Firebird database (in my case, v2.1.1 on a 64-bit Debian system). Downcasing some random characters in a database configured to handle UTF8 works fine:

SELECT LOWER('AӴЁΪΣƓ') FROM RDB$DATABASE

returns the expected string, aӵёϊσɠ.

However, when you throw in the trouble-making İ, everything blows up:

SELECT LOWER('AӴЁΪΣƓİ') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -104
Invalid token

Engine Code    : 335544849
Engine Message :
Malformed string

Slightly different input generates a different error message:

SELECT LOWER('İA') FROM RDB$DATABASE
*** IBPP::SQLException ***
Context: Statement::Fetch
Message: isc_dsql_fetch failed.

SQL Message : -802
Arithmetic overflow or division by zero has occurred.

Engine Code    : 335544321
Engine Message :
arithmetic exception, numeric overflow, or string truncation

There is an item on the Firebird user list, but without any answers so far.

Update: As mariuz points out in a comment below, this defect now seems to be fixed in an upcoming version. See this bug tracker item.

Book: Real World Haskell (not much real world so far :)

2009-01-06T13:06:00.020+01:00

I've just started to read Real World Haskell (the paper book). It seems like a nice book (except for a few irritating and confusing typos/mistakes at the start of the book).

However, I've read more than 100 pages so far, and there is still no sign of any of the "real world" stuff promised by the title. I still don't know much, if anything, about IDEs, how to compile the code, scripting, practical details on how to structure your code into modules, or anything in that direction. So far, it has mostly been (sometimes rather long-winded) discussions of specific (list) functions. One of the examples ends in a conclusion that might be paraphrased as "by the way, don't use the function we've discussed the last few pages; in real world settings it doesn't work too well".

In the real world, you run into both needles and haystacks, occasionally, but that doesn't help in making sense of

isInAny3 needle haystack = any (isInfixOf needle) haystack

And one more real world example of the kind zip3foobar "quux" and I may start losing interest... or just start screaming.

Well, the upcoming chapters have promising titles, so I guess I just have to keep reading. And I guess you have to start with the basics. Still, over 100 pages, and mostly foobars so far...

The book is available on-line.

Scala for small throw-away scripting tasks

2008-12-18T23:33:00.027+01:00

I've come to use Scala for tiny scripts to be thrown away after doing some small task. Typically this involves processing a few files, comparing some textual data, maybe extracting some fields of tab-separated files, etc. The kind of things that Perl used to be the obvious choice for.

Although lacking Perl's simplified syntax for iterating over all lines in files, Scala works quite nicely for small tasks.

For example, today I had to extract from a file all lines consisting solely of four or more upper-case (A-Z) characters, and capitalize the output:

scala.io.Source.fromFile(args(0))
.getLines.map(_.stripLineEnd).filter(_.matches("[A-Z]{4,}"))
.map(_.toLowerCase.capitalize).foreach(println)

Not exactly a thing of beauty, but it only took a minute and it works. And it reminds me a bit of a classic Unix command line pipeline.

A few things on my wish-list to make Scala even better for small scripts:

A nicer way of setting the output character encoding (currently you have to do something like Console.setOut(new java.io.PrintStream(Console.out,true,"UTF8")))
It would be great if Source.getLines could remove the newline character of each line
A better name for RichString.stripLineEnd (for some reason, it is totally impossible for me to remember the name of this method)
Maybe scripting support in the Scala Netbeans plugin? (Currently, I think the plugin wants you to put your code in a class/object)
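For comparison, here is a hedged sketch of how the extraction above might look in a newer Scala. It assumes Scala 2.13 or later, where getLines returns the lines without their line endings and scala.util.Using is available for closing the file; UpperCaseLines and extract are names made up for the sketch:

```scala
import scala.io.Source
import scala.util.Using

object UpperCaseLines {
  // Returns the lines of the file that consist solely of four or more
  // ASCII upper-case letters, lower-cased and then capitalized.
  // The file is read as UTF-8 and closed when done.
  def extract(fileName: String): List[String] =
    Using.resource(Source.fromFile(fileName, "UTF-8")) { source =>
      source.getLines()
        .filter(_.matches("[A-Z]{4,}"))
        .map(_.toLowerCase.capitalize)
        .toList
    }
}
```

In a script, something like UpperCaseLines.extract(args(0)).foreach(println) would then print the result.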

Scala: Reading a tab separated file into a Map (first attempt)

2008-12-12T12:01:00.013+01:00

Below is my first attempt, in Scala, at reading a tab separated file into a map, where the first and second fields of the input file make up the key-value pairs.

There are probably better ways of doing it, but the following seems to work:

val keyValuePairs = scala.io.Source.fromFile(inputFileName, "UTF8")
       .getLines.map(_.stripLineEnd.split("\t", -1))
       .map(fields => fields(0) -> fields(1)).toList

val map = Map(keyValuePairs : _*)

The keyValuePairs: _* stuff is a way to pass a list (keyValuePairs) to something that takes a variable number of arguments, in this case the factory method of (the immutable) Map.

I'm pretty sure that there are neater ways of doing it. Furthermore, the above snippet does not do any sensible error checking or input validation (such as skipping empty lines, for instance).
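A minimal sketch of a slightly more defensive variant, assuming a current Scala version. It works on an iterator of lines (already stripped of line endings) instead of a file, and silently skips lines with fewer than two fields, which includes empty lines. TsvToMap and parse are names made up for this example, and skipping bad lines is just one possible policy:

```scala
object TsvToMap {
  // Builds a map from the first two tab-separated fields of each line.
  // Lines with fewer than two fields (e.g. empty lines) are skipped.
  // If a key occurs more than once, the last occurrence wins.
  def parse(lines: Iterator[String]): Map[String, String] =
    lines
      .map(_.split("\t", -1))
      .collect { case fields if fields.length >= 2 => fields(0) -> fields(1) }
      .toMap
}
```

It could be fed with, for instance, the lines from scala.io.Source.fromFile(inputFileName, "UTF8").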

Intelligent Software: Netbeans (or JUnit?) can count to three!

2008-12-11T16:26:00.009+01:00

I just noticed a (very) small detail in Netbeans. I was adding some unit tests, when I noticed that Netbeans can count to, at least, three.

When running a JUnit test suite of only one test, you get the message "The test passed". After adding another test, the message is "Both tests passed", then "3 tests passed", etc. (Well, of course, given that the tests pass.)

Now, that's what I call (artificial) intelligence.

Here's an unrelated article on counting to three (and more).

Scala: Beware of inadvertently shadowing variables

2008-12-09T22:01:00.010+01:00

I've just spent 15 minutes looking for a stupid mistake in some Scala code. The problem was that I had shadowed a variable.

In some situations in Scala, you are allowed to shadow variables. In other words, it is sometimes legal to give a new variable the same name as an existing one. This can lead to mistakes. The following legal code illustrates how you can shadow a method input variable:

def theShadow(list: Array[String]): Seq[String] = {
  // Mistake! Inadvertently
  // shadowing the input parameter:
  val list = List("Asa", "nisi", "masa")
  list
}

(The above is a very obvious example. When you make this mistake in real code, it will probably be in a less obvious context.)
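A hypothetical example of a less obvious context is a function literal whose parameter happens to have the same name as a method parameter. The sketch below is made up for illustration (Shadowing, longerThanBuggy and longerThan are not from any real code):

```scala
object Shadowing {
  // Intended: keep the words that are longer than the given word.
  // Bug: the parameter of the function literal is also called `word`, so it
  // shadows the method parameter and each element is compared with itself.
  def longerThanBuggy(word: String, words: List[String]): List[String] =
    words.filter(word => word.length > word.length)

  // Fixed: a different name for the function literal's parameter.
  def longerThan(word: String, words: List[String]): List[String] =
    words.filter(w => w.length > word.length)
}
```

The buggy version compiles, but always returns an empty list.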

Scala: XML serializer adds closing elements to empty elements

2008-12-09T12:48:00.017+01:00

When printing Scala XML nodes/elements, closing tags for empty elements are added, even if there weren't any in the input.

For example, if you input <childless/>, the XML processor will add a closing tag like this:

scala> val elem = <childless/>
elem: scala.xml.Elem = <childless></childless>

(The two versions of the XML element are equivalent, but sometimes it is practical to be able to do a simple string comparison of the input and output XML files. The added closing tags may make this harder.)

See this thread.