Wednesday 15 September 2010

Interview with Maxime Lévesque, author of Squeryl

Squeryl is a great Scala database API. On its website, it is describe like this: "A Scala ORM and DSL for talking with Databases with minimum verbosity and maximum type safety".

Preparing an introduction to Squeryl for a Swedish computer magazine, I sent a number of questions to Maxime Lévesque, the man behind Squeryl. The answers were so interesting, that I asked his permission to post them here:



Could you describe yourself in a few words?

I'm a dad, a programmer, a hobbyist bass player and percussionist.

I'm the kind of programmer who prefers to write libraries and frameworks to writing applications. If I was in the construction industry I'd probably be making bricks, mortar and nails rather than houses.


Do you develop Squeryl as part of your work, or is it a hobby?

Squeryl started as a hobby, only later did I start using it in a commercial project.


What are the most important features of Squeryl? Why should you use it?

The main reason to use Squeryl in an application, in my opinion, is to have the data access code validated by the compiler. I've seen many projects where the database schema stops evolving after a lot of code has been written against it. Ugly workarounds are sometimes chosen because there isn't enough time to investigate the repercussions of a schema change or conduct all the testing required.

Strongly typed languages are good for "deterministic refactoring". A data access layer needs to be refactorable, as any part of a system does. Perhaps to an even greater extent, because in a sense, bad design decisions get persisted with the data.

A developer needs all the help he can get from tools such as compilers and IDEs. Hard work and discipline don't scale. Why rely on it when you can have automated validation?

Reusability is another big one. Squeryl queries are composable, reusable pieces of code. A query that encodes a particular piece of application logic needs only be written once, and reused anywhere it is needed. I'm a big believer in the DRY principle (Don't Repeat Yourself).

Low verbosity would be another strength. I dislike APIs or frameworks that require you to write more than you should.


What's the story behind Squeryl?

In 2005 I wrote an ORM for dotNet. I was in need of one at the time and I couldn't find a decent one that exploited generics and annotations, so I wrote my own. By the time I considered publishing it, LINQ came out, and instantly obsoleted my ORM (and all other ORMs except HaskelDB in my opinion).

A few years later I started to write a query DSL in Java, and at every step, I got bitten by language limitations. Every time I worked around them, the solution became a bit more ugly and verbose. I then discovered Scala, and started experimenting with writing a statically typed query DSL. I was amazed by the expressivity of the language.

The fact that it was possible to write Squeryl as a library (i.e., without a compiler plug-in) speaks a lot about the potency of the language. The first two attempts were abandoned when they reached a critical level of inelegance. They were Squeryl's pre-history.

Squeryl is in fact my third attempt at a Scala ORM. When I became confident that a fourth rewrite wouldn't be necessary, I published it on GitHub.


If Squeryl didn't exist, what would you use?

If Squeryl didn't exist, I'd have a look at ScalaQuery or Circumflex. I only have a superficial knowledge of them, but I would surely try them out before going to any of the Java based ORMs.


If you are to demo Squeryl (e.g., to a Java programmer), do you have a favourite example?

Here's a one liner that says a lot :

val avgHeight: Option[Float] = 
  from(people)(p => compute(avg(p.heightInCentimeters)))

Apart from the shortness of the code, we can see a few implicit conversions at work. The compiler "knows" that the sum query can translate into a 32 bit floating point value, but it also "knows" that it is an Option[], because the avg aggregate function is not guaranteed to return something (the table can be empty). In fact it won't compile if you try to refer to it as a (non Option[]) Float.


Where has Squeryl turned up? Who uses it?

I haven't made any survey, it's on my todo list, but I've exchanged emails with developers that are building systems with Squeryl in fields ranging from finance to bioinformatics.


I read something about Lift...?

Ross Mellgren from the Lift team has written an integration module that is part of Lift 2.1 (release candidate).


What's on the roadmap?

High on my priority list is free text search (backed by Lucene). Longer term I'd like to add things like support for sharding and extending the DSL to exploit the geospatial capabilities of databases like Postgres, Oracle and H2.


Is it of any importance that Squeryl was written i Scala? Or was this merely a coincidence?

Without Scala there wouldn't be a way to have strongly typed queries on the JVM without having verbosity that reaches a caricatural level. Not only wouldn't there be Squeryl, but there wouldn't be anything like it.

When Java came out I was impressed with all the features it had built in: serialization, RMI, garbage collection, portability. It was in its time a game changing technology. Today I have the same impression of Scala: the level of static validation that it gives you, all this with minimal verbosity. If I could say just one thing to qualify it, I'd have to say: game changing.

So the answer is yes, Scala made Squeryl possible. I expect a lot of interesting Scala DSLs will get written in many domains in the coming years. I have a few other DSLs I'd like to write myself.


Any particular advice for someone beginning with Squeryl?

I would just copy an example from the Squeryl site, and modify it gradually so that it becomes your own schema. And most importantly, don't hesitate to ask questions in the discussion groups. I'm often impressed by the quality of the answers given by the community.


Thanks a lot for the great answers!

Thursday 6 May 2010

Using the Scala REPL to tell the difference between ЕКАТEРИНБУРГ and ЕКАТЕРИНБУРГ

Sometimes, one runs into UTF-8 strings with characters from different code blocks. This is problematic in cases where the fonts look the same, but the characters are different. The Scala REPL is handy for finding out what Unicode block each character in a string belongs to. Let's use "ЕКАТEРИНБУРГ" and "ЕКАТЕРИНБУРГ" as examples:

scala> "ЕКАТEРИНБУРГ" == "ЕКАТЕРИНБУРГ"
res0: Boolean = false

scala> import java.lang.Character.UnicodeBlock
import java.lang.Character.UnicodeBlock

scala> "ЕКАТEРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
E BASIC_LATIN
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

scala> "ЕКАТЕРИНБУРГ".foreach(c => println(c +"\t"+ UnicodeBlock.of(c)))
Е CYRILLIC
К CYRILLIC
А CYRILLIC
Т CYRILLIC
Е CYRILLIC
Р CYRILLIC
И CYRILLIC
Н CYRILLIC
Б CYRILLIC
У CYRILLIC
Р CYRILLIC
Г CYRILLIC

scala>
The REPL exposed one of the seemingly identical strings to be an unhealthy mix of Latin and Cyrillic characters. Thanks, REPL.

Sunday 11 April 2010

A tiny Scala case class to clean up user input

We needed some cleaning up of user input entered into a text field. We ended up with a Scala case class that cleans up its constructor string argument a bit, by removing multiple whitespace characters and trimming it. It behaves like this:

scala> Text("    a       a      ") == Text("a a")                                                                
res0: Boolean = true
scala> Text(" a a ").text == Text("a a").text
res1: Boolean = true
scala> Text(" a a ").text
res2: java.lang.String = a a

The code looks like this:
case class Text(private var _text: String) {
val text = _text.trim.replaceAll(" +", " ")
_text = text
}
Since the input string, var _text, is private, we can manipulate it a bit, without making it possible for others to tamper with. I'm not sure if this is the obvious way to do it, but it seems to work as intended.

We tried a similar version that did not work:
// Doesn't work
case class BrokenText(private var _text: String) {
_text = _text.trim.replaceAll(" +", " ")
val text = _text
}
This version does not work since Text.text will return the original string, not the cleaned up one:
scala> BrokenText("    a      b      ")
res0: BrokenText = BrokenText(a b)
scala> res0.text
res1: String = a b
scala>
Why the second version doesn't work? Beats me. (But I'm sure the answer will turn out to be obvious.)

Update: See the two anonymous comments below: one answering my question above, the other one suggesting a neater way of handling it. Thanks.

Thursday 4 February 2010

Scala: Getting into performance trouble, calling head and tail on an ArrayBuffer

Update: The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.

====================================

Recently, I wrote the following two different versions for doing the same thing (compute frequencies):

// Version 1  --- Don't do this, lousy performance
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq, Map())
}

// Version 2 --- 260 times faster than Version 1 on some input
def freq[T](seq: Seq[T]): Map[T, Int] = {
val freqs = collection.mutable.HashMap[T, Int]()
for(elem <- seq) {
val n = freqs.getOrElseUpdate(elem, 0)
freqs.update(elem, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}


When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:

val linesList = io.Source.fromPath("testfile.txt").getLines().toList
val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq

Version 1 called with linesSeq as input, performes horrlibly compared to when called with linesList. On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) The default implementation of Seq is an ArrayBuffer, and 2) Calling head and tail on an ArrayBuffer is costly. The same operations are cheap on a List. That's why Version 1 above is a performance trap.

A possible way of getting better performance, is to change the inner, two argument, freq method to use List, instead of Seq:

// Version 1.b  --- Somewhat better
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq.toList, Map())
}


Better yet --- in Scala 2.8 --- is to scrap the entire method, and call groupBy(identity).mapValues(_.length) directly on the Seq...

Monday 1 February 2010

Counting Strings and Things in Scala (2.8)

I often need to count the frequencies of strings ("words", typically). Below are a few Scala snippets for counting strings and things. (Don't miss the last one.)

First try

Let's start with a method for counting string frequencies in a list:

// Scala 2.8
def freq(wds: List[String]): Map[String, Int] = {
import annotation._
@tailrec
def freq(wds: List[String], map: Map[String, Int]): Map[String, Int] = {
wds match {
case l if l.isEmpty => map
case l => {
val elem = l.head
val n = map.getOrElse(elem, 0) + 1
freq( l.tail, map + (elem -> n ) )
}
}
}
freq(wds, Map())
}

It takes a list of strings, and returns a map (hash table) with a frequency count for each unique string. The one argument freq method contains an embedded two argument freq method. The second method recursively consumes elements of the list, incrementing the frequency count of the second accumulator argument. The two argument method is initialised with an empty map (at the end of the one argument method, freq(wds, Map()).

In each recursion, a new, immutable word frequency map is produced, with the incremented frequency count. The
import annotation._
@tailrec
part tells the compiler to check whether it can optimize the tail recursive call or not. (The Scala compiler can optimize a special case of tail recursion.)

If the recursion makes you dizzy, you can use a mutable HashMap instead:
def freq(wds: List[String]): Map[String, Int] = {
val freqs = collection.mutable.HashMap[String, Int]()
for(w <- wds) {
val n = freqs.getOrElseUpdate(w, 0)
freqs.update(w, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}

Second try

You'll soon find out that the above code is limited, since it only accepts List input. There is a more general concept, Seq, that will make it possible to call freq with different kinds of sequences (lists, listbuffers, arrays):
// Scala 2.8
def freq(wds: Seq[String]): Map[String, Int] = {
import annotation._
@tailrec
def freq(wds: Seq[String], map: Map[String, Int]): Map[String, Int] = {
wds match {
case l if l.isEmpty => map
case l => {
val elem = l.head
val n = map.getOrElse(elem, 0) + 1
freq( l.tail, map + (elem -> n ) )
}
}
}
freq(wds, Map())
}

Third try

One day you find yourself relocated from the word counting department to the character counting department. A string is a sequence, but of Chars, not Strings. The code above will not help you count character frequencies. Here is an attempt at generalising the code further, to make it able to count the frequencies of any thing, T, not just String:
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq, Map())
}



Here's the more general non-tail-recursive version:
def freq[T](seq: Seq[T]): Map[T, Int] = {
val freqs = collection.mutable.HashMap[T, Int]()
for(elem <- seq) {
val n = freqs.getOrElseUpdate(elem, 0)
freqs.update(elem, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}

Hooray.

Last try (shamelessly lifted from someone at #scala)

But... you can still do better than this. A while ago, someone on the #scala irc channel (unfortunately, I don't remember this persons name) answered a question on how to associate each integer in a sequence with the number of times each integer occurred (or something like that). It turns out that, in Scala 2.8, it is possible to write a frequency counting thing even more compactly:
def freq[T](seq: Seq[T]) = seq.groupBy(x => x).mapValues(_.length)
It's so short, that it's almost not worth defining a method/function for it. You can simply call .groupBy(x => x).mapValues(_.length) directly on your Seq. (Or groupBy(identity).mapValues(_.length), which is the same thing.)

Double hooray.

Benchmarking is tricky, but a small test indicates that the last, most beautiful, version is also the quickest, and that the recursive ones using only immutable maps (in some situations) are quite slow.