Thursday, 4 February 2010

Scala: Getting into performance trouble, calling head and tail on an ArrayBuffer

Update: The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.

====================================

Recently, I wrote the following two different versions for doing the same thing (compute frequencies):

// Version 1  --- Don't do this, lousy performance
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq, Map())
}

// Version 2 --- 260 times faster than Version 1 on some input
def freq[T](seq: Seq[T]): Map[T, Int] = {
val freqs = collection.mutable.HashMap[T, Int]()
for(elem <- seq) {
val n = freqs.getOrElseUpdate(elem, 0)
freqs.update(elem, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}


When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:

val linesList = io.Source.fromPath("testfile.txt").getLines().toList
val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq

Version 1 called with linesSeq as input, performes horrlibly compared to when called with linesList. On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) The default implementation of Seq is an ArrayBuffer, and 2) Calling head and tail on an ArrayBuffer is costly. The same operations are cheap on a List. That's why Version 1 above is a performance trap.

A possible way of getting better performance, is to change the inner, two argument, freq method to use List, instead of Seq:

// Version 1.b  --- Somewhat better
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq.toList, Map())
}


Better yet --- in Scala 2.8 --- is to scrap the entire method, and call groupBy(identity).mapValues(_.length) directly on the Seq...

2 comments:

martin said...

Scala 2.8 final will have more robust defaults. toSeq will return a Vector which will have reasonable complexity for both length/apply and head/tail. Vectors are essentially the same as Clojure's but with optimized appends.

Anonymous said...

Nikolaj, your 'groupBy' solution is fine. If speed is still an issue, do as indicated below. If that's not fast enough, use a mutable HashMap instead.

def countOccurences[K](s: Traversable[K]): Map[K, Int] = {
var m = Map[K, Int]()
for (elem <- s) {
m = m updated (elem, m.getOrElse(elem,0) + 1)
}
m
}