Showing posts with label seq. Show all posts
Showing posts with label seq. Show all posts

Friday, 21 October 2011

Scala blunder: appending to a Seq that is a List

I recently made a mistake in a loop reading lines from a file, doing some string manipulation and adding the result to a collection. A seemingly trivial Scala script just refused to halt. My mistake is illustrated by the following two toy examples, adding integers to a Seq and a Vector, respectively:

var x1 = Vector[Int]()
for(i <- 0 to 100000) { x1 = x1 :+ i }

var x2 = Seq[Int]()
for(i <- 0 to 100000) { x2 = x2 :+ i }
One of the above for loops runs about 38,648 times slower than the other one (according to a single, somewhat sloppy benchmark using Scala 2.9.1). The explanation, I believe, is that the Seq turned out to be backed by a List. Lists hate being appended to (:+), and this hatred manifests itself in bad performance. Good to know if you want a program to be impressingly slow.

By the way, this made me think of another one:
var s1 = ""
for(i <- 0 to 100000) s1 = s1 + i

var s2 = ""
for(i <- 0 to 100000) s2 = s2.concat(i.toString)
I don't know why you'd want to create a string like the above, but the version using + is about four times slower than the one using concat (Scala 2.9.1).

Thursday, 4 February 2010

Scala: Getting into performance trouble, calling head and tail on an ArrayBuffer

Update: The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.

====================================

Recently, I wrote the following two different versions for doing the same thing (compute frequencies):

// Version 1  --- Don't do this, lousy performance
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq, Map())
}

// Version 2 --- 260 times faster than Version 1 on some input
def freq[T](seq: Seq[T]): Map[T, Int] = {
val freqs = collection.mutable.HashMap[T, Int]()
for(elem <- seq) {
val n = freqs.getOrElseUpdate(elem, 0)
freqs.update(elem, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}


When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:

val linesList = io.Source.fromPath("testfile.txt").getLines().toList
val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq

Version 1 called with linesSeq as input, performes horrlibly compared to when called with linesList. On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) The default implementation of Seq is an ArrayBuffer, and 2) Calling head and tail on an ArrayBuffer is costly. The same operations are cheap on a List. That's why Version 1 above is a performance trap.

A possible way of getting better performance, is to change the inner, two argument, freq method to use List, instead of Seq:

// Version 1.b  --- Somewhat better
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq.toList, Map())
}


Better yet --- in Scala 2.8 --- is to scrap the entire method, and call groupBy(identity).mapValues(_.length) directly on the Seq...