Update: The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.
====================================
Recently, I wrote the following two different versions for doing the same thing (compute frequencies):
// Version 1 --- Don't do this, lousy performance
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq, Map())
}
// Version 2 --- 260 times faster than Version 1 on some input
def freq[T](seq: Seq[T]): Map[T, Int] = {
val freqs = collection.mutable.HashMap[T, Int]()
for(elem <- seq) {
val n = freqs.getOrElseUpdate(elem, 0)
freqs.update(elem, n + 1)
}
// Return immutable copy of freqs
Map() ++ freqs
}
When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:
val linesList = io.Source.fromPath("testfile.txt").getLines().toList
val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq
Version 1 called with
linesSeq
as input, performes horrlibly compared to when called with linesList
. On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) The default implementation of Seq
is an ArrayBuffer
, and 2) Calling head and tail on an ArrayBuffer
is costly. The same operations are cheap on a List
. That's why Version 1 above is a performance trap.A possible way of getting better performance, is to change the inner, two argument,
freq
method to use List
, instead of Seq
:// Version 1.b --- Somewhat better
// Scala 2.8
def freq[T](seq: Seq[T]): Map[T, Int] = {
import annotation._
@tailrec
def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {
seq match {
case s if s.isEmpty => map
case s => {
val elem = s.head
val n = map.getOrElse(elem, 0) + 1
freq(s.tail, map + (elem -> n ))
}
}
}
freq(seq.toList, Map())
}
Better yet --- in Scala 2.8 --- is to scrap the entire method, and call
groupBy(identity).mapValues(_.length)
directly on the Seq
...
2 comments:
Scala 2.8 final will have more robust defaults. toSeq will return a Vector which will have reasonable complexity for both length/apply and head/tail. Vectors are essentially the same as Clojure's but with optimized appends.
Nikolaj, your 'groupBy' solution is fine. If speed is still an issue, do as indicated below. If that's not fast enough, use a mutable HashMap instead.
def countOccurences[K](s: Traversable[K]): Map[K, Int] = {
var m = Map[K, Int]()
for (elem <- s) {
m = m updated (elem, m.getOrElse(elem,0) + 1)
}
m
}
Post a Comment