Friday 29 April 2011

Testing Scala 2.9.0 (RC2) parallel collections: four extra keystrokes, double the speed

We have just tried the new parallel collections that you can find in Scala 2.9.0.RC2.

By adding .par in a few places, the software we tested ran almost twice as fast (1.9x) on a two-core processor. Running the same code on a four-core processor was, as expected, quicker still (2.7x), though not four times as fast. That's quite a performance boost, with close to zero programming effort.

The software we tested validates (electronic) pronunciation dictionaries, where each entry has an orthography, a phonetic transcription and some other data. The program runs a large number of quality checks to find problems (faulty transcriptions, inconsistencies, etc.) that are hard or impossible for a human lexicographer to spot. It runs hundreds or even thousands of validation rules, using regular expressions and other string processing, on a hundred thousand or more dictionary entries.

The software runs a sequence of validation rules on each input entry. The rules are independent of each other, which makes them well suited to running in parallel. The rules live in a Seq and are applied in a call to map(...). By changing this to .par.map(...) on the Seqs holding the validation rules, a multi-core processor can perform the validation in parallel (par returns a parallel version of a collection).
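A minimal sketch of the pattern (the Entry and Rule names here are made up for illustration, not taken from the actual software; assumes Scala 2.12, or the scala-parallel-collections module on 2.13+):

```scala
// Hypothetical types: each rule maps a dictionary entry to a list of problems.
case class Entry(orthography: String, transcription: String)
type Rule = Entry => Seq[String]

val rules: Seq[Rule] = Seq(
  e => if (e.transcription.isEmpty) Seq("empty transcription") else Seq.empty,
  e => if (e.orthography.exists(_.isDigit)) Seq("digit in orthography") else Seq.empty
)

// Sequential version: rules.map(rule => rule(entry)).flatten
// Parallel version: the only change is the extra ".par"
def validate(entry: Entry): Seq[String] =
  rules.par.map(rule => rule(entry)).flatten.seq
```

The final .seq converts the parallel result back to an ordinary sequential collection.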

Apart from using parallel collections at the point where the validation rules are run, we also run the main loop, which reads the input lexicon data, using a parallel collection. Adding parallel collections at different places (the outermost loop and inside the validation) seems to compound the performance gain.

An initial problem was that the Scala 2.9.0.RC2 API documentation fooled us into believing that foldLeft would, just like map, run in parallel. That appears to be incorrect. We had to change calls to foldLeft into calls to map (followed by a foldLeft to aggregate the results). I don't know whether I've misunderstood the documentation, or whether a parallel foldLeft is pending.
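The workaround can be sketched like this (same Scala 2.12 / parallel-collections-module assumption as above): foldLeft is left-associative and runs sequentially even on a parallel collection, so the expensive per-element work goes into map instead:

```scala
val words = Vector("foo", "bar", "quux").par

// Runs on a single thread, even though words is a ParVector:
// foldLeft's left-to-right contract prevents splitting the work.
val a = words.foldLeft(0)((acc, w) => acc + w.length)

// The per-element step runs in parallel; only the cheap
// aggregation at the end is sequential:
val b = words.map(_.length).seq.foldLeft(0)(_ + _)

// With an associative, commutative operator, fold can parallelize directly:
val c = words.map(_.length).fold(0)(_ + _)
```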

Anyway, double speed, or more, with zero effort. It sounds too good to be true, but this quick test suggests that it works like a charm.

And now I want more cores.

5 comments:

J. Suereth said...

foldLeft is an operation that can't really be parallelized. What you want is a reduce, like in map-reduce, where you can combine two elements and return a third and *ordering does not matter*. foldLeft implies a left-associative ordering.
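A small illustration of the distinction (assuming Scala 2.12, or the scala-parallel-collections module on 2.13+):

```scala
val ns = (1 to 100).par

// reduce needs only an associative operator, so the runtime is free to
// split the range, reduce each chunk, and combine the partial results
// in any grouping:
val sum = ns.reduce(_ + _)

// By contrast, foldLeft(z)(op) takes op: (B, A) => B, with the
// accumulator always on the left: evaluation is forced strictly
// left to right and cannot be split across threads.
```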

Nikolaj Lindberg said...

Ah, right. I guess I should have tried to find out what operations you can parallelize.

Thanks a lot for your explanation.

(But I still think the API documentation of foldLeft could be read as implying that it should work. An example from the nightly docs:

"Applies a binary operator to a start value and all elements of this mutable parallel sequence, going left to right.

Note: will not terminate for infinite-sized collections.

Note: might return different results for different runs, unless the underlying collection type is ordered, or the operator is associative and commutative.")

Mushtaq said...

Did you manually set the degree of parallelism using the following?

collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(parlevel: Int)

Using an appropriate level, I saw more than a 5x performance boost for non-CPU-bound operations on a dual core.

I first saw that in Alex's answer here:

http://stackoverflow.com/questions/5424496/scala-parallel-collections-degree-of-parallelism

Nikolaj Lindberg said...

Mushtaq,

no, I didn't know about the stuff in your tip. Very interesting. I will give it a try.

Thanks.