tag:blogger.com,1999:blog-38406875156156867382023-11-15T18:45:52.328+01:00Nikoloogle LindbloogleNikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.comBlogger61125tag:blogger.com,1999:blog-3840687515615686738.post-12808514013358604422017-05-30T20:30:00.000+02:002018-01-25T17:32:54.351+01:00Go: Checking what Unicode range a character belongs to<b>Update:</b> There is another, often simpler, way to find out what part of Unicode a rune belongs to than the one described below: look up the rune's Unicode character name, which typically starts with the name of its script:<br />
<br />
<pre><code>
import ("golang.org/x/text/unicode/runenames")
...
name := runenames.Name('م') //ARABIC LETTER MEEM
...
</code></pre>
<br />
<a href="https://play.golang.org/p/GQEeMyIACj_Y">https://play.golang.org</a><br />
<br />
<br />
<br />
Below is an alternative way:<br />
<br />
If you want to know what part of the Unicode table a character (<i>rune</i>) belongs to in Go, you can use the <tt>Scripts</tt> map found in the <tt>unicode</tt> package:<br />
<br />
<pre><code>
r := 'ن' // The isolated form of Arabic 'n'
for s, t := range unicode.Scripts {
    if unicode.In(r, t) {
        fmt.Println(s) // Arabic
    }
}
</code></pre>
<br />
<a href="https://play.golang.org/p/JydOLjDW_D">https://play.golang.org/</a><br />
<br />
<br />
The map <tt>unicode.Scripts</tt> contains the names of the different parts of the Unicode table, such as Latin, Greek, Arabic, Cyrillic, etc. Each such name is associated with a <tt>RangeTable</tt>, representing a subset of the Unicode character set. The <tt>unicode.In</tt> function in the snippet above checks whether a rune <tt>r</tt> is found in the RangeTable <tt>t</tt>.<br />
<br />
Checking what part of the Unicode table a character belongs to can be useful for validating that all characters of a string belong to the same script. For example, the Latin and Cyrillic scripts contain characters that look identical but are represented by different Unicode code points, such as c-с, p-р and a-а. If you mix Latin and Cyrillic characters in a string, a database search might, for instance, fail to find an expected match.
<br />
<br />
<pre><code>
c1 := 'c' // Latin
c2 := 'с' // Cyrillic
fmt.Println(c1 == c2) // false
fmt.Printf("%U\n", c1) // U+0063
fmt.Printf("%U\n", c2) // U+0441
</code></pre>
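The validation mentioned above can be sketched as follows. This is a minimal sketch, not a complete solution: the helper name scriptsOf is made up for this example, it skips everything that is not a letter, and it reports script names exactly as they appear as keys in unicode.Scripts.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// scriptsOf (a hypothetical helper) returns the sorted names of the
// scripts that the letters of s belong to. Non-letters are skipped.
func scriptsOf(s string) []string {
	seen := map[string]bool{}
	for _, r := range s {
		if !unicode.IsLetter(r) {
			continue
		}
		for name, table := range unicode.Scripts {
			if unicode.Is(table, r) {
				seen[name] = true
				break
			}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort
	return names
}

func main() {
	fmt.Println(scriptsOf("paypal"))      // [Latin]
	fmt.Println(scriptsOf("p\u0430ypal")) // [Cyrillic Latin]
}
```

A string whose result contains more than one script name is a candidate for closer inspection. Whether a mix is actually an error depends on the application.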
<br />
<a href="https://play.golang.org/p/PAyXzwFRP7">https://play.golang.org/</a>Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-18486483705985542982017-05-19T10:32:00.000+02:002017-06-20T20:16:06.174+02:00ᚠᚢᚦᚬᚱᚴ Go strings and runes: Watch out for len(str)!In the Go programming language, a string is made up of bytes, not characters. Sort of.<br />
<br />
<h3>
Beware of calling len() on a string</h3>
Consider the string <span style="font-family: inherit;">"kääntäjä"</span>. It has eight characters --- and it means 'translator' in Finnish --- but when I put this string into a Go program, and check its length using the built-in len() function, I get 12, <i>not</i> 8:<br />
<br />
<pre><code>
s := "kääntäjä"
l := len(s) // 12
</code></pre>
<br />
<a href="https://play.golang.org/p/an1WrVGNA5">https://play.golang.org/</a><br />
<br />
Calling len() on "jaa" yields the expected 3. But len("jää") returns 5!<br />
(I'm told that "jää" means 'ice' in Finnish.)<br />
<br />
Indexing into a string is a similarly unrewarding exercise:<br />
<br />
<pre><code>
s := "ä"
l := len(s)
fmt.Println(l) // 2
fmt.Println(s[0]) // 195
fmt.Println(s[1]) // 164
</code></pre>
<br />
<br />
<a href="https://play.golang.org/p/hvmHUJLNXA">https://play.golang.org/</a><br />
<br />
The single-character string "ä" seems to be made up of two different integers...?!<br />
<br />
If you are mostly interested in strings as a representation of text --- as a sequence of (alphabetic) characters --- you should not use len() this way, or index into a string as above. The reason is that what may look like a string of characters is an array of bytes, in which each byte may or may not correspond to an actual character in your string.<br />
<br />
UTF-8 uses a variable number of bytes to represent different parts of the Unicode character tables. The ASCII characters (a-z, 0-9 and a few others) only take one byte to encode, but other characters take more than one.<br />
<br />
(UTF-8 is designed so that only the leading bits of a character's first byte have to be inspected to figure out how many bytes the character is made up of.)<br />
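The leading-bits idea can be sketched like this. The function name seqLen is made up for this example, and the function only looks at the bit pattern of a single byte. It does not validate the bytes that follow, and it does not reject the few lead bytes that UTF-8 forbids.

```go
package main

import "fmt"

// seqLen (a hypothetical helper) returns how many bytes a UTF-8 sequence
// occupies, judging only by the leading bits of its first byte.
// It returns 0 for a continuation byte or an impossible lead byte.
func seqLen(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx: ASCII
		return 1
	case b&0xE0 == 0xC0: // 110xxxxx
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx
		return 3
	case b&0xF8 == 0xF0: // 11110xxx
		return 4
	}
	return 0 // 10xxxxxx (continuation byte) or invalid
}

func main() {
	s := "jää"
	for i := 0; i < len(s); i++ {
		fmt.Printf("%d %08b %d\n", i, s[i], seqLen(s[i]))
	}
	// Prints:
	// 0 01101010 1
	// 1 11000011 2
	// 2 10100100 0
	// 3 11000011 2
	// 4 10100100 0
}
```

In real code, the standard unicode/utf8 package does this kind of decoding, including validation.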
<br />
<h3>
Strings as runes (no, not the old Norse kind)</h3>
However, if you loop over a string using Go's built-in range clause, you will get the characters of the string, one by one. Or rather, the Unicode code point for each character. The snippet below loops over a string and prints the byte index and character of each rune. You can use the %c Printf verb to turn a Unicode code point into an actual character:<br />
<pre><code>
s := "jää"
for i, r := range s {
    fmt.Printf("%d %c\n", i, r)
}
// Prints:
// 0 j
// 1 ä
// 3 ä
</code></pre>
<br />
<a href="https://play.golang.org/p/_PfCSAdhxu">https://play.golang.org/</a><br />
<br />
Notice how the indices in the range loop above skip a number (from 1 to 3), since the "ä" character (rune) is made up of two bytes.<br />
<br />
The range loop turns the string into a sequence of runes. A Go "rune" should not be confused with old Scandinavian runes (ᚠᚢᚦᚬᚱᚴ, ...), but that could have been fun. A rune in Go is merely a data type that holds an integer. This integer represents a character, a Unicode code point.<br />
<br />
<pre><code> var r rune
r = 78
fmt.Printf("%c\n", r) // Prints N
</code></pre>
<br />
<a href="https://play.golang.org/p/muI04wg3Ga">https://play.golang.org/</a><br />
<br />
Notice that since a rune is just an integer, you can assign it an illegal value that does not represent an actual Unicode character: for example r = -765.<br />
Once converted to a string, an invalid code point turns into the <span style="font-family: "menlo" , monospace; font-size: 14.6667px;">�</span> character ('\ufffd'), the Unicode replacement character.<br />
<h3>
Counting characters in strings</h3>
There are different ways to count the characters (runes) of strings. One way is to convert a string into a sequence of runes:<br />
<br />
<pre><code>
s := "Motörhead play Björk"
r := []rune(s)
fmt.Println(len(s)) // 22 (Bleh!)
fmt.Println(len(r)) // 20 (Yay!)
</code></pre>
<br />
<a href="https://play.golang.org/p/_pZ340hhtc">https://play.golang.org/</a><br />
<br />
<br />
Another way to count characters is to import "unicode/utf8" and call utf8.RuneCountInString:<br />
<br />
<pre><code> utf8.RuneCountInString("Motörhead play Björk") // 20
</code></pre>
<br />
<br />
(You can also loop over a string using "range", as above, and count the characters one by one.)<br />
<h3>
Runes to string</h3>
You can convert a sequence of runes back into a string using string(runes):<br />
<br />
<pre><code> string([]rune{66, 106, 246, 114, 107}) // "Björk"
</code></pre>
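Going from string to runes and back is handy whenever you want to rearrange characters rather than bytes. As a sketch, here is a string reversal. The function name reverse is made up for this example, and it assumes that each character is a single code point, so combining accents would end up attached to the wrong letter.

```go
package main

import "fmt"

// reverse (a hypothetical helper) reverses s rune by rune, not byte by byte.
func reverse(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func main() {
	fmt.Println(reverse("Björk")) // kröjB
}
```

Reversing the bytes instead would tear the two bytes of "ö" apart and produce invalid UTF-8.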
<br />
<br />Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-76632748576181999772017-02-17T09:57:00.000+01:002017-06-20T20:16:06.178+02:00Go 1.8 sort.SliceGo version 1.8 was published February 16 2017.<br />
<br />
For me, the most noteworthy update was that a <a href="https://golang.org/pkg/sort/#Slice">sort.Slice</a> function has been added to the standard library.<br />
<br />
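A minimal sketch of sort.Slice in use. The wordFreq type, the sortByFreq helper and the numbers are made up for this example. The only thing taken from the standard library is sort.Slice itself, which takes a slice and a "less" function over indices.

```go
package main

import (
	"fmt"
	"sort"
)

// wordFreq is a hypothetical type: a word and how often it occurred.
type wordFreq struct {
	word string
	n    int
}

// sortByFreq sorts freqs in place, most frequent word first.
func sortByFreq(freqs []wordFreq) {
	sort.Slice(freqs, func(i, j int) bool {
		return freqs[i].n > freqs[j].n
	})
}

func main() {
	freqs := []wordFreq{{"jaa", 2}, {"kääntäjä", 7}, {"jää", 5}}
	sortByFreq(freqs)
	fmt.Println(freqs) // [{kääntäjä 7} {jää 5} {jaa 2}]
}
```

sort.Slice does not promise to keep equal elements in their original order. sort.SliceStable, added in the same release, does.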
Now you can sort a Go slice (list) without losing the will to live.<br />
Go 1.8 will save lives.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-11619133279360974322011-10-21T15:50:00.000+02:002011-10-21T15:50:32.419+02:00Scala blunder: appending to a Seq that is a ListI recently made a mistake in a loop reading lines from a file, doing some string manipulation and adding the result to a collection. A seemingly trivial Scala script just refused to halt.
My mistake is illustrated by the following two toy examples, adding integers to a <code>Vector</code> and a <code>Seq</code>, respectively:
<pre>var x1 = Vector[Int]()
for(i <- 0 to 100000) { x1 = x1 :+ i }
var x2 = Seq[Int]()
for(i <- 0 to 100000) { x2 = x2 :+ i }</pre>
One of the above for loops runs about 38,648 times slower than the other one (according to a single, somewhat sloppy benchmark using Scala 2.9.1).
The explanation, I believe, is that the <code>Seq</code> turned out to be backed by a <code>List</code>. Lists hate being appended to (<code>:+</code>), and this hatred manifests itself in bad performance. Good to know if you want a program to be <em>impressively</em> slow.
<p></p>
By the way, this made me think of another one:
<pre>var s1 = ""
for(i <- 0 to 100000) s1 = s1 + i
var s2 = ""
for(i <- 0 to 100000) s2 = s2.concat(i.toString)
</pre>
I don't know why you'd want to create a string like the above, but the version using <code>+</code> is about four times slower than the one using <code>concat</code> (Scala 2.9.1).Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-83560762460663309342011-09-28T13:51:00.000+02:002011-09-28T13:57:50.581+02:00Programming Scala without... anythingYou don't need all that fancy, modern stuff. A keyboard, a terminal window and the scala command are all you need:
<br />
<pre>$ scala -e 'println(io.Source.fromFile("freq_list.txt").getLines().map(_.split("\t")(0).toInt).sum)'
71213401</pre>
(Prints the sum of the frequency numbers found in the first tab-separated field of each line of the file <code>freq_list.txt</code>. The result turned out to be 71213401.)<br />
<br />
When the programs get longer, you'd better stay focused.
Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-48566600433131515472011-04-29T19:20:00.003+02:002011-04-29T19:30:31.244+02:00Testing Scala 2.9.0 (RC2) parallel collections: four extra key strokes, double speedWe have just tried the new parallel collections that you can find in Scala <a href="http://www.scala-lang.org/node/9314">2.9.0.RC2</a>. <br />
<br />
By adding <code>.par</code> in a few places, the software we tested ran almost twice as fast (1.9 x) on a two-core processor. Running the same code on a four-core processor was, as expected, quicker still (2.7 x), but not four times as fast. That's quite a performance boost, with close to zero programming effort.<br />
<br />
The software we've tested validates (electronic) pronunciation dictionaries, where each entry has an orthography, a phonetic transcription and some other stuff. The program runs a large number of quality checks to find problems (faulty transcriptions, inconsistencies, etc) that are hard or impossible for a human lexicographer to find. It runs hundreds or even thousands of validation rules, using regular expressions and other string processing, on a hundred thousand or more dictionary entries.<br />
<br />
The software runs a sequence of validation rules on each input entry. The validation rules are independent of each other, suitable for running in parallel. The rules, living in a <code>Seq</code>, are applied in sequence in a call to <code>map(...)</code>. By calling <code>.par.map(...)</code> on the <code>Seqs</code> holding the validation rules, a multi-core processor is now able to perform the validation in parallel (<code>par</code> returns a parallel version of a collection).<br />
<br />
Apart from using parallel collections at the point where the validation rules are run, we also run the main loop, reading the input lexicon data, using a parallel collection. Adding parallel collections at different places (the outermost loop and inside the validation) seems to add to the performance gain.<br />
<br />
An initial problem was that the Scala 2.9.0.RC2 API documentation fooled us into believing that <code>foldLeft</code> would, just like <code>map</code>, run in parallel. That appears to be incorrect. We had to change calls to <code>foldLeft</code> into calls to <code>map</code> (followed by an additional <code>foldLeft</code> to aggregate the result). I don't know if I've misunderstood the documentation, or if parallel <code>foldLeft</code> is pending.<br />
<br />
Anyway, double speed, or more, with zero effort. It sounds too good to be true, but this quick test suggests that it works like a charm.<br />
<br />
And now I want more cores.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com5tag:blogger.com,1999:blog-3840687515615686738.post-9658727492555226212010-09-15T10:20:00.000+02:002010-09-15T10:20:13.549+02:00Interview with Maxime Lévesque, author of Squeryl<div><a href="http://squeryl.org/">Squeryl</a> is a great Scala database API. On its website, it is described like this: "A Scala ORM and DSL for talking with Databases with minimum verbosity and maximum type safety".</div><div><br />
</div><div>Preparing an introduction to Squeryl for a Swedish computer magazine, I sent a number of questions to Maxime Lévesque, the man behind Squeryl. The answers were so interesting, that I asked his permission to post them here:<br />
<b><span class="Apple-style-span" style="font-weight: normal;"><br />
</span></b><br />
<b><span class="Apple-style-span" style="font-weight: normal;"><br />
</span></b></div><div><div><b>Could you describe yourself in a few words?</b></div><div><br />
</div><div>I'm a dad, a programmer, a hobbyist bass player and percussionist.</div><div><br />
</div><div>I'm the kind of programmer who prefers to write libraries and frameworks to writing applications. If I was in the construction industry I'd probably be making bricks, mortar and nails rather than houses.</div><div><br />
</div><div><br />
</div><div><b>Do you develop Squeryl as part of your work, or is it a hobby?</b></div><div><br />
</div><div>Squeryl started as a hobby, only later did I start using it in a commercial project.</div><div></div><div><br />
<br />
</div><div><b>What are the most important features of Squeryl? Why should you use it?</b></div><div><br />
</div><div>The main reason to use Squeryl in an application, in my opinion, is to have the data access code validated by the compiler. I've seen many projects where the database schema stops evolving after a lot of code has been written against it. Ugly workarounds are sometimes chosen because there isn't enough time to investigate the repercussions of a schema change or conduct all the testing required.</div><div><br />
</div><div>Strongly typed languages are good for "deterministic refactoring". A data access layer needs to be refactorable, as any part of a system does. Perhaps to an even greater extent, because in a sense, bad design decisions get persisted with the data.</div><div><br />
</div><div>A developer needs all the help he can get from tools such as compilers and IDEs. Hard work and discipline don't scale. Why rely on it when you can have automated validation?</div><div><br />
</div><div>Reusability is another big one. Squeryl queries are composable, reusable pieces of code. A query that encodes a particular piece of application logic needs only be written once, and reused anywhere it is needed. I'm a big believer in the DRY principle (Don't Repeat Yourself).</div><div><br />
</div><div>Low verbosity would be another strength. I dislike APIs or frameworks that require you to write more than you should.</div><div></div><div><br />
<br />
</div><div><b>What's the story behind Squeryl?</b></div><div><br />
</div><div>In 2005 I wrote an ORM for dotNet. I was in need of one at the time and I couldn't find a decent one that exploited generics and annotations, so I wrote my own. By the time I considered publishing it, LINQ came out, and instantly obsoleted my ORM (and all other ORMs except HaskellDB in my opinion).</div><div><br />
</div><div>A few years later I started to write a query DSL in Java, and at every step, I got bitten by language limitations. Every time I worked around them, the solution became a bit more ugly and verbose. I then discovered Scala, and started experimenting with writing a statically typed query DSL. I was amazed by the expressivity of the language.</div><div><br />
</div><div>The fact that it was possible to write Squeryl as a library (i.e., without a compiler plug-in) speaks a lot about the potency of the language. The first two attempts were abandoned when they reached a critical level of inelegance. They were Squeryl's pre-history.</div><div><br />
</div><div>Squeryl is in fact my third attempt at a Scala ORM. When I became confident that a fourth rewrite wouldn't be necessary, I published it on GitHub.</div><div><br />
</div><div><br />
</div><div><b>If Squeryl didn't exist, what would you use?</b></div><div><br />
</div><div>If Squeryl didn't exist, I'd have a look at ScalaQuery or Circumflex. I only have a superficial knowledge of them, but I would surely try them out before going to any of the Java based ORMs.</div><div><br />
</div><div><br />
</div><div><b>If you are to demo Squeryl (e.g., to a Java programmer), do you have a favourite example?</b></div><div><br />
</div><div>Here's a one liner that says a lot :</div><div><br />
</div><div><pre><span class="Apple-style-span" style="color: #741b47;">val</span> <span class="Apple-style-span" style="color: #bf9000;">avgHeight</span>: <span class="Apple-style-span" style="color: #38761d;">Option[Float]</span> =
from(people)(p <span class="Apple-style-span" style="color: #990000;">=> </span>compute(avg(p.heightInCentimeters)))</pre></div><div><br />
</div><div>Apart from the shortness of the code, we can see a few implicit conversions at work. The compiler "knows" that the avg query can translate into a 32-bit floating point value, but it also "knows" that it is an Option[], because the avg aggregate function is not guaranteed to return something (the table can be empty). In fact it won't compile if you try to refer to it as a (non Option[]) Float.</div><div></div><div><br />
<br />
</div><div><b>Where has Squeryl turned up? Who uses it?</b></div><div><br />
</div><div>I haven't made any survey, it's on my todo list, but I've exchanged emails with developers that are building systems with Squeryl in fields ranging from finance to bioinformatics.</div><div><br />
</div><div><br />
</div><div><b>I read something about Lift...?</b> </div><div><br />
</div><div>Ross Mellgren from the Lift team has written an integration module that is part of Lift 2.1 (release candidate).</div><div><br />
</div><div><br />
</div><div><b>What's on the roadmap?</b></div><div><br />
</div><div>High on my priority list is free text search (backed by Lucene). Longer term I'd like to add things like support for sharding and extending the DSL to exploit the geospatial capabilities of databases like Postgres, Oracle and H2.</div><div><br />
</div><div><br />
</div><div><b>Is it of any importance that Squeryl was written in Scala? Or was this merely a coincidence?</b></div><div><br />
</div><div>Without Scala there wouldn't be a way to have strongly typed queries on the JVM without having verbosity that reaches a caricatural level. Not only wouldn't there be Squeryl, but there wouldn't be anything like it.</div><div><br />
</div><div>When Java came out I was impressed with all the features it had built in: serialization, RMI, garbage collection, portability. It was in its time a game changing technology. Today I have the same impression of Scala: the level of static validation that it gives you, all this with minimal verbosity. If I could say just one thing to qualify it, I'd have to say: game changing.</div><div><br />
</div><div>So the answer is yes, Scala made Squeryl possible. I expect a lot of interesting Scala DSLs will get written in many domains in the coming years. I have a few other DSLs I'd like to write myself.</div><div></div><div><br />
<br />
</div><div><b>Any particular advice for someone beginning with Squeryl?</b></div><div><br />
</div><div>I would just copy an example from the Squeryl site, and modify it gradually so that it becomes your own schema. And most importantly, don't hesitate to ask questions in the discussion groups. I'm often impressed by the quality of the answers given by the community.<br />
<br />
<br />
<b>Thanks a lot for the great answers!</b></div></div>Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-6780095147125302202010-05-06T15:46:00.013+02:002010-05-06T16:38:35.852+02:00Using the Scala REPL to tell the difference between ЕКАТEРИНБУРГ and ЕКАТЕРИНБУРГSometimes, one runs into UTF-8 strings with characters from different code blocks. This is problematic in cases where the fonts look the same, but the characters are different. The Scala REPL is handy for finding out what Unicode block each character in a string belongs to. Let's use "ЕКАТEРИНБУРГ" and "ЕКАТЕРИНБУРГ" as examples:<br /><br /><pre>scala> <span style="color: rgb(0, 102, 0);">"ЕКАТEРИНБУРГ"</span> == <span style="color: rgb(0, 102, 0);">"ЕКАТЕРИНБУРГ"</span><br />res0: Boolean = false<br /><br />scala> import java.lang.Character.UnicodeBlock<br />import java.lang.Character.UnicodeBlock<br /><br />scala> <span style="color: rgb(0, 102, 0);">"ЕКАТEРИНБУРГ"</span>.foreach(c => println(c +<span style="color: rgb(0, 102, 0);">"\t"</span>+ UnicodeBlock.of(c)))<br />Е CYRILLIC<br />К CYRILLIC<br />А CYRILLIC<br />Т CYRILLIC<br />E BASIC_LATIN<br />Р CYRILLIC<br />И CYRILLIC<br />Н CYRILLIC<br />Б CYRILLIC<br />У CYRILLIC<br />Р CYRILLIC<br />Г CYRILLIC<br /><br />scala> <span style="color: rgb(0, 102, 0);">"ЕКАТЕРИНБУРГ"</span>.foreach(c => println(c +<span style="color: rgb(0, 102, 0);">"\t"</span>+ UnicodeBlock.of(c)))<br />Е CYRILLIC<br />К CYRILLIC<br />А CYRILLIC<br />Т CYRILLIC<br />Е CYRILLIC<br />Р CYRILLIC<br />И CYRILLIC<br />Н CYRILLIC<br />Б CYRILLIC<br />У CYRILLIC<br />Р CYRILLIC<br />Г CYRILLIC<br /><br />scala><br /></pre>The REPL exposed one of the seemingly identical strings to be an unhealthy mix of Latin and Cyrillic characters. 
Thanks, REPL.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-45242286232763295542010-04-11T00:31:00.053+02:002010-04-12T10:58:14.422+02:00A tiny Scala case class to clean up user inputWe needed some cleaning up of user input entered into a text field. We ended up with a Scala case class that cleans up its constructor string argument a bit, by removing multiple whitespace characters and trimming it. It behaves like this:<br /><pre>scala> Text(" a a ") == Text("a a") <br />res0: Boolean = true<br />scala> Text(" a a ").text == Text("a a").text <br />res1: Boolean = true<br />scala> Text(" a a ").text <br />res2: java.lang.String = a a<br /></pre><br />The code looks like this:<br /><pre>case class Text(private var _text: String) {<br /> val text = _text.trim.replaceAll(" +", " ")<br /> _text = text<br />}</pre>Since the input string, <code>var _text</code>, is private, we can manipulate it a bit, without making it possible for others to tamper with. I'm not sure if this is the obvious way to do it, but it seems to work as intended.<br /><br />We tried a similar version that did not work:<pre>// Doesn't work<br />case class BrokenText(private var _text: String) {<br /> _text = _text.trim.replaceAll(" +", " ")<br /> val text = _text<br />}</pre>This version does not work since <code>Text.text</code> will return the original string, not the cleaned up one:<pre>scala> BrokenText(" a b ")<br />res0: BrokenText = BrokenText(a b)<br />scala> res0.text<br />res1: String = a b <br />scala></pre>Why the second version doesn't work? Beats me. (But I'm sure the answer will turn out to be obvious.)<br /><br /><span style="font-weight: bold;">Update:</span> See the two anonymous comments below: one answering my question above, the other one suggesting a neater way of handling it. 
Thanks.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-87059298120829203692010-02-04T09:50:00.003+01:002010-02-04T11:17:16.369+01:00Scala: Getting into performance trouble, calling head and tail on an ArrayBuffer<span style="font-weight: bold;">Update:</span> The performance problem described below will be remedied in the final release of Scala 2.8. See martin's comment.<br /><br />====================================<br /><br />Recently, I wrote the following two different versions for doing the same thing (compute frequencies):<br /><pre>// Version 1 --- Don't do this, lousy performance<br />// Scala 2.8<br />def freq[T](seq: Seq[T]): Map[T, Int] = { <br /> import annotation._<br /> @tailrec<br /> def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {<br /> seq match {<br /> case s if s.isEmpty => map<br /> case s => {<br /> val elem = s.head<br /> val n = map.getOrElse(elem, 0) + 1<br /> freq(s.tail, map + (elem -> n ))<br /> }<br /> }<br /> }<br /> freq(seq, Map())<br />}<br /><br />// Version 2 --- 260 times faster than Version 1 on some input<br />def freq[T](seq: Seq[T]): Map[T, Int] = {<br /> val freqs = collection.mutable.HashMap[T, Int]()<br /> for(elem <- seq) { <br /> val n = freqs.getOrElseUpdate(elem, 0) <br /> freqs.update(elem, n + 1) <br /> } <br /> // Return immutable copy of freqs <br /> Map() ++ freqs<br />}</pre><br /><br />When comparing the two versions, it turned out that for some input, Version 1 was about 260 times slower (after JVM warm-up). The performance difference surfaced when both versions were called with the following different inputs:<br /><br /><pre>val linesList = io.Source.fromPath("testfile.txt").getLines().toList<br />val linesSeq = io.Source.fromPath("testfile.txt").getLines().toSeq</pre><br />Version 1 called with <code>linesSeq</code> as input, performes horrlibly compared to when called with <code>linesList</code>. 
On my own, I couldn't figure out why, but helpful and knowledgeable people at #scala solved my problem in a few seconds. The explanation appears to be that 1) The default implementation of <code>Seq</code> is an <code>ArrayBuffer</code>, and 2) Calling head and tail on an <code>ArrayBuffer</code> is costly. The same operations are cheap on a <code>List</code>. That's why Version 1 above is a performance trap.<br /><br />A possible way of getting better performance, is to change the inner, two argument, <code>freq</code> method to use <code>List</code>, instead of <code>Seq</code>:<br /><br /><pre>// Version 1.b --- Somewhat better<br />// Scala 2.8<br />def freq[T](seq: Seq[T]): Map[T, Int] = { <br /> import annotation._<br /> @tailrec<br /> def freq(seq: List[T], map: Map[T, Int]): Map[T, Int] = {<br /> seq match {<br /> case s if s.isEmpty => map<br /> case s => {<br /> val elem = s.head<br /> val n = map.getOrElse(elem, 0) + 1<br /> freq(s.tail, map + (elem -> n ))<br /> }<br /> }<br /> }<br /> freq(seq.toList, Map())<br />}</pre><br /><br />Better yet --- in Scala 2.8 --- is to scrap the entire method, and call <code>groupBy(identity).mapValues(_.length)</code> directly on the <code>Seq</code>...Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-48782266497788089712010-02-01T22:26:00.041+01:002010-02-03T15:15:58.418+01:00Counting Strings and Things in Scala (2.8)I often need to count the frequencies of strings ("words", typically). Below are a few Scala snippets for counting strings and things. 
(Don't miss the last one.)<br /><br /><span style="font-weight: bold;">First try</span><br /><br />Let's start with a method for counting string frequencies in a list:<br /><pre>// Scala 2.8<br />def freq(wds: List[String]): Map[String, Int] = { <br /> import annotation._<br /> @tailrec<br /> def freq(wds: List[String], map: Map[String, Int]): Map[String, Int] = {<br /> wds match {<br /> case l if l.isEmpty => map<br /> case l => {<br /> val elem = l.head<br /> val n = map.getOrElse(elem, 0) + 1<br /> freq( l.tail, map + (elem -> n ) )<br /> }<br /> }<br /> }<br /> freq(wds, Map())<br />}</pre><br />It takes a list of strings, and returns a map (hash table) with a frequency count for each unique string. The one argument freq method contains an embedded two argument freq method. The second method recursively consumes elements of the list, incrementing the frequency count of the second accumulator argument. The two argument method is initialised with an empty map (at the end of the one argument method, <code>freq(wds, Map()</code>).<br /><br />In each recursion, a new, immutable word frequency map is produced, with the incremented frequency count. The <pre>import annotation._<br />@tailrec</pre> part tells the compiler to check whether it can optimize the tail recursive call or not. (The Scala compiler can optimize a special case of tail recursion.)<br /><br />If the recursion makes you dizzy, you can use a mutable <code>HashMap</code> instead:<br /><pre>def freq(wds: List[String]): Map[String, Int] = {<br /> val freqs = collection.mutable.HashMap[String, Int]()<br /> for(w <- wds) { <br /> val n = freqs.getOrElseUpdate(w, 0) <br /> freqs.update(w, n + 1) <br /> } <br /> // Return immutable copy of freqs <br /> Map() ++ freqs <br />}</pre><br /><span style="font-weight: bold;">Second try</span><br /><br />You'll soon find out that the above code is limited, since it only accepts <code>List</code> input. 
There is a more general concept, <code>Seq</code>, that will make it possible to call <code>freq</code> with different kinds of sequences (lists, listbuffers, arrays):<br /><pre>// Scala 2.8<br />def freq(wds: Seq[String]): Map[String, Int] = { <br /> import annotation._<br /> @tailrec<br /> def freq(wds: Seq[String], map: Map[String, Int]): Map[String, Int] = {<br /> wds match {<br /> case l if l.isEmpty => map<br /> case l => {<br /> val elem = l.head<br /> val n = map.getOrElse(elem, 0) + 1<br /> freq( l.tail, map + (elem -> n ) )<br /> }<br /> }<br /> }<br /> freq(wds, Map())<br />}</pre><br /><span style="font-weight: bold;">Third try</span><br /><br />One day you find yourself relocated from the word counting department to the character counting department. A string is a sequence, but of <code>Chars</code>, not <code>Strings</code>. The code above will not help you count character frequencies. Here is an attempt at generalising the code further, to make it able to count the frequencies of any thing, <code>T</code>, not just <code>String</code>:<br /><pre>// Scala 2.8<br />def freq[T](seq: Seq[T]): Map[T, Int] = { <br /> import annotation._<br /> @tailrec<br /> def freq(seq: Seq[T], map: Map[T, Int]): Map[T, Int] = {<br /> seq match {<br /> case s if s.isEmpty => map<br /> case s => {<br /> val elem = s.head<br /> val n = map.getOrElse(elem, 0) + 1<br /> freq(s.tail, map + (elem -> n ))<br /> }<br /> }<br /> }<br /> freq(seq, Map())<br />}</pre><br /><br /><br />Here's the more general non-tail-recursive version:<br /><pre>def freq[T](seq: Seq[T]): Map[T, Int] = {<br /> val freqs = collection.mutable.HashMap[T, Int]()<br /> for(elem <- seq) { <br /> val n = freqs.getOrElseUpdate(elem, 0) <br /> freqs.update(elem, n + 1) <br /> } <br /> // Return immutable copy of freqs <br /> Map() ++ freqs <br />}</pre><br />Hooray.<br /><br /><span style="font-weight: bold;">Last try</span> (shamelessly lifted from someone at #scala)<br /><br />But... 
you can still do better than this. A while ago, someone on the #scala irc channel (unfortunately, I don't remember this persons name) answered a question on how to associate each integer in a sequence with the number of times each integer occurred (or something like that). It turns out that, in Scala 2.8, it is possible to write a frequency counting thing even more compactly:<br /><pre>def freq[T](seq: Seq[T]) = seq.groupBy(x => x).mapValues(_.length)</pre>It's so short, that it's almost not worth defining a method/function for it. You can simply call <code>.groupBy(x => x).mapValues(_.length)</code> directly on your <code>Seq</code>. (Or <code>groupBy(identity).mapValues(_.length)</code>, which is the same thing.)<br /><br />Double hooray. <br /><br />Benchmarking is tricky, but a small test indicates that the last, most beautiful, version is also the quickest, and that the recursive ones using only immutable maps (in some situations) are quite slow.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com4tag:blogger.com,1999:blog-3840687515615686738.post-78757624719107000842009-11-22T16:33:00.023+01:002009-11-24T12:25:09.343+01:00Beware! scala.swing.TextField proclaims EditDone when it isn't<span style="font-weight: bold;">Update:</span> Forget about <code>EditDone</code>. See <span style="font-weight: bold;">Update</span> below!<br /><br /><code>scala.swing.TextField</code> is a basic GUI component that can be used for<br />letting the user input a line of text. When listening to this component, one can react to an <code>EditDone</code> event:<br /><pre>// Inside some GUI component ...<br />val textField = new TextField(20)<br />contents += textField<br />listenTo(textField)<br /><br />reactions += {case EditDone(`textField`) =><br /> println("Ok, searching DB for input "+ textField.text)<br />}<br />//...</pre><br /><br />Fine. 
Whenever the user (me) hits the Enter key, the message, "Ok,<br />searching DB for input ...", simulating a database search, is printed.<br /><br />However, what happens when some unrelated software product suddenly<br />pops up a window while the user (me) is still inputting text into the<br />TextField? I'll tell you what: The evil, non-sentient contraption prints<br />the simulated search message --- just as if I had hit Enter.<br /><br />When the TextField loses focus, it emits an EditDone event. But I'm<br />not done editing. I've only typed "a". I was about to type<br />"abecedarian". Now the silly thing will search the database for all<br />words containing the letter "a". I never told it to do that. This<br />happened just because some other, unrelated, ill-behaved program<br />grabbed the focus.<br /><br />Of course, the focus may also be lost because the user voluntarily<br />changes windows (for instance, in order to Google for "abecedarian").<br /><br />As far as I can tell, there is no sane way to tell an EditDone event<br />produced by the user (me) hitting Enter from an EditDone event<br />produced because the TextField component lost focus. This cannot be<br />right.<br /><br />(A while ago, I asked about this on the Scala-user list. Not one single<br />answer from one single soul in the entire Universe. It feels lonely.)<br /><br />(I'm using Scala 2.8.)<br /><br /><span style="font-weight: bold;">Update:</span> Forget about <code>EditDone</code>.<br /><br />What you should do is listen not to the <code>TextField</code> itself, but to <code>TextField.keys</code>. This way, you'll be able to catch a <code>KeyPressed</code> event, and check if the key pressed was <code>Enter</code>. Simple.<br /><br />It's a bit tricky to figure out, however, since it's not in the <code>TextField</code> Scala docs (you'll have to find your way to <code>scala.swing.Component</code>). 
This is how it could look:<br /><pre>import swing._<br />import event._<br /><br />//...<br /><br />// Inside some GUI component ...<br />val textField = new TextField(20)<br />contents += textField<br /><br />listenTo(textField.keys)<br /><br />import Key._<br />reactions += {case KeyPressed(`textField`, Enter, _, _) =><br /> println("Ok, searching DB for input "+ textField.text)<br />}<br />//...</pre><br /><br />Thanks to Ingo Maier for <a href="http://old.nabble.com/Re%3A-Re%3A-swing.TextField-generates-EditDone-event-when--losing-focus-p26487967.html">explaining</a> this.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com3tag:blogger.com,1999:blog-3840687515615686738.post-68785076901611514922009-11-16T15:10:00.018+01:002009-11-16T16:01:19.210+01:00Source's getLines in Scala 2.8 now strips line endIn Scala 2.8 (not yet officially released), <code>scala.io.Source</code> has been updated.<br /><br />When reading lines from a file, you no longer need to trim the lines, since newlines are removed by default. The code to read lines from a file using <code>Source</code> may now look something like this (where <code>fName</code> is a file name (a string)):<br /><pre style="color: rgb(51, 0, 0);">val lines = io.Source.fromPath(fName) getLines()</pre><br />If you want to specify the input file encoding to be UTF-8, you could try this:<br /><pre><span style="color: rgb(51, 0, 0);">val lines = io.Source.fromPath(fName)(</span><span style="color: rgb(0, 102, 0);">"UTF8"</span><span style="color: rgb(51, 0, 0);">) getLines()</span></pre><br />When you look at the API documentation, you'll find that <code>fromPath</code> takes a <code>Codec</code> as a second implicit parameter. 
Through some mysterious conversion (or "implicit conversion"), you can call it with a string (<span style="color: rgb(0, 102, 0);">"UTF8"</span>) instead, as in the example above.<br /><br />Anyway, no more <code>Source.fromFile(fName).getLines.map(_.stripLineEnd)</code>. Someone is improving Scala!Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com4tag:blogger.com,1999:blog-3840687515615686738.post-22430522342289288652009-08-17T11:05:00.004+02:002009-08-17T11:32:51.680+02:00Cracker: "Tired of coding Perl"Perl is doomed.<br /><br />Even <a href="http://en.wikipedia.org/wiki/Cracker_%28band%29">Cracker</a> is tired of it. The evidence is found in a song on their latest album, where they sing "I'm tired of coding Perl, tired of V.B.A."<br /><br />Take a look some 30 seconds into the video. The guy browsing the mod_perl Developer's cookbook is not happy. The "Turn on, tune in, drop out with me" video is <a href="http://www.youtube.com/watch?v=HyxIrfjla88">here</a>.<br /><br />Perl is doomed.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-73085230689075010642009-07-29T16:00:00.008+02:002009-07-30T09:45:14.681+02:00Scala case classes don't have auxiliary constructors?The lesson of today, is that Scala case classes don't appear to have auxiliary constructors.<br /><br />In Scala, auxiliary constructors may be added to a class by defining a "<code>this</code>" method:<br /><pre><br />scala> class AClass(s1: String, s2: String) {<br /> def this(s: String) = this(s, "default")<br />}<br />defined class AClass<br /><br />scala> new AClass("hey")<br />res0: AClass = AClass@187b5ff<br /></pre><br /><br />Look what happens when you try the same trick on a case class:<pre><br />scala> case class ACaseClass(s1: String, s2: String) {<br /> def this(s: String) = this(s, "default")<br />}<br />defined class ACaseClass<br /><br />scala> ACaseClass("hey")<br 
/><console>:7: error: wrong number of arguments for method apply: (String,String)ACaseClass in object ACaseClass<br />ACaseClass("hey")<br />^<br /></console></pre><br /><br />The attempt at adding an auxiliary constructor compiles, but the call <code>ACaseClass("hey")</code> is then rejected with the error above.<br /><br /><span style="font-weight: bold;">Update:</span> Oops, yes, they can have auxiliary constructors --- see the comment below, by jkriesten, straightening things out!<br /><br /><span style="font-weight: bold;">Update:</span> Paul (see comment below) points to the following discussion on this topic: <a href="http://www.scala-lang.org/node/976">http://www.scala-lang.org/node/976</a>.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com4tag:blogger.com,1999:blog-3840687515615686738.post-38313851146863649572009-06-09T21:53:00.024+02:002009-06-09T22:38:52.208+02:00Printing the Unicode code points of UTF8 characters (Scala)Sometimes it is useful to be able to print the Unicode code point of a UTF8 character. 
(For instance, when you need to check if you mistakenly use a similar looking character instead of the one you're supposed to use.)<br /><br />Using Scala's RichString's format method, you can create a string of a zero padded, four digit, hexadecimal Unicode number, for example of the <code>'ä'</code> character, like this:<br /><pre>scala> <span style="color: rgb(102, 0, 0);">"%04X"</span><span style="color: rgb(0, 51, 0);">.format('ä'.toInt)</span><br />res0: String = 00E4<br /><br />scala></pre><br /><br />Here's a related example, printing a tab separated list of some IPA (phonetic) characters and their Unicode code points in a format suitable for using in Scala/Java strings:<br /><pre>scala> <span style="color: rgb(102, 0, 0);">"ɸβfvθðszʃʒʂʐçʝxɣχʁħʕʜ"</span>\<br /><span style="color: rgb(0, 51, 0);">.map(c =></span> <span style="color: rgb(102, 0, 0);">"%s\t\\u%04X"</span><span style="color: rgb(0, 51, 0);">.format(c, c.toInt))</span>\<br /><span style="color: rgb(0, 51, 0);">.foreach(println)</span><br />ɸ \u0278<br />β \u03B2<br />f \u0066<br />v \u0076<br />θ \u03B8<br />ð \u00F0<br />s \u0073<br />z \u007A<br />ʃ \u0283<br />ʒ \u0292<br />ʂ \u0282<br />ʐ \u0290<br />ç \u00E7<br />ʝ \u029D<br />x \u0078<br />ɣ \u0263<br />χ \u03C7<br />ʁ \u0281<br />ħ \u0127<br />ʕ \u0295<br />ʜ \u029C<br /><br />scala></pre>(The line terminating backslashes in the Scala code are added to indicate the fact that the above is a one-liner that doesn't fit the page. Remove these and the newlines if you want to run the code in the Scala shell.)<br /><br />Knowing the codepoints can be useful, e.g. 
when you don't want to or can't input non-ASCII characters into your code:<br /><pre>scala> <span style="color: rgb(0, 51, 0);">var v = </span><span style="color: rgb(102, 0, 0);">"\u0278"</span><br />v: java.lang.String = ɸ<br /><br />scala></pre><br /><br /><br />In Java, it looks similar, but you have to cast your chars to ints:<br /><br /><code><span style="color: rgb(0, 51, 0);">String.format(</span><span style="color: rgb(102, 0, 0);">"%04X"</span><span style="color: rgb(0, 51, 0);">, (int) 'ä')</span></code>, etc.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-18161136542856939162009-03-24T09:26:00.033+01:002009-03-24T22:05:53.725+01:00The perils of changing the case of UTF8 stringsBelow are a few examples of what happens to some just slightly exotic UTF8 strings when up-cased and then down-cased again. The German ß (Eszett) doesn't have an uppercase variant, and becomes two characters. The Greek Sigma has one uppercase variant, but two different lowercase versions: one word-final (ς), one for other positions (σ) (explaining my not-so-very-amusing <a href="http://nikolajlindberg.blogspot.com/2009/03/scala-reversing-string-by-up-and.html">joke</a> in an earlier post).<br /><br />In the table below, you'll find two other Greek lowercase characters that don't like to be up-cased, ΰ and ΐ. These two characters ultimately become six (see the length columns).<br /><br />Last come the Turkish variants of the letter <code>i</code>, always trusty when it comes to creating confusion (in a computer). The last but one row is interesting, since the original string is severely damaged. In the last row, the proper locale ("tr") is used, and the same string ends up in a much better condition.<br /><br />The table was generated using Scala (thus Java) strings. 
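<br /><br />As a rough illustration of such a round trip (a sketch only, not the code that produced the table; the names <code>CaseRoundTrip</code> and <code>upDown</code> are made up for this example), something like the following could be used. It relies only on the standard <code>java.lang.String</code> methods <code>toUpperCase(Locale)</code>, <code>toLowerCase(Locale)</code> and <code>equalsIgnoreCase</code>:<br />

```scala
import java.util.Locale

// Hypothetical helper for illustration; not part of the original post.
object CaseRoundTrip {
  // Up-case and then down-case a string, using the given locale for both steps.
  def upDown(s: String, locale: Locale): String =
    s.toUpperCase(locale).toLowerCase(locale)

  def main(args: Array[String]): Unit = {
    val turkish = Locale.forLanguageTag("tr")
    val rows = Seq(("ß", Locale.ROOT), ("iİıI", turkish))
    for ((orig, locale) <- rows) {
      val roundTrip = upDown(orig, locale)
      // Columns: Orig, UpDown, EqIgnoreCase, OrigLen, NewLen
      println(s"$orig\t$roundTrip\t${orig.equalsIgnoreCase(roundTrip)}\t${orig.length}\t${roundTrip.length}")
    }
  }
}
```

With <code>Locale.ROOT</code>, the single character ß comes back as the two characters <code>ss</code>; passing the Turkish locale is what separates the last row of the table from the one before it.<br /><br />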
The column <code>EqIgnoreCase</code> reports the result of comparing the original string and the up-cased and then down-cased version of that string using Scala's/Java's <code>equalsIgnoreCase</code>. The two rightmost columns present the length of the string before and after changing the case up and down again.<br /><br /><div><table style="background-color:black;" border="0" cellpadding="3" cellspacing="1"><thead><tr><th align="left"><span style="color:white;"><b>Orig</b></span></th><th align="left"><span style="color:white;"><b>UpCase ↑</b></span></th><th align="left"><span style="color:white;"><b>UpDown ⇅</b></span></th><th align="right"><span style="color:white;"><b>EqIgnoreCase</b></span></th><th align="right"><span style="color:white;"><b>OrigLen</b></span></th><th align="right"><span style="color:white;"><b>NewLen</b></span></th></tr></thead><tbody style="background-color: white;"><tr><td>ß</td><td>SS</td><td>ss</td><td>false</td><td align="right">1</td><td align="right">2</td></tr><tr><td>ςσ</td><td>ΣΣ</td><td>σς</td><td>true</td><td align="right">2</td><td align="right">2</td></tr><tr><td>ΰΐ</td><td>Ϋ́Ϊ́</td><td>ΰΐ</td><td>false</td><td align="right">2</td><td align="right">6</td></tr><tr><td>iİıI</td><td>IİII</td><td>iiii</td><td>true</td><td align="right">4</td><td align="right">4</td></tr><tr><td>iİıI</td><td>İİII</td><td>iiıı</td><td>true</td><td align="right">4</td><td align="right">4</td></tr></tbody></table></div><br /><br /><div style="line-height:1.6">The lesson? Nothing special. That you can do terrible things to strings. That changing the case of strings may be an irreversible operation. That if you are to normalize some text into either lower or uppercase, you might need to decide what's most suitable for a given language. That it might be a good idea to keep the original strings after normalization. That using the correct locale might help. 
That I'm not a graphic designer (the table is hideous).</div>Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-58585500047644516022009-03-08T00:14:00.005+01:002009-03-08T00:23:21.460+01:00Scala: Reversing a string by up- and then downcasing itDid you know that you can reverse a string by merely upcasing it and then downcasing it again? Here's an example:<br /><pre>scala> <span style="color: rgb(102, 0, 0);">val s =</span> <span style="color: rgb(0, 102, 0);">"ςσσ"</span><br />s: java.lang.String = ςσσ<br /><br />scala> <span style="color: rgb(102, 0, 0);">s.toUpperCase.toLowerCase == s.reverse.toString</span><br />res0: Boolean = true<br /><br />scala></pre><br />If you don't believe me, just copy and paste the two lines of code above into the Scala interpreter, and see for yourself.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-15512956099636266002009-03-05T11:26:00.014+01:002009-03-09T12:19:08.140+01:00The Firebird database: Problem handling UTF8 charactersThe 'Latin capital letter I with dot above', İ (U+0130), strikes again! This innocent-looking Turkish character seems to be reliable when it comes to breaking software that should be able to handle UTF8. (See also <a href="http://nikolajlindberg.blogspot.com/2008/03/beware-of-java-comparing-turkish.html">this post</a> for a Java example.)<br /><br />This time it breaks the <a href="http://www.firebirdsql.org/">Firebird</a> database (in my case, v2.1.1 on a 64-bit Debian system). 
Downcasing some random characters in a database configured to handle UTF8 works fine:<br /><br /><code><span style="color: rgb(102, 0, 0);">SELECT LOWER(<span style="color: rgb(0, 102, 0);">'</span></span><span style="color: rgb(0, 102, 0);">AӴЁΪΣƓ</span><span style="color: rgb(102, 0, 0);"><span style="color: rgb(0, 102, 0);">'</span>) FROM RDB$DATABASE</span></code><br /><br />returns the expected string, <code style="color: rgb(0, 102, 0);">aӵёϊσɠ</code>.<br /><br />However, when you throw in the trouble-making <code style="color: rgb(0, 102, 0);">İ</code>, everything blows up:<br /><pre><span style="color: rgb(102, 0, 0);">SELECT LOWER(</span><span style="color: rgb(0, 102, 0);">'AӴЁΪΣƓİ'</span><span style="color: rgb(102, 0, 0);">) FROM RDB$DATABASE</span><br />*** IBPP::SQLException ***<br />Context: Statement::Fetch<br />Message: isc_dsql_fetch failed.<br /><br />SQL Message : -104<br />Invalid token<br /><br />Engine Code : 335544849<br />Engine Message :<br />Malformed string</pre><br />Slightly different input generates a different error message:<br /><pre><span style="color: rgb(102, 0, 0);">SELECT LOWER(</span><span style="color: rgb(0, 102, 0);">'İA'</span><span style="color: rgb(102, 0, 0);">) FROM RDB$DATABASE</span><br />*** IBPP::SQLException ***<br />Context: Statement::Fetch<br />Message: isc_dsql_fetch failed.<br /><br />SQL Message : -802<br />Arithmetic overflow or division by zero has occurred.<br /><br />Engine Code : 335544321<br />Engine Message :<br />arithmetic exception, numeric overflow, or string truncation<br /></pre><br />There is an <a href="http://tech.groups.yahoo.com/group/firebird-support/message/100535?var=1&l=1">item</a> on the Firebird user list, but without any answers so far.<br /><br /><span style="font-weight: bold;">Update:</span> As mariuz points out in a comment below, this defect now seems to be fixed in an upcoming version. 
See <a href="http://tracker.firebirdsql.org/browse/CORE-2355">this</a> bug tracker item.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com4tag:blogger.com,1999:blog-3840687515615686738.post-80589807711091060222009-01-06T13:06:00.020+01:002009-01-09T19:48:40.749+01:00Book: Real World Haskell (not much real world so far :)I've just started to read <a href="http://www.realworldhaskell.org/">Real World Haskell</a> (the paper book). It seems like a nice book (except for a few irritating and confusing typos/mistakes at the start of the book).<br /><br />However, I've read more than 100 pages so far, and still not a sign of any of the "real world" stuff promised by the title. I still don't know much or anything about IDEs, how to compile the code, scripting, any practical details on how to structure your code into modules, or anything in that direction. So far, mostly (sometimes rather long-winded) discussions on specific (list) functions. One of the examples ends in a conclusion that might be paraphrased as "by the way, don't use the function we've discussed the last few pages; in real world settings it doesn't work too well".<br /><br />In the real world, you run into both needles and haystacks occasionally, but that doesn't help in making sense of<pre style="color: rgb(102, 0, 0);">isInAny3 needle haystack = any (isInfixOf needle) haystack</pre>And one more real world example of the kind <code style="color: rgb(102, 0, 0);">zip3foobar "quux"</code> and I may start losing interest... or just start screaming.<br /><br />Well, the upcoming chapters have promising titles, so I guess I just have to keep reading. And I guess you have to start with the basics. 
Still, over 100 pages, and mostly foobars so far...<br /><br />The book is available <a href="http://book.realworldhaskell.org/read/">on-line</a>.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com2tag:blogger.com,1999:blog-3840687515615686738.post-28030421698277848452008-12-18T23:33:00.027+01:002009-01-06T16:51:27.330+01:00Scala for small throw-away scripting tasksI've come to use Scala for tiny scripts to be thrown away after doing some small task. Typically this involves processing a few files, comparing some textual data, maybe extracting some fields of tab-separated files, etc. The kind of things that Perl used to be the obvious choice for.<br /><br />Although lacking Perl's simplified syntax for iterating over all lines in files, Scala works quite nicely for small tasks.<br /><br />For example, today I had to extract from a file all lines of four or more characters including only upper-case characters, and capitalize the output:<pre style="color: rgb(102, 0, 0);">scala.io.Source.fromFile(args(0))<br />.getLines.map(_.stripLineEnd).filter(_.matches("[A-Z]{4,}"))<br />.map(_.toLowerCase.capitalize).foreach(println)</pre>Not exactly a thing of beauty, but it only took a minute and it works. And it reminds me a bit of a classic Unix command line pipeline.<br /><br />A few things on my wish-list to make Scala even better for small scripts:<br /><ul><li>A nicer way of setting the output character encoding (currently you have to do something like <code>Console.setOut(new java.io.PrintStream(Console.out,true,"UTF8"))</code>)</li><li> It would be great if <code>Source.getLines</code> could remove the new line character of each line</li><li>A better name for <code>RichString.stripLineEnd</code> (for some reason, it is totally impossible for me to remember the name of this method)<br /></li><li>Maybe scripting support in the Scala Netbeans plugin? 
(Currently, I think the plugin wants you to put your code in a class/object)<br /></li></ul>Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com3tag:blogger.com,1999:blog-3840687515615686738.post-14840551387297950572008-12-12T12:01:00.013+01:002008-12-12T12:37:26.898+01:00Scala: Reading a tab separated file into a Map (first attempt)Below is my first attempt, in Scala, at reading a tab separated file into a map, where the first and second fields of the input file make up the key-value pairs.<br /><br />There are probably better ways of doing it, but the following seems to work:<pre style="color: rgb(102, 0, 0);">val keyValuePairs = scala.io.Source.fromFile(inputFileName, "UTF8")<br /> .getLines.map(_.stripLineEnd.split("\t", -1))<br /> .map(fields => fields(0) -> fields(1)).toList<br /><br />val map = Map(keyValuePairs : _*)</pre><br />The <code>keyValuePairs : _*</code> stuff is a way to pass a list (<code>keyValuePairs</code>) to a method that takes a variable number of arguments, here the factory method of the (immutable) <code>Map</code>.<br /><br />I'm pretty sure that there are neater ways of doing it. Furthermore, the above snippet does not do any sensible error checking or input validation (such as skipping empty lines, for instance).Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-19110217217049561852008-12-11T16:26:00.009+01:002009-01-11T15:27:25.176+01:00Intelligent Software: Netbeans (or JUnit?) can count to three!I just noticed a (very) small detail in Netbeans. I was adding some unit tests when I noticed that Netbeans can count to, at least, three.<br /><br />When running a JUnit test suite of only one test, you get the message "The test passed". After adding another test, the message is "Both tests passed", then "3 tests passed", etc. 
(Well, of course, given that the tests pass.)<br /><br />Now, that's what I call (artificial) intelligence.<br /><br /><a href="http://www.economist.com/science/displaystory.cfm?story_id=12847128">Here's</a> an unrelated article on counting to three (and more).Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-69374334381253403632008-12-09T22:01:00.010+01:002008-12-18T23:33:17.476+01:00Scala: Beware of inadvertently shadowing variablesI've just spent 15 minutes looking for a stupid mistake in some Scala code. The problem was that I had shadowed a variable.<br /><br />In some situations in Scala, you are allowed to <span style="font-style: italic;">shadow</span> variables. In other words, it is sometimes legal to give a new variable the same name as an existing one. This can lead to mistakes. The following legal code illustrates how you can shadow a method input variable:<br /><pre><span style="color: rgb(102, 0, 0);">def theShadow(list :Array[String]) : Seq[String] = {</span><br /><span style="color: rgb(0, 102, 0);"> // Mistake! Inadvertently</span><br /><span style="color: rgb(0, 102, 0);"> // shadowing the input parameter:</span><br /><span style="color: rgb(102, 0, 0);"> val list = List("Asa", "nisi", "masa")</span><br /><span style="color: rgb(102, 0, 0);"> list</span><br /><span style="color: rgb(102, 0, 0);"> }</span></pre><br /><br />(The above is a very obvious example. 
When you make this mistake in real code, it will probably be in a less obvious context.)Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0tag:blogger.com,1999:blog-3840687515615686738.post-55248907508612679622008-12-09T12:48:00.017+01:002008-12-12T14:58:29.967+01:00Scala: XML serializer adds closing elements to empty elementsWhen printing Scala XML nodes/elements, closing tags for empty elements are added, even if there weren't any in the input.<br /><br />For example, if you input <code><childless/></code>, the XML processor will add a closing tag like this:<pre>scala> <span style="color: rgb(102, 0, 0);">val elem = <childless/></span><br /><span style="color: rgb(102, 0, 0);">elem: scala.xml.Elem = <childless></childless></span></pre><br />(The two versions of the XML element are equivalent, but sometimes it is practical to be able to do a simple string comparison of the input and output XML files. The added closing tags may make this harder.)<br /><br /><br />See <a href="http://www.nabble.com/minor-XML-question-td19157873.html#a19157873">this</a> thread.Nikolaj Lindberghttp://www.blogger.com/profile/12153448128671603936noreply@blogger.com0