Språkbanken has published a large, freely available Swedish lexicon, Saldo, "a Swedish basic language resource". The release appears to include some 68,000 uninflected lemma forms as well as more than 740,000 expanded (full) word forms. It contains both morphological and semantic information.
This resource should be valuable for part-of-speech tagging, lemmatization, spell-checking, (semantic) analysis of Swedish text, and so on.
The release includes software for interfacing with the lexicon. (Those of you into functional programming might be interested in the fact that the lexicon software is written in Haskell.)
It is released under the LGPL license.
Wednesday, 21 May 2008
Saldo 1.0: Large, freely available Swedish morphologic and semantic lexicon
Friday, 16 May 2008
Scala one-liner for upcasing lines of text
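The following is a Scala script that upcases each line of a UTF-8 encoded input file (args(0)) and prints the result to standard output:
import scala.io.Source
// Console.withOut replaces the older Console.setOut, which newer Scala versions no longer provide
Console.withOut(new java.io.PrintStream(System.out, true, "UTF-8")) {
  // getLines() strips line terminators in Scala 2.8 and later, so println restores them
  Source.fromFile(args(0), "UTF-8").getLines().foreach(line => println(line.toUpperCase))
}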
If you trust the default character encoding to work for you, you may reduce it to:
import scala.io.Source
Source.fromFile(args(0)).getLines().foreach(line => println(line.toUpperCase))
Another way to do it is to read the lines into an iterator, using the iterator's .map method to upcase each line:
import scala.io.Source
val lines = Source.fromFile(args(0)).getLines().map(_.toUpperCase)
lines.foreach(println)
A Java programmer may be relieved (or horrified) to learn that Scala does not have any checked exceptions. There are only runtime exceptions, and you don't need to add any try/catch statements if you don't want to.
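To make the point about exceptions concrete, here is a minimal sketch. The helper firstLine and the file path are hypothetical names made up for this illustration. Opening a missing file raises a FileNotFoundException, which Java would force you to declare or catch, yet the Scala code compiles without either. Wrapping the call in scala.util.Try is one optional way to deal with the failure if you do want to.

```scala
import java.io.FileNotFoundException
import scala.io.Source
import scala.util.{Failure, Success, Try}

// Hypothetical helper: returns the first line of a file.
// No throws clause and no try/catch is needed, although
// Source.fromFile can throw FileNotFoundException.
def firstLine(path: String): String = {
  val src = Source.fromFile(path, "UTF-8")
  try src.getLines().next() finally src.close()
}

// Handling the failure is optional; Try turns the exception into a value.
// The path is assumed not to exist.
val result: Try[String] = Try(firstLine("/no/such/dir/no-such-file.txt"))

val message = result match {
  case Success(line)                     => line
  case Failure(_: FileNotFoundException) => "file not found"
  case Failure(other)                    => "other failure: " + other.getMessage
}
println(message)
```

The try/finally inside firstLine is there only to close the file, not to satisfy the compiler.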
When you run a Scala script, you can instruct the Scala interpreter to compile the script and reuse the compiled version (a jar file) as long as it is newer than the source file. This gives better performance (shorter start-up time, for instance). To enable this, pass the -savecompiled command line argument to the scala command.
Saturday, 10 May 2008
The Scala programming language and XML
The Scala programming language is a combined scripting and "proper" language that sits on top of the Java VM. You can either run scripts, similar to how you run a Ruby or Perl script, or compile your Scala classes to Java bytecode. You run a Scala application similar to how you run a Java application. You can also run a Scala application using the Java VM directly (but you have to add the Scala library jar file to your class path). You can mix Java and Scala programs, calling Scala objects from Java, and vice versa.
Scala has a feature that I have never seen in a language suitable for general programming: XML (processing) as a feature of the language. The people behind Scala have added XML to the syntax of the language itself. You do not have to load some library or use some special API for processing XML, since it's already part of the language.
It is not only that XML is valid in Scala code; XML also has its own built-in data types. For instance,
val xml = <vegetable>potato</vegetable>
is a valid Scala statement. In other words, XML elements written in a Scala program are not merely strings. The xml object can now be manipulated in various ways, much like a DOM object in Java (but with less hassle than in Java).
You can refer to variables in your XML:
val veg = "potato"
val col = "white"
val xml = <vegetable colour={col}>{veg}</vegetable>
// The value of xml now corresponds to
// <vegetable colour="white">potato</vegetable>
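As a rough illustration of what "manipulated in various ways" can mean, the sketch below builds the same element (named vegetable here to keep the example separate) and reads its name, text, and attribute. It assumes the scala.xml library is available on the class path, which in recent Scala versions means adding the separate scala-xml module.

```scala
import scala.xml.Elem

val veg = "potato"
val col = "white"
val vegetable: Elem = <vegetable colour={col}>{veg}</vegetable>

val label   = vegetable.label                 // the element name
val content = vegetable.text                  // the text inside the element
val colour  = (vegetable \ "@colour").text    // the value of the colour attribute

// Elements are immutable; copy returns a new element with the change applied.
val renamed = vegetable.copy(label = "root_vegetable")
println(renamed)
```

Because elements are immutable values, renamed is a new element and vegetable itself is left unchanged.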
You can also embed function/method calls into XML elements. Imagine that you have a method that returns a sequence of n XML elements, like this (you'll need to import scala.xml.NodeSeq and scala.xml.NodeBuffer):
def genNumElems(n: Int): NodeSeq = {
  val result = new NodeBuffer
  for (i <- 1 to n) {
    result &+ (<number value={i.toString}/>)
  }
  result
}
(The odd-looking &+ operator means "add".)
You can now embed a call to genNumElems in an XML element, e.g., like this:
val numList = <number_list>{genNumElems(4)}</number_list>
Printing numList produces:
<number_list>
<number value="1"></number>
<number value="2"></number>
<number value="3"></number>
<number value="4"></number>
</number_list>
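The same sequence can also be built without a mutable buffer. The following is only an alternative sketch, not a replacement for the version above: it maps each number to an element and converts the result with NodeSeq.fromSeq. It again assumes scala.xml is on the class path.

```scala
import scala.xml.{Elem, NodeSeq}

// Alternative sketch of genNumElems: map each number to an element,
// then convert the resulting sequence to a NodeSeq.
def genNumElems(n: Int): NodeSeq =
  NodeSeq.fromSeq((1 to n).map(i => <number value={i.toString}/>))

val numList: Elem = <number_list>{genNumElems(4)}</number_list>

// Read the value attributes back out of the generated elements.
val values = (numList \ "number").map(elem => (elem \ "@value").text)
println(values.mkString(", "))
```

Reading the attributes back is a quick way to check that the embedded call produced one element per number.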
If you want nicer output, you can use a PrettyPrinter (that you import from scala.xml._):
val pp = new PrettyPrinter(100, 2) // width and indentation
println(pp.format(xml))
Reading and writing XML data/files to and from Scala is easy. The following is a one-liner that reads an XML file given as a command line argument (args(0)) and returns a list of all elements named "tr" that are child elements of any element called "table" in the XML file:
val trNodes = scala.xml.XML.loadFile(args(0)) \\ "table" \ "tr"
You may print the <tr> elements (with an empty line between each element) thus:
trNodes.foreach(tr => println(tr.toString + "\n"))
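The difference between the two operators is easy to miss: \\ searches descendants at any depth, while \ only looks at direct children. Here is a small sketch using an in-memory document instead of a file. The sample document and its element contents are made up for this illustration, and scala.xml is assumed to be on the class path.

```scala
import scala.xml.XML

// Hypothetical sample document, invented for this example.
val doc = XML.loadString(
  """<doc>
    |  <table><tr>a</tr><tr>b</tr></table>
    |  <section>
    |    <table><tr>c</tr></table>
    |    <tr>stray</tr>
    |  </section>
    |</doc>""".stripMargin)

// All tr elements that are direct children of a table at any depth.
val trNodes = doc \\ "table" \ "tr"
val texts = trNodes.map(_.text)

// For comparison: every tr in the document, and tr children of the root only.
val allTr  = doc \\ "tr"
val rootTr = doc \ "tr"

println(texts.mkString(", "))
```

The stray tr element is found by doc \\ "tr" but not by the table query, since it is not a child of any table.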
The built-in XML support in Scala's syntax and basic libraries is not the most important or interesting feature of Scala, but it sure seems to be very useful.
(Incidentally, the table and tr elements above are present in Oocalc's (OpenOffice.org) XML format for spreadsheets.)
Update: It appears that it is not always good advice to use scala.xml.XML.loadFile to read an XML document. One reason is that comment elements are lost. For more advanced XML processing, one should turn to scala.xml.parsing.ConstructingParser.fromFile.
Update: You may run into trouble when processing larger XML documents using the second approach. See this comment.
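For what it is worth, here is a minimal sketch of the ConstructingParser route. It uses fromSource on an in-memory string rather than fromFile, purely so that the example is self-contained; the input string is made up, and scala.xml is assumed to be on the class path.

```scala
import scala.io.Source
import scala.xml.Comment
import scala.xml.parsing.ConstructingParser

// Hypothetical input containing a comment.
val input = "<table><!-- header row --><tr>a</tr></table>"

// The second argument tells the parser whether to preserve whitespace.
val parser = ConstructingParser.fromSource(Source.fromString(input), true)
val root = parser.document().docElem

// The comment survives parsing as a scala.xml.Comment node.
val comments = root.child.collect { case c: Comment => c.commentText.trim }
println(comments.mkString(", "))
```

The comment shows up as an ordinary child node of the root element, next to the tr element.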