Collection idioms in GNU Smalltalk (and problems thereof)

Tagged:  •    •    •    •  

Peter Norvig's blog showed an example of a toy spell checker in Python, which Michael Davies converted to Smalltalk. This problem is interesting and Michael's solution is actually quite idiomatic.

I'll show a couple more tricks that can help decrease the number of lines.

First, you can use the Bag class to count occurrences. It does internally exactly the same thing that Peter's code does using a Dictionary in Python. Loading the training set becomes simply nwords := lines findWords asBag. Also, Michael used nwords keys includes: to test presence of a key in a dictionary; that's slow and it would have been better to write nwords includesKey: word when using a dictionary, but with a bag you just write nwords includes:. Down to 54-4-5=45 lines.

Returning Sets is perfectly fine from edits1: and known_edits2: (underscores are rarely used for identifiers as Michael probably knows, anyway that's not important). By not worrying about returning Arrays you can omit more conversions. Also, if you have a new version of GNU Smalltalk (>= 3.0.2), you can rewrite these two methods like this:

edits1: word [
    | n |
    n := word size.
    ^Set join: {
        0 to: n - 2 collect: [ :i | word swapAt: i.
        1 to: n - 1 collect: [ :i | word removeAt: i ].
        alphabet gather: [ :letter | 1 to: n collect: [ :i | word replace: letter at: i ] ].
        alphabet gather: [ :letter | 0 to: n collect: [ :i | word insert: letter at: i ] ] }

knownEdits2: word [
    ^(self edits1: word) gather: [ :e1 | self known: (self edits1: e1) ] ]

For this change to work, alphabet should be an Array and not a String. I'm not sure this is a bug, it might be fixed soon. This brings it down to 39 lines.

I'll also point out that you can define swapAt:, removeAt:, insertAt: using a powerful method, copyReplaceFrom:to:with:

  swapAt: i [ ^self copyReplaceFrom: i+1 to: i+2 with: {self at: i+2. self at: i+1} ]
  removeAt: i [ ^self copyReplaceFrom: i+1 to: i+1 with: #() ]
  insert: l at: i [ ^self copyReplaceFrom: i+1 to:i with: {l} ]

I recommed against inlining these, but you can. Since these are the less readable lines in Mike's Python code, it is fair to outline them in Python too and add 5 lines to Mike's code, bringing the final comparison 26-39.

Also, to improve the readability of "bestUsing:", you can use "fold:" instead of "inject:into:". It uses the first element to start the iteration, and the block is invoked the first time with the first two elements.

Actually, #ifEmpty: and the equivalent of Python's max-accepting-a-lambda could be added to the Smalltalk standard library, actually, which would make 26-35. These final changes may be considered unfair, but I wanted to see where the extra lines coming from. The answer is:

  • lines like "Foo extend" because Python's String functions are not methods
  • class declarations (the Python code is not embedded in a class, you cannot do that in Smalltalk; passing the filename as a parameter would require additional lines in Python but not in Smalltalk)
  • variable declarations

The only method that really comes up longer is #correct:, where Python's idiomatic "or" construct allows very short code compared to Smalltalk:

candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]

candidates := (self known: {word}) ifEmpty: [
    (self known: (self edits1: word)) ifEmpty: [
         (self knownEdits2: word) ifEmpty: [ {word} ] ] ].

The readability of the Smalltalk code is probably debatable (it's wide!), but it's a matter of being used to one style or the other, at least IMO.

The big problem here is speed. I thought I hit the bug about GNU Smalltalk's abysmal performance with big collections, so I tried rewriting it with generators. Well, they use continuations so I was trading one slow thing with another. :-)
I then tried Stream decorators instead -- this complicated name just means that you can apply methods like collect: or select: to Streams also. That's fast as it does not involve continuations, but repeatedly sending the concatenation message , is quadratic.

The result was 15 times slower than Python, here. After profiling I found a couple of hotspots that could be easily fixed and are responsible for ~30% of the time. But...

(update) ... it turns out that creating a Set is useless. Arrays work well enough, because in this application duplicates are quite rare. Duplicates are more important in knownEdits2:, but not even the original code by Peter Norvig rejects them there.

So, just replacing Set with Array in the implementation of edits1: above brings performance to 5 times slower. I think I'm going to make Array adopt the memcpy primitive used by String or ByteArray, which should improve performance further.

(update more) ... I had a pig in the snake. Building an Array of words and turning into a Bag was causing useless pointers from oldspace to newspace, because the array was tenured and the words were not. Constructing the bag directly saved 25% more runtime. Now I have two thirds of the time spent in the bytecode interpretation loop, which means it's harder to optimize but should also give more uniform rewards if fixed.

User login