Optimizing stream operations
One of the improvements in GNU Smalltalk 3.1 was the optimization of block operations on Streams. Methods such as #next: and #nextPutAll: were rewritten to use fast collection operations instead of copying data to the stream element by element. Since #replaceFrom:to:with:startingAt: (the basic Smalltalk "move" operation) ultimately boils down to a memmove call, this is extremely fast.
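The rewrite can be sketched roughly like this (a simplified WriteStream>>#nextPutAll:; the instance-variable names and the #growCollectionBy: helper are illustrative, and the real method is more careful about growing the underlying collection):

```smalltalk
nextPutAll: aCollection
    "Copy the whole argument with one #replaceFrom:to:with:startingAt:
     (ultimately a single memmove) instead of one #nextPut: per element."
    | n |
    n := aCollection size.
    ptr + n > collection size ifTrue: [self growCollectionBy: n].
    collection replaceFrom: ptr + 1 to: ptr + n with: aCollection startingAt: 1.
    ptr := ptr + n.
    ^aCollection
```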
But was this the end of the story? Today, after getting some inspiration from Nicolas Petton, I decided to try profiling Seaside and Swazoo.
I didn't go for anything very scientific. I just made an image with Seaside ready to serve:
gst-load --start Seaside Seaside-Development Seaside-Examples
I fired up gst-profile on this script:
(PackageLoader packageAt: 'Seaside') start. (Delay forSeconds: 40) wait
Then I had 40 seconds to interact with the browser until the profiler exited and wrote its output file (which is compatible with callgrind_annotate). The profiler is a new feature of the upcoming 3.2 version.
I tried a few times and checked that the profiles I obtained were similar even if I visited different Seaside examples. So, I could pretend that I was "scientifically" reproducing the same scenario---even though I was clicking almost randomly. :-)
Here are the resulting top methods:
- WriteStream>>nextPut: (15%)
- PositionableStream>>next (8%)
- Seaside.WAEncoder>>nextPutAll: (5.7%)
- WriteStream>>next:putAll:startingAt: (5.5%)
This is not a nice profile... The nice block-based operation #next:putAll:startingAt: is being called, but more than 22% of the executed bytecodes go into moving data one byte at a time. Some byte-level processing is unavoidable (the third method, WAEncoder, escapes HTML entities and more or less has to operate one byte at a time), but working on bigger chunks of data is simply more efficient.
Running callgrind_annotate with --inclusive=yes (which tells you, roughly speaking, which subtrees of the execution are more expensive) is even more depressing. 30% of the execution time is spent reading HTTP requests. What can we do to improve this?
Another callgrind_annotate command-line option, --tree=caller, and we have some answers. As expected, WAEncoder is one of the big callers of #nextPut:, but there are more. The lowest-hanging fruit is #upToEnd, a Stream method that returns the contents of the stream up to its end. Since HTTP requests are parsed a line at a time, the name and value are roughly obtained like this:
name := stream upTo: $:.
value := stream upToEnd.
Since a header field's name is usually short, most of the time is spent in #upToEnd rather than #upTo:. So optimizing #upToEnd is quite effective. And easy too:
| ws |
ws := WriteStream on: (self species new: 8).
self nextPutAllOn: ws.
^ws contents
#nextPutAllOn: does a fast transfer of the receiver's remaining contents onto ws, with no intermediate copies. It uses #next:putAll:startingAt:, so it saves both #next and #nextPut: operations. In fact, this is enough to kick #next down to third place in the above profile, and to decrease the HTTP request parsing cost to 20%.
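For a stream reading from a collection, #nextPutAllOn: can be imagined as something like this (a sketch; the instance-variable names follow the classic collection/position/readLimit scheme and are assumed, and the real code also handles buffered streams):

```smalltalk
nextPutAllOn: aStream
    "One block transfer from the current position to the end, instead of
     a loop of #next followed by #nextPut:."
    aStream next: readLimit - position
            putAll: collection
            startingAt: position + 1.
    position := readLimit
```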
Next, I did some microoptimization of WAEncoder. I got it from 24 down to 21 bytecodes per byte, which is actually not bad (it is called on every byte outside tag and attribute names), but the big targets for optimization are still #nextPut: and #next.
But when I say optimize, I mean not calling them at all. It is (almost) impossible to save bytecodes in those methods; they are too short. And while it would be possible to translate them to C (in fact, Smalltalk-80 did exactly that!), doing so would not remove the overhead of the loops that surround them.
Let's look at the callers of #next:
- SwazooStream>>#nextByte (52%)
- Stream>>#upTo: (33%)
- HTTPString class>>#decodedHTTPFrom: (15%)
The last one is unavoidable; it's a Swazoo utility method that is basically the same as Seaside's WAEncoder, but for URL encoding instead of HTML encoding. It's also the least expensive, so we can move on. The first one is about something I hate about Swazoo: it has so many layers between sockets and their clients that an SpSocket is stored in a SwazooSocket, which is buffered by a SwazooStream, which owns two SwazooBuffer objects! SwazooBuffer is a descendant of Stream, so whenever SwazooStream accesses a buffer it calls Stream>>#next.
The solution is to remove the SwazooBuffer class, but that is not so easy. I did not finish this, but I got far enough that I could still play with and profile Seaside.
What's left is Stream>>#upTo:, which is a very nice target, for two reasons:
- It is in the base classes, so that fixing it will give speedups for basically all accesses to data that are delimited by a single character. There are a million such examples in Swazoo---not just the headers, but also GET and POST parsing (splitting by & and by equal signs), dispatching of URLs to handlers (which works one path component at a time), parsing the HTTP GET line.
- It calls both #next and #nextPut:, so you hit both targets at the same time. If we can optimize it, it will kick #next out of the profile and will leave WAEncoder and #decodedHTTPFrom: as the only important users of #nextPut:.
Now, the abstract Stream class is not very amenable to block optimizations, because it can operate on any kind of data, finite or infinite, produced in many different ways: generators, SQL queries, XML pull parsers, and random number generators can all be implemented as Streams. However, we can always find an appropriate superclass and implement a refined, faster version of #upTo: there.
The correct place is the awfully-named PositionableStream, which should probably have been called SequenceableCollectionStream (the awful name comes from Smalltalk-80, so it will be thirty next year). The first solution that came to my mind was to use #indexOf: to find the delimiter, and then slice the return value out of the collection, up to just before the delimiter. This brought the cost of HTTP header parsing down to approximately 15%, but it had an unexpected effect (well, not so unexpected with hindsight) on the profile:
- WriteStream>>nextPut: (10%)
- Seaside.WAEncoder>>nextPutAll: (8%)
- SequenceableCollection>>indexOf:startingAt:ifAbsent: (5%)
- and then, further down, WriteStream>>next:putAll:startingAt: (2%)
What's the new guy in there? Is there a better way to implement #upTo:? The question is not stupid---the code now does two scans, one to find the delimiter and one to copy the data. Maybe we should limit ourselves to one scan?
The answer is no. The problem is that #indexOf: is written in Smalltalk, while the bulk of #next:putAll:startingAt: is a call to memmove. Thus, the solution is to write a primitive that calls memchr. The mem* functions are so optimized (the newest x86 processors even have instructions to implement them; instructions that are actually fast, unlike the old REPNE SCASB) that it's pointless to write Smalltalk code competing with them.
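Putting the pieces together, the two-scan #upTo: on PositionableStream looks more or less like this (a sketch; the ivar names are assumed, and with the memchr primitive behind #indexOf:startingAt:ifAbsent: the first scan becomes as cheap as the copy):

```smalltalk
upTo: anObject
    "First scan: find the delimiter.  Second scan: one block copy of
     everything before it.  If the delimiter is absent, behave as #upToEnd."
    | end result |
    end := collection indexOf: anObject startingAt: position + 1 ifAbsent: [0].
    end = 0 ifTrue: [^self upToEnd].
    result := collection copyFrom: position + 1 to: end - 1.
    position := end.
    ^result
```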
Here is the top profile after finishing this work:
- WriteStream>>nextPut: (11%)
- Seaside.WAEncoder>>nextPutAll: (9%)
- WriteStream>>next:putAll:startingAt: (2.5%)
- WriteStream>>next (2%)
The cost of HTTP parsing is now around 10% (we started at 30%)... without touching it!!! The benefit comes entirely from rewriting the stream operations to operate on big chunks of data whenever useful. Thus, this is a great example of how these low-level optimizations will benefit GNU Smalltalk programs in general.
_EDIT_: another tweak to the base classes (reusing the #at: primitive for SequenceableCollection>>#at:ifAbsent:) brought HTTP request parsing down to the 7% range. The methods involved are omitted from the profiles above because they are not related to stream operations.