Hunting down CP1252 with Iconv
I was looking for ways to improve a character encoding guesser in PHP, and decided that it was time to look at Iconv.
One important step toward understanding how encodings, or "charsets", work in GST is to see that the model is very similar to ISO C90's. We reinterpret the traditional String just as C strings were reinterpreted; a String is no longer quite what it appears to be.
Here is a demonstration on my system, once Iconv is loaded:
st> 254 asCharacter asString inspect!
An instance of String
contents: [
[1]: $<16r00C3>
[2]: $<16r00BE>
]
This means that 254 = 254 asCharacter asString first codePoint is not always true, which is pretty strange at first glance.
The difference arises because, when Iconv is loaded, converting integers to characters is always done Unicode-wise. The multi-"character" representation is really the multibyte representation: my locale's character encoding is fixed to UTF-8, and 0xC3 0xBE is the UTF-8 encoding of Unicode codepoint 254, þ.
In short, you can't treat the things you get with #at: sent to strings as anything other than fancy bytes anymore; they aren't really Characters.
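To make the "fancy bytes" point concrete, here is a minimal sketch that walks the String element by element and prints each value in hex. It assumes, like the transcript above, that Iconv is loaded and the locale's charset is UTF-8; with another locale the output would differ.

```smalltalk
"Sketch only: assumes Iconv is loaded and a UTF-8 locale."
254 asCharacter asString do: [:each |
    "each is one element of the encoded String, i.e. one byte of the UTF-8 form"
    Transcript display: (each value printString: 16); nl]!
```

Going by the inspector output above, this should print C3 and then BE, the two bytes of the encoded þ rather than the single codepoint 254.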
Fortunately, code deals with individual characters so rarely that people internationalizing applications only have a few bugs to fix. However, you will probably have to deal with encodings other than the locale's (for example, you might provide a config file or documentation file in Latin-1 while the user is set up for ISO-2022-JP interaction), so it pays to understand the relationships between the various types.
Using Iconv
I have already said how Characters change; how do the other classes fit into Iconv? A simple rule is to remember that Strings and ByteArrays are encoded forms, and are almost exactly the same. UnicodeStrings are the unencoded forms, because the interpretation of their codepoints is always the same: direct Unicode.
To convert ByteArrays or Strings to UnicodeStrings, use #asUnicodeString:, which decodes using the given charset, specified as a string like 'UTF-8' or 'ISO-8859-15'. If you leave off the argument, the locale's charset is used.
To convert UnicodeStrings to Strings or ByteArrays, use #asString:. Again, if you leave off the argument you get the locale's charset.
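As a rough illustration of the two conversions together, the sketch below decodes a single Latin-1 byte and re-encodes it as UTF-8. It uses only the messages described above; the variable names are mine, and it assumes Iconv is already loaded.

```smalltalk
"Sketch only: assumes Iconv is loaded; variable names are illustrative."
| latin1Bytes unicode utf8 |
latin1Bytes := ByteArray with: 254.                      "þ in Latin-1"
unicode := latin1Bytes asUnicodeString: 'ISO-8859-1'.   "decode to the unencoded form"
utf8 := unicode asString: 'UTF-8'.                      "encode again in another charset"
unicode first codePoint printNl.
utf8 size printNl!
```

If the conversions behave as described, the first line printed should be 254 and the second 2, since þ needs two bytes in UTF-8. Passing the charset explicitly in both directions keeps the result independent of the locale.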
The mission
My mission was to pick out the codepoints that change between Latin-1, the traditional US/Western Europe charset, and CP1252, a charset traditionally popular on Windows. The problem is that CP1252 assigns graphical characters to some of the bytes at 128 and above that Latin-1 reserves for control characters, and some editors lie and claim that CP1252 is Latin-1. I started by just reading the list of differences from Wikipedia and making a regex from it:
"Byte ranges where CP1252 differs from Latin-1, taken from Wikipedia's list."
{128 to: 128. 130 to: 140. 142 to: 142.
 145 to: 156. 158 to: 159} do: [:iv |
    iv do: [:n |
        Transcript display: '\x'; display: (n printString: 16)]]!
\x80\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8E\x91\x92\x93\x94\x95\x96...
I decided to check my work and play with iconv anyway, so keeping in mind that both Latin-1 and CP1252 are single-byte encodings and that Latin-1 is a codepoint-wise subset of Unicode:
st> 0 to: 255 do: [:byte | | latin1 cp1252 |
        "Decode the same single byte under each charset."
        latin1 := (ByteArray with: byte) asUnicodeString: 'ISO-8859-1'.
        cp1252 := [(ByteArray with: byte) asUnicodeString: 'CP1252']
            on: I18N.InvalidSequenceError
            do: [:ex | ex return: nil].    "nil marks a byte that is invalid in CP1252"
        "Report only the bytes that decode differently."
        (cp1252 isNil or: [latin1 = cp1252]) ifFalse:
            [Transcript display: '%1: %2 -> %3'
                % {cp1252. byte. cp1252 first codePoint}; nl]]!
€: 128 -> 8364
‚: 130 -> 8218
ƒ: 131 -> 402
„: 132 -> 8222
…: 133 -> 8230
†: 134 -> 8224
‡: 135 -> 8225
ˆ: 136 -> 710
‰: 137 -> 8240
Š: 138 -> 352
‹: 139 -> 8249
Œ: 140 -> 338
Ž: 142 -> 381
‘: 145 -> 8216
’: 146 -> 8217
“: 147 -> 8220
”: 148 -> 8221
•: 149 -> 8226
–: 150 -> 8211
—: 151 -> 8212
˜: 152 -> 732
™: 153 -> 8482
š: 154 -> 353
›: 155 -> 8250
œ: 156 -> 339
ž: 158 -> 382
Ÿ: 159 -> 376
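The listing has 27 lines, which matches the 27 bytes covered by the hand-typed ranges. A minimal sketch of making that comparison mechanical follows; it reuses the same messages as the loop above, the variable names are hypothetical, and it assumes Iconv is loaded.

```smalltalk
"Sketch only: compares the Wikipedia-derived ranges with what iconv reports."
| ranges expected mismatches |
ranges := {128 to: 128. 130 to: 140. 142 to: 142. 145 to: 156. 158 to: 159}.
expected := Set new.
ranges do: [:iv | expected addAll: iv].
mismatches := 0.
0 to: 255 do: [:byte | | latin1 cp1252 differs |
    latin1 := (ByteArray with: byte) asUnicodeString: 'ISO-8859-1'.
    cp1252 := [(ByteArray with: byte) asUnicodeString: 'CP1252']
        on: I18N.InvalidSequenceError
        do: [:ex | ex return: nil].
    "A byte differs when it is valid CP1252 and decodes to another codepoint."
    differs := cp1252 notNil and: [latin1 ~= cp1252].
    (differs = (expected includes: byte))
        ifFalse: [mismatches := mismatches + 1]].
mismatches printNl!
```

If the two lists agree, the count printed is 0; any other value means a range was mistyped or the local iconv tables differ from Wikipedia's.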
I also discovered that, surprisingly, there are some bytes in 0..255 that are not valid CP1252, which isn't the case for Latin-1. A simple change to the exception handler for invalid CP1252 sequences above could log these as well; I left it out, being confident of the text I'm working with, but you might include that in a more precise charset guesser.
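The change to the exception handler mentioned above might look something like this sketch, which logs each byte that fails to decode instead of silently returning nil. It uses only the messages from the earlier loop and assumes Iconv is loaded.

```smalltalk
"Sketch only: report the bytes that are not valid CP1252."
0 to: 255 do: [:byte |
    [(ByteArray with: byte) asUnicodeString: 'CP1252']
        on: I18N.InvalidSequenceError
        do: [:ex |
            Transcript display: 'invalid CP1252 byte: ';
                display: (byte printString: 16); nl.
            ex return: nil]]!
```

CP1252 leaves five bytes unassigned (0x81, 0x8D, 0x8F, 0x90 and 0x9D), so those are the ones I would expect to see, though what is actually reported depends on the iconv implementation underneath.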
UTF-8
Finally, I strongly recommend that you contribute to the discovery of multibyte bugs and help to make encoding guessing easier by always using UTF-8, including on your terminal.
I first adopted this policy when I found that CLISP provides five standard custom charset fields for different purposes, and realized that I wouldn't need to keep in mind all the nuances of their differences if I used, everywhere, a charset that could encode all of Unicode.

> This means that 254 = 254 asCharacter asString first codePoint is not always true, which is pretty strange at first glance.
Right. OTOH,
254 = 254 asCharacter asUnicodeString first codePoint.