UnicodeString encoding weirdness
| Project: | GNU Smalltalk |
| Component: | Base classes |
| Category: | bug |
| Priority: | normal |
| Assigned: | bonzinip |
| Status: | fixed |
| Attachment: | unitest2.st.txt (849 bytes) |
Take the attached program, which prints here:
34
4
E3 <-> EF
81 <-> BF
AA <-> BE
E3 <-> E6
81 <-> A8
BE <-> B0
E3 <-> E7
81 <-> B8
9F <-> B0
But it should print (at least as far as my understanding of
Unicode and encodings goes):
3
3
E3 <-> E3
81 <-> 81
AA <-> AA
E3 <-> E3
81 <-> 81
BE <-> BE
E3 <-> E3
81 <-> 81
9F <-> 9F
Updates
| Attachment: | gst-encoding-lazy.patch (594 bytes) |
EF-BF-BE is U+FFFE encoded in UTF-8, that is, the Unicode "byte order mark" (BOM, U+FEFF) with its two bytes swapped. The BOM was born as a way to distinguish big- and little-endian UTF-16. Since it's not really a character, Iconv tries to strip it when converting to a UnicodeString, but it is failing to do so in this case.
Now, under Mac OS X I get the expected result, while under Linux I get yours. The reason is that my Mac is big-endian, so Iconv produces big-endian UTF-16, while Linux produces little-endian UTF-16. Since UTF-16 defaults to big-endian, the Mac happens to get the right thing, while on Linux the bytes are read in the wrong order. So later on the "pipe peekFor: $<16rFEFF>" statement that is meant to strip the BOM does not match.
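The byte values in the report can be reproduced outside GNU Smalltalk. The following is a minimal Python sketch, not the Iconv code path itself: it assumes the test string is the three characters whose UTF-8 bytes appear in the left column (E3 81 AA, E3 81 BE, E3 81 9F) and that the little-endian host emits a leading BOM. The helper `hex_bytes` is a hypothetical name used only for display.

```python
# The three characters from the left column of the report.
text = "\u306a\u307e\u305f"

# Little-endian UTF-16 with a leading BOM (FF FE), as assumed for the Linux host.
utf16_le = b"\xff\xfe" + text.encode("utf-16-le")

# Reading those bytes as big-endian swaps the two bytes of every code unit,
# so the BOM U+FEFF turns into U+FFFE and is no longer recognised as a BOM.
misread = utf16_le.decode("utf-16-be")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (display helper)."""
    return " ".join(f"{b:02X}" for b in data)


print(len(text), len(misread))
print([f"U+{ord(c):04X}" for c in misread])
print(hex_bytes(text.encode("utf-8")))
print(hex_bytes(misread.encode("utf-8")))
```

Under these assumptions the misread string has four characters instead of three, and its UTF-8 form starts with EF BF BE, E6 A8 B0, E7 B8 B0, which lines up with the right column of the reported output.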
The attached patch fixes this by making EncodedString look for a BOM when retrieving the encoding, rather than when setting it.
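As a rough illustration of the idea (checking for a BOM when the encoding is retrieved, not when it is set), here is a Python sketch. `LazyEncodedBytes` and its methods are hypothetical stand-ins, not the GNU Smalltalk EncodedString class or the contents of the patch, and the fallback to big-endian when no BOM is present is an assumption taken from the comment above.

```python
class LazyEncodedBytes:
    """Hypothetical stand-in for an encoded string with a lazily resolved encoding."""

    BOMS = {b"\xfe\xff": "utf-16-be", b"\xff\xfe": "utf-16-le"}

    def __init__(self, data: bytes, encoding: str):
        self.data = bytearray(data)
        self._encoding = encoding  # stored as given; no BOM inspection here

    def _bom(self) -> bytes:
        """Return the BOM bytes if the nominal encoding is UTF-16 and one is present."""
        head = bytes(self.data[:2])
        if self._encoding.upper() == "UTF-16" and head in self.BOMS:
            return head
        return b""

    @property
    def encoding(self) -> str:
        # Look at the BOM only when asked, so bytes written after
        # construction are still taken into account.
        if self._encoding.upper() != "UTF-16":
            return self._encoding
        # Assumption: without a BOM, UTF-16 is treated as big-endian.
        return self.BOMS.get(self._bom(), "utf-16-be")

    def number_of_characters(self) -> int:
        body = bytes(self.data)[len(self._bom()):]  # the BOM is not a character
        return len(body.decode(self.encoding))


s = LazyEncodedBytes(bytes(2), "UTF-16")
s.data[0:2] = b"\xfe\xff"   # big-endian BOM written after construction
print(s.encoding, s.number_of_characters())
s.data[0:2] = b"\xff\xfe"   # little-endian BOM
print(s.encoding, s.number_of_characters())
```

Because the byte order is resolved on each query, a string that consists only of a BOM counts as zero characters in either byte order, even when the BOM is written after the object is created.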
| Assigned to: | Unassigned | » bonzinip |
| Status: | active | » fixed |
Fixed in patch-612, which is the same patch I posted plus this test case:
str := EncodedString fromString: (String new: 2) encoding: 'UTF-16'.
str valueAt: 1 put: 254; valueAt: 2 put: 255.
self assert: str numberOfCharacters = 0.
str valueAt: 1 put: 255; valueAt: 2 put: 254.
self assert: str numberOfCharacters = 0
Thanks!
