UnicodeString encoding weirdness

Project: GNU Smalltalk
Component: Base classes
Category: bug
Priority: normal
Assigned: bonzinip
Status: fixed
Attachment: unitest2.st.txt (849 bytes)
Description

Take the attached program, which prints here:

3
4

4
E3 <-> EF
81 <-> BF
AA <-> BE
E3 <-> E6
81 <-> A8
BE <-> B0
E3 <-> E7
81 <-> B8
9F <-> B0

But it should print (at least as far as my understanding of
Unicode and encodings goes):

3
3

3
E3 <-> E3
81 <-> 81
AA <-> AA
E3 <-> E3
81 <-> 81
BE <-> BE
E3 <-> E3
81 <-> 81
9F <-> 9F

Updates

#1 submitted by Paolo Bonzini on Mon, 10/22/2007 - 09:01
Attachment:gst-encoding-lazy.patch (594 bytes)

EF-BF-BE is U+FFFE, the byte-swapped form of the Unicode "byte order mark" (BOM, U+FEFF), encoded in UTF-8. The BOM was born as a way to distinguish big- and little-endian UTF-16. Since it's not really a character, Iconv tries to strip it when converting to a UnicodeString, but it is failing to do so in this case.
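
A quick way to check which code point produces the EF-BF-BE prefix seen in the output is to encode U+FEFF and U+FFFE by hand. This is only a sketch: the block name utf8 is made up for illustration, and it handles just the three-byte range U+0800..U+FFFF.

```smalltalk
"Hand-rolled UTF-8 encoding for code points in U+0800..U+FFFF.
 The variable name utf8 is hypothetical, not part of any gst API."
| utf8 |
utf8 := [:cp |
    Array
        with: (16rE0 bitOr: (cp bitShift: -12))
        with: (16r80 bitOr: ((cp bitShift: -6) bitAnd: 16r3F))
        with: (16r80 bitOr: (cp bitAnd: 16r3F))].

"U+FEFF, the real BOM: bytes EF BB BF"
((utf8 value: 16rFEFF) collect: [:b | b printString: 16]) displayNl.

"U+FFFE, the byte-swapped BOM: bytes EF BF BE"
((utf8 value: 16rFFFE) collect: [:b | b printString: 16]) displayNl.
```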

Now, under Mac OS X I get the expected result; under Linux I get yours. The reason is that my Mac is big-endian, so Iconv produces big-endian UTF-16, while on Linux it produces little-endian UTF-16. Since UTF-16 without an explicit byte order is treated as big-endian by default, the Mac happens to get the right thing, while Linux messes up the encoding: the little-endian BOM is read as 16rFFFE instead of 16rFEFF. So later on the "pipe peekFor: $<16rFEFF>" statement to strip the BOM does not work.
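
The byte-order mix-up can be reproduced with plain integer arithmetic. The sketch below assumes the little-endian encoder wrote the BOM as the two bytes FF FE; the variable names are illustrative only.

```smalltalk
"The BOM as a little-endian UTF-16 encoder writes it: FF FE."
| bytes bigEndian littleEndian |
bytes := ByteArray with: 16rFF with: 16rFE.

"Reading the pair as big-endian (the default) gives the wrong code point."
bigEndian := ((bytes at: 1) bitShift: 8) bitOr: (bytes at: 2).

"Reading it as little-endian gives the real BOM."
littleEndian := ((bytes at: 2) bitShift: 8) bitOr: (bytes at: 1).

(bigEndian printString: 16) displayNl.     "FFFE, so peekFor: $<16rFEFF> fails"
(littleEndian printString: 16) displayNl.  "FEFF"
```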

The attached patch fixes this by making EncodedString look for a BOM when retrieving the encoding, rather than when setting it.
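
As a rough illustration of what looking for a BOM when retrieving the encoding could mean, here is a sketch that inspects the first two bytes of UTF-16 data. It is not the contents of gst-encoding-lazy.patch; the block name detect and the returned encoding names are assumptions made for the example.

```smalltalk
"Hypothetical BOM sniffing for UTF-16 data held in a ByteArray.
 Falls back to big-endian, the default mentioned above."
| detect |
detect := [:bytes |
    (bytes size >= 2 and: [(bytes at: 1) = 16rFF and: [(bytes at: 2) = 16rFE]])
        ifTrue: ['UTF-16LE']
        ifFalse: ['UTF-16BE']].

(detect value: (ByteArray with: 16rFF with: 16rFE)) displayNl.  "UTF-16LE"
(detect value: (ByteArray with: 16rFE with: 16rFF)) displayNl.  "UTF-16BE"
(detect value: (ByteArray new: 0)) displayNl.                   "UTF-16BE"
```

Deferring this check until the encoding is actually requested means it sees the bytes that are really in the string, rather than whatever was there when the encoding was first set.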

#2 submitted by Paolo Bonzini on Mon, 10/22/2007 - 09:25
Assigned to: Unassigned » bonzinip
Status: active » fixed

Fixed in patch-612, which is the same patch I posted plus this testcase:

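  "UTF-16 data holding only a big-endian BOM (FE FF) contains no characters."
  str := EncodedString fromString: (String new: 2) encoding: 'UTF-16'.
  str valueAt: 1 put: 254; valueAt: 2 put: 255.
  self assert: str numberOfCharacters = 0.
  "The same holds for a little-endian BOM (FF FE)."
  str valueAt: 1 put: 255; valueAt: 2 put: 254.
  self assert: str numberOfCharacters = 0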

Thanks!
