regex doesn't support i (ignorecase) flag

Project: GNU Smalltalk
Component: VM
Category: bug
Priority: normal
Assigned: Unassigned
Status: fixed
Attachment: latin1-re-ignorecase.patch (2.58 KB)
Description

Example:

st> ('a' =~ '(?i:A)') inspect!
An instance of Kernel.FailedMatchRegexResults

I found that this is because pre_set_casetable in lib-src/regex.c is never called. This is fixed in scompall@nocandysw.com--2007-nocandy/smalltalk--backstage--2.2--patch-62, "support (?i:...) in regexps".

st> ('a' =~ '(?i:A)') inspect!
An instance of Kernel.MatchingRegexResults

There are multiple solution paths, because case folding is charset-dependent. The patch implements #3:

  1. Always import I18N and use the locale database to determine the charset of Strings. I'm not sure what the exact semantics of this would be.
  2. Assume ASCII. regex.c already effectively assumes that strings are somewhat ASCII-compatible, and this wouldn't bias in favor of a particular ASCII superset.
  3. Assume Latin-1. This has the benefit of offering a clear behavior path to future support for matching full Unicode strings, so it's what the patch uses.
  4. Assume Latin-9. Technically this supersedes Latin-1 and so is more up to date, but it is not a codepoint-wise subset of Unicode.
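
To make the difference between options 2 and 3 concrete, here is a minimal sketch of the two downcase tables in C. The function names are hypothetical and are not taken from lib-src/regex.c or from either attached patch; the sketch only assumes that case folding is done through a 256-entry byte translation table.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not the code in lib-src/regex.c. */

/* Option 2 (ASCII): fold only 'A'..'Z' to 'a'..'z'.  Bytes 0x80..0xFF
   map to themselves, so no particular ASCII superset is favored.  */
void
build_ascii_downcase (unsigned char table[256])
{
  int c;
  for (c = 0; c < 256; c++)
    table[c] = (unsigned char) c;
  for (c = 'A'; c <= 'Z'; c++)
    table[c] = (unsigned char) (c - 'A' + 'a');
}

/* Option 3 (Latin-1): also fold the uppercase letters 0xC0..0xDE to
   0xE0..0xFE, skipping 0xD7, which is the multiplication sign.  */
void
build_latin1_downcase (unsigned char table[256])
{
  int c;
  build_ascii_downcase (table);
  for (c = 0xC0; c <= 0xDE; c++)
    if (c != 0xD7)
      table[c] = (unsigned char) (c + 0x20);
}

/* Compare two byte strings of length n through a downcase table.
   Returns 1 if they are equal after folding, 0 otherwise.  */
int
equal_folded (const unsigned char table[256],
              const unsigned char *a, const unsigned char *b, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (table[a[i]] != table[b[i]])
      return 0;
  return 1;
}
```

With the ASCII table, 'A' and 'a' compare equal but the bytes 0xC9 and 0xE9 do not; with the Latin-1 table both pairs compare equal, which is only the right answer if the string really is Latin-1.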

Updates

#1 submitted by Paolo Bonzini on Tue, 10/02/2007 - 07:21

> * Assume ASCII. regex.c already effectively assumes that strings
> are somewhat ASCII-compatible, and this wouldn't bias in favor of a
> particular ASCII superset.

I believe this is the best.

> * Assume Latin-1. This has the benefit of offering a clear
> behavior path to future support for matching full Unicode strings, so
> it's what the patch uses.

Not really, because UTF-8 is not a superset of Latin-1. All of eastern Europe, plus Greece, plus most of Africa/Asia/Australia do not use Latin-1.

If you don't mind some conflicts, I can adapt the patch you attached. Otherwise, you can do the change yourself and I'll cherrypick both changesets.

#2 submitted by Stephen Compall on Tue, 10/02/2007 - 17:19
Attachment: ascii-re-ignorecase.patch (1.39 KB)

smalltalk--backstage--2.2--patch-63, in combination with the previous patch, changes the downcase table to ASCII.

#3 submitted by Paolo Bonzini on Wed, 10/03/2007 - 07:03
Status: patch » committed

Thanks, will apply soon.

#4 submitted by Paolo Bonzini on Wed, 10/03/2007 - 11:50
Status: committed » fixed

Applied.
