regex doesn't support i (ignorecase) flag
| Project: | GNU Smalltalk |
| Component: | VM |
| Category: | bug |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | fixed |
| Attachment: | latin1-re-ignorecase.patch (2.58 KB) |
Example:
st> ('a' =~ '(?i:A)') inspect!
An instance of Kernel.FailedMatchRegexResults
This happens because pre_set_casetable in lib-src/regex.c is never called. It is fixed in scompall@nocandysw.com--2007-nocandy/smalltalk--backstage--2.2--patch-62, "support (?i:...) in regexps", after which the same example matches:
st> ('a' =~ '(?i:A)') inspect!
An instance of Kernel.MatchingRegexResults
There are several possible fixes, because case folding is charset-dependent. The patch implements the third option below (Latin-1):
- Always import I18N and use the locale database to determine the charset of Strings. I'm not sure what the exact semantics of this would be.
- Assume ASCII. regex.c already effectively assumes that strings are somewhat ASCII-compatible, and this wouldn't bias in favor of a particular ASCII superset.
- Assume Latin-1. This has the benefit of offering a clear behavior path to future support for matching full Unicode strings, so it's what the patch uses.
- Assume Latin-9. Technically this supersedes Latin-1 and so is more up to date, but it is not a codepoint-wise subset of Unicode.
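To make the difference between the ASCII and Latin-1 options concrete, here is a minimal sketch of the two 256-entry downcase tables. The function names are hypothetical and are not taken from regex.c or from either patch; the sketch only assumes that the regex engine folds case by looking each byte up in such a table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not the actual regex.c code: each fills a
   256-entry table mapping every byte to its lowercase form. */

/* ASCII only: A-Z fold to a-z, bytes 0x80-0xFF map to themselves. */
void build_ascii_downcase(unsigned char table[256])
{
    assert(table != NULL);
    for (int c = 0; c < 256; c++)
        table[c] = (unsigned char)((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
}

/* Latin-1 (ISO 8859-1): additionally folds 0xC0-0xDE to 0xE0-0xFE,
   except 0xD7, which is the multiplication sign and has no case. */
void build_latin1_downcase(unsigned char table[256])
{
    build_ascii_downcase(table);
    for (int c = 0xC0; c <= 0xDE; c++)
        if (c != 0xD7)
            table[c] = (unsigned char)(c + 0x20);
}

/* Case-insensitive comparison of two bytes through a downcase table. */
int bytes_equal_folded(const unsigned char table[256],
                       unsigned char a, unsigned char b)
{
    return table[a] == table[b];
}
```

With the ASCII table, byte 0xC9 (É in Latin-1) is left alone; with the Latin-1 table it folds to 0xE9 (é). Both tables treat 'A' and 'a' as equal.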
Updates
> * Assume ASCII. regex.c already effectively assumes that strings
> are somewhat ASCII-compatible, and this wouldn't bias in favor of a
> particular ASCII superset.
I believe this is the best.
> * Assume Latin-1. This has the benefit of offering a clear
> behavior path to future support for matching full Unicode strings, so
> it's what the patch uses.
Not really, because UTF-8 is not a superset of Latin-1. All of eastern Europe, plus Greece, plus most of Africa/Asia/Australia do not use Latin-1.
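A small sketch of why this matters, using hypothetical helper names that do not appear in regex.c: if a String actually holds UTF-8, byte-wise Latin-1 folding rewrites lead bytes of multi-byte sequences, while ASCII folding leaves them untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: fold one byte as if the data were Latin-1. */
unsigned char latin1_fold(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return c;
}

/* Hypothetical: fold one byte as ASCII; bytes >= 0x80 are unchanged. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 0x20) : c;
}

/* Apply a per-byte fold function to a buffer in place. */
void fold_bytes(unsigned char *buf, size_t len,
                unsigned char (*fold)(unsigned char))
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        buf[i] = fold(buf[i]);
}
```

For example, U+0160 (Š) is the two bytes 0xC5 0xA0 in UTF-8. Latin-1 folding turns them into 0xE5 0xA0, which is neither the UTF-8 encoding of š (0xC5 0xA1) nor a valid UTF-8 sequence at all. ASCII folding leaves 0xC5 0xA0 as it was.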
If you don't mind some conflicts, I can adapt the patch you attached. Otherwise, you can do the change yourself and I'll cherrypick both changesets.
| Attachment: | ascii-re-ignorecase.patch (1.39 KB) |
smalltalk--backstage--2.2--patch-63, in combination with the previous patch, changes the downcase table to ASCII.
| Status: | patch | » committed |
Thanks, will apply soon.
| Status: | committed | » fixed |
Applied.
