Regular expressions

This wiki page is intended to become part of GNU Smalltalk. Please do not make major changes to it unless you have a proper future changes copyright assignment on file with the FSF, or you intend to file a copyright assignment or disclaimer for your changes with the FSF.

Regular expressions, or "regexes", are a compact and efficient way to match patterns of text. If you are unfamiliar with regular expressions in general, see the regular expression section of the GNU Emacs manual, which is written for readers who have never used them.

(The above ref can be written as @ref{Regexps, Syntax of Regular Expressions, 20.5 Syntax of Regular Expressions, emacs, GNU Emacs Manual} in gst.texi.)

GNU Smalltalk supports regular expressions in the core image with methods on String.

Regex syntax in GST

The GNU Smalltalk regular expression library is derived from GNU libc, with modifications made originally for Ruby to support Perl-like syntax. GST always uses its included library, and never the one installed on your system.

Broadly speaking, these regexes support Perl 5 syntax: register groups () and repetition counts {} must be written without backslashes, while the corresponding literal characters must be escaped with a backslash. For example, '\{{1,3}' matches '{', '{{', and '{{{'; correspondingly, '(a)(\()' matches 'a(', with 'a' and '(' as the first and second register groups respectively.
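As a small illustration of these escaping rules, the following sketch runs both example patterns through the =~ message and the RegexResults queries described later on this page. The subject strings are chosen only for the example.

```smalltalk
| results |
"Unescaped {1,3} is a repetition count; the escaped \{ is a literal brace."
('{{{' =~ '\{{1,3}') match displayNl.     "displays {{{"

"Unescaped ( ) delimit register groups; the escaped \( is a literal paren."
results := 'a(' =~ '(a)(\()'.
(results at: 1) displayNl.                "displays a"
(results at: 2) displayNl.                "displays ("
```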

GST regular expressions are currently 8-bit clean, meaning they can work with any ordinary String, but do not support full Unicode, even when package I18N is loaded.

Specifying regexes

In most cases, you should specify regular expressions as ordinary strings. GST always caches compiled regexes, and even uses a special high-efficiency cache when looking up literal strings (i.e. most regexes), so that most code never needs to see the compiled Regex objects. For special cases where this caching is not good enough, send #asRegex to a string to retrieve a compiled form, which is accepted everywhere in the public API that a regex string is. You should rely on the cache until you have demonstrated that using Regex objects makes a noticeable performance difference in your code.
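The two styles can be sketched side by side as follows. The subject strings and the variable name pattern are made up for the illustration; only #asRegex and =~ come from the text.

```smalltalk
| pattern |
"Usual style: pass the pattern as a plain string and rely on the cache."
('hello world' =~ '\w+') match displayNl.      "displays hello"

"Special case: compile once with #asRegex and reuse the Regex object."
pattern := '\w+' asRegex.
('hello world' =~ pattern) match displayNl.    "displays hello"
('goodbye world' =~ pattern) match displayNl.  "displays goodbye"
```

Both forms answer the same results; the second one only avoids the cache lookup.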

Smalltalk strings have only one escape: a ' inside a string is written as ''. Backslashes in a regular expression string are therefore passed to the regex library unchanged, and a literal backslash in a regex is written simply as \\, whereas it must be written as \\\\ in a literal Emacs Lisp string.
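A minimal sketch of this point, using a made-up subject string that contains one backslash:

```smalltalk
| path |
path := 'C:\temp'.    "the backslash is an ordinary character in Smalltalk"
path size displayNl.  "displays 7"

"In the regex, \\ matches one literal backslash and (\w+) captures what follows."
((path =~ '\\(\w+)') at: 1) displayNl.   "displays temp"
```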

GST supports the regex modifiers imsx, as in Perl. You cannot append modifiers such as im to a Smalltalk string, because that is not part of Smalltalk syntax. Instead, use the inline modifier syntax. For example, '(?is:abc.)' is equivalent to '[Aa][Bb][Cc](?:.|\n)'.
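The effect of the inline modifiers can be checked with a short sketch like the one below. The subject string is an assumption made for the example: the letters ABC followed by a newline character.

```smalltalk
| subject |
"ABC followed by a newline character."
subject := 'ABC' , (String with: Character nl).

"Without modifiers the match fails, since the case differs."
(subject =~ 'abc.') matched displayNl.        "displays false"

"With (?is:...), i ignores case and s lets . match the newline."
(subject =~ '(?is:abc.)') matched displayNl.  "displays true"
```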

Using the regex messages

The methods on the compiled Regex object are private to this interface. As a public interface, GST provides methods on String, in the category `regex'. There are several methods for matching, replacing, pattern expansion, iterating over matches, and other useful things.

The fundamental operator is #searchRegex:, usually written as =~, reminiscent of Perl syntax. This method always returns a RegexResults object, which you can query for whether the regex matched, for the location (an Interval) and contents of the match and of any register groups, which are also available as a collection, and for other details. For example, here is a simple configuration file line parser:

| file config |
config := LookupTable new.
file := (File name: 'myapp.conf') readStream.
file linesDo: [:line |
    (line =~ '(\w+)\s*=\s*((?: ?\w+)+)') ifMatched: [:match |
        config at: (match at: 1) put: (match at: 2)]].
file close.
config printNl.
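To see what a single RegexResults holds, the same pattern can be applied to one made-up line. This is only a sketch: #matched, #match, #from and #to are assumed here to be the query messages on RegexResults, so check the class reference for your GST version before relying on them.

```smalltalk
| results |
results := 'width = 80' =~ '(\w+)\s*=\s*((?: ?\w+)+)'.

results matched displayNl.     "whether the regex matched at all"
results match displayNl.       "the text of the whole match"
(results at: 1) displayNl.     "first register group: width"
(results at: 2) displayNl.     "second register group: 80"

"Assumed accessors for the start and end index of the match."
results from displayNl.
results to displayNl.
```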

As with Perl, =~ will scan the entire string and answer the leftmost match if any is to be found, consuming as many characters as possible from that position. You can anchor the search with variant messages like #matchRegex:, or of course ^ and $ with their usual semantics if you prefer.
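The leftmost-match rule and the effect of anchors can be sketched as follows, with subject strings invented for the example. Only ^ and $ are shown, since their behaviour follows directly from the description above.

```smalltalk
"The leftmost match wins, and it is extended as far as possible from there."
('  aaa bbbb' =~ '\w+') match displayNl.   "displays aaa"

"Unanchored: bar is found in the middle of the subject."
('foobar' =~ 'bar') matched displayNl.     "displays true"

"Anchored with ^: bar would have to start the subject."
('foobar' =~ '^bar') matched displayNl.    "displays false"

"Anchored with $: bar must end the subject."
('foobar' =~ 'bar$') matched displayNl.    "displays true"
```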

You shouldn't modify the string while you want a particular RegexResults object matched on it to remain valid, because changes to the matched text may propagate to the RegexResults object.

(currently "will", but best to leave open)

Analogously to the Perl s operator, GST provides #replacingRegex:with:. Unlike Perl, GST employs the pattern expansion syntax of the #% message here. For example, 'The ratio is 16/9.' replacingRegex: '(\d+)/(\d+)' with: '$%1\over%2$' answers 'The ratio is $16\over9$.'. In place of the g modifier, use the #replacingAllRegex:with: message instead.
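The difference between the two replacement messages can be sketched with a subject that contains two matches. The subject and the replacement pattern '%2:%1' are made up for the illustration.

```smalltalk
"Only the first match is rewritten."
('16/9 and 4/3' replacingRegex: '(\d+)/(\d+)' with: '%2:%1') displayNl.
"displays 9:16 and 4/3"

"Every match is rewritten, like the g modifier in Perl."
('16/9 and 4/3' replacingAllRegex: '(\d+)/(\d+)' with: '%2:%1') displayNl.
"displays 9:16 and 3:4"
```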

One other interesting String message is #onOccurrencesOfRegex:do:, which invokes its do: argument, a block, on every successful match found in the receiver. Internally, every search will start at the end of the previous successful match. For example, this will print all the words in a stream:

stream contents onOccurrencesOfRegex: '\w+'
                do: [:each | each match printNl]

Notes on regex.c

This section is not intended for inclusion in the manual; it is here for notes I kept while researching these Missing manual pages.

  • The list of backslash metacharacters starts at line 2054, around "case '\\'". They are: sSdDwWbBAZzG (and <> when STRICT_PERL5, which I am fairly sure is not)
  • The handling of literal characters in the interpreter is at line 3927, around "case exactn:".