Extracting and manipulation numbers from a text string

Tagged:  •    •    •    •  

Some years ago I came across a interesting text manipulation puzzle, on a c++ site
I think. I was learning Ruby at the time so had fun with implimenting in Ruby,
I decided to redo it in gst -- just for fun.

The problem: you have a string of mixed Alpha and Numeric characters, jumbled up,

'098m03r9f80239802389f0m9KDKL3KLJDKLJm0983m890DMOm003  dlkfj33hljf4h3klhl00000002998'.

We need to extract all the numerics, sort them into numeric order, but print them with leading zeros retained. Its not as obvious as it seems BTW. :-)

This apparently simple problem introduced me to a number of useful features in gst, and Smalltalk generally:

  1. Regex ( Regular Expressions) for character search and substitution.
  2. String Tokeniser to extract data from a text line.
  3. Collections - I've used a couple of different collections in the program, without which the code would be really complex. Collections are well worth understanding !!
  4. data displaying. I've kept a few 'inspect' commands in the program as they nicely show you collections in action.

Not bad for a simple program.

Below is the complete working program, below that I'll add comments to hopefully explain whats happening. Including output from gst inspect which neatly displays the contents of the collection and arrays used.

====================== working code ==========================
| txt nums |
nums := SortedCollection new.
dict  := Dictionary new.
nindx := 0.

Transcript cr; show: '*** hiddennumbers.ws '; show: Time now printString ; cr.
txt := '098m03r9f80239802389f0m9KDKL3KLJDKLJm0983m890DMOm003  dlkfj33hljf4h3klhl00000002998'.

Transcript show: 'Original text'; cr; show: txt ; cr.

txt := txt replacingAllRegex: '\D+'  with: ' '.                     " gst code "
                                  " replace all alphas with space "
Transcript show: 'Numerics Only'; cr; show: txt ; cr.

nums := txt tokenize: ' '.           " extract numeric (string) fields from txt"

Transcript show: '** Inspect nums'; cr.
nums inspect.
nums do: [ :indx | 
            nindx := indx asNumber.
                    dict at: nindx put: indx.].        
       " build a Dictionary key = numeric, value = string (retain leading zeros) "
Transcript show: '** Inspect dict'; cr.
dict inspect.

sortDict := dict keys asSortedCollection.  " build a sorted (numeric) key collection "

Transcript show: '** Inspect sortDict'; cr.
sortDict inspect.
Transcript cr; show: '** Final result:'; cr.
sortDict do: [ :akey | 
        Transcript show: akey printString.
                                ans := dict at: akey.
        Transcript show: ' <--> '; show: ans; cr.
         ].  " use the sorted keys to extract from the non sorted dictionary "

=======================================================================

I've used '//' as easy to find comment markers, they are not Smalltalk !

| txt nums |
nums := SortedCollection new.
dict  := Dictionary new.
nindx := 0.

// define two collections, and one variable, A Dictionary is a special form of a collection

Transcript cr; show: '*** hiddennumbers.ws '; show: Time now printString ; cr.
txt := '098m03r9f80239802389f0m9KDKL3KLJDKLJm0983m890DMOm003  dlkfj33hljf4h3klhl00000002998'.

Transcript show: 'Original text'; cr; show: txt ; cr.

// Display a heading and the text to be processed, text could be read from a file.

// *** hiddennumbers.ws 9:32:39
// Original text
// 098m03r9f80239802389f0m9KDKL3KLJDKLJm0983m890DMOm003  dlkfj33hljf4h3klhl00000002998


txt := txt replacingAllRegex: '\D+'  with: ' '.                     " gst code "
                                  " replace all alphas with space "

// using regex to replace all AlphaCharacters with a space

Transcript show: 'Numerics Only'; cr; show: txt ; cr.

// Numerics Only
// 098 03 9 80239802389 0 9 3 0983 890 003 33 4 3 00000002998


nums := txt tokenize: ' '.           " extract numeric (string) fields from txt"

// extract each number into an array, tokenize: scans txt placing each 'word'
// into its own array position in 'nums', in the order found in txt. Notice 
// that each item is a string and leading zeros are retained

Transcript show: '** Inspect nums'; cr.
nums inspect.

// ** Inspect nums
// An instance of Array
//  contents: [
//     [1]: '098'
//     [2]: '03'
//     [3]: '9'
//     [4]: '80239802389'
//     [5]: '0'
//     [6]: '9'
//     [7]: '3'
//     [8]: '0983'
//     [9]: '890'
//     [10]: '003'
//     [11]: '33'
//     [12]: '4'
//     [13]: '3'
//     [14]: '00000002998'
//   ]

nums do: [ :indx | 
            nindx := indx asNumber.
                    dict at: nindx put: indx.].        
       " build a Dictionary key = numeric, value = string (retain leading zeros) "

// looping though nums array, build a dictionary with the numeric value is the key
// and the text is the value , see below.
// Notice how duplicates are removed, this is because a Dictionary only allows unique
// keys, also leading zeros are removed by the asNumber.

Transcript show: '** Inspect dict'; cr.
dict inspect.

// ** Inspect dict
// An instance of Dictionary
//   tally: 10
//   contents: [
//     3->'3'
//     98->'098'
//     9->'9'
//     0->'0'
//     80239802389->'80239802389'
//     33->'33'
//     4->'4'
//     2998->'00000002998'
//     890->'890'
//     983->'0983'
//   ]

// We now have Dictionary of the unique entries, still in the natural order

sortDict := dict keys asSortedCollection.  " build a sorted (numeric) key collection "

// Create a sorted collection of the KEYS only

Transcript show: '** Inspect sortDict'; cr.
sortDict inspect.

// ** Inspect sortDict
// An instance of SortedCollection
//   firstIndex: 1
//   lastIndex: 10
//   lastOrdered: 10
//  sorted: true
//   sortBlock: a BlockClosure
//   contents: [
//     [1]: 0
//     [2]: 3
//     [3]: 4
//     [4]: 9
//     [5]: 33
//     [6]: 98
//     [7]: 890
//     [9]: 2998
//     [10]: 80239802389
//   ]

Transcript cr; show: '** Final result:'; cr.
sortDict do: [ :akey | 
        Transcript show: akey printString.
                                ans := dict at: akey.
        Transcript show: ' <--> '; show: ans; cr.
         ].  " use the sorted keys to extract from the non sorted dictionary "

// loop thru the sorted list (sortDict), and use its values as the key into 
//the Dictionay (dict)  that holds the original text.

// ** Final result:
// 0 <--> 0
// 3 <--> 3
// 4 <--> 4
// 9 <--> 9
// 33 <--> 33
// 98 <--> 098
// 890 <--> 890
// 983 <--> 0983
// 2998 <--> 00000002998
// 80239802389 <--> 80239802389
====================================================================

Hope you found this useful

Brett

I tried this on VisualWorks, here is a simplistic script that does not do duplicate removal:

 text := '098m03r9f80239802389f0m9KDKL3KLJDKLJm0983m890DMOm003  dlkfj33hljf4h3klhl00000002998'
 .
 numericStrings := text runsSatisfying: [:character | character isDigit]
 .
 sorted := (numericStrings collect: [:str | (Compiler evaluate: str) -> str]) asSortedCollection
 .
 sorted do: [ :pair | 
       Transcript cr; print: pair]

Note that the regex transform is not needed, the collection library can handle such a simple parse on its own :-)

Neat !, but a few questions;

1) I could not find 'runsSatisfying' when I tried a search while in the VWST Browser,

  the VWST search facilitie are not very good at finding 'stuff' !!
  -- where is it found please ?

2) and what exactly does

        (Compiler evaluate: str) -> str
  do? 

  I think that this might be a bit deep into Smalltalk for me to have come across it
  or actually seen its usefulness :-)

Here's how I'd do it with a regex but with simpler code, just for symmetry:

((x allOccurrencesOfRegex: '\d+')
   collect: [ :each | each asInteger->each ]) asSortedCollection

I read attentively your blog post attentively and i seen your post is very nice and good quality. www.naturalhealthplan.org

Thanks Paolo, 

I appreaciate both examples , I learnt quite a lot from both of
them, hope other people find the code comparison useful. 

I've updated my web page to add these excellent examples of 
smalltalk usage, just follow the links via 'smalltalk' then 
gnu-smalltalk for the example code

http://web.aanet.com.au/~dragoncity/

Why do you put a blank at the beginning of each line? :-) Only code lines need that so that they become pre tags.

I misunderstood your previous instructions re: formatting !
I thought every line need a leading space !
Whenever I do preview's on this blog they seem to get stuffed up, hence why I prefer using my web site as I can use Composer to control layout of the page :-)

I'll get it sorted out by gst version 5.0.5 !

User login