Details on regular expressions as a data source

Revision as of 10:05, 17 August 2007 by S.bejakovic.gmail.com (Talk | contribs)

The message batch generator can populate fields by creating text strings that match regular expressions. A regular expression is a compact syntax for describing a certain set of strings. For example, the regular expression cat|dog describes the two strings cat and dog, while the regular expression cats? describes the two strings cat and cats. Here are some examples of the kinds of strings you can generate from regular expressions:

>cat
cat cat cat cat cat 
>cat|dog
cat dog cat dog cat 
>cats?
cat cats cat cats cat 
>'grrr*'
'grr' 'grrr' 'grrrr' 'grrrrr' 'grrrrrr' 
>(mewl)+
mewl mewlmewl mewlmewlmewl mewlmewlmewlmewl mewlmewlmewlmewlmewl 
>hot{3,4}
hottt hotttt hottt hotttt hottt 
>[a-z]
a b c d e 
>[0-4]
0 1 2 3 4 
>[ac-e]
a c d e a 
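
The sample outputs above can be sanity-checked with any standard regex engine. The following is a minimal sketch using Python's re module; this is an assumption made purely for illustration, since the generator is not written in Python and its regex dialect may differ in details. The helper name all_match is hypothetical.

```python
import re

# Patterns and sample outputs copied from the examples above.
EXAMPLES = {
    r"cat|dog": ["cat", "dog"],
    r"cats?": ["cat", "cats"],
    r"'grrr*'": ["'grr'", "'grrr'", "'grrrrrr'"],
    r"(mewl)+": ["mewl", "mewlmewl", "mewlmewlmewl"],
    r"hot{3,4}": ["hottt", "hotttt"],
    r"[ac-e]": ["a", "c", "d", "e"],
}

def all_match(pattern, samples):
    """Return True if every sample is matched in full by the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in samples)

for pattern, samples in EXAMPLES.items():
    print(pattern, all_match(pattern, samples))
```

Using fullmatch rather than search matters here: a generated string must be described by the whole expression, not merely contain a matching substring.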

The preference page at Window->Preferences->OHF H3ET->Batch Generator->Regex Batch Data Source sets several required options:


[Screenshot: RegexPreferences.PNG]


The 'Regex choice strategy' options determine the order in which strings are generated from the regular expression; that is, how the generator behaves when it encounters a choice in a regular expression, such as alternation ('|') or a quantifier (such as '*' or '{2,3}'). If 'Random' is selected, each choice is made randomly. If 'Increasing' is selected, the first string generated from the regular expression takes the first available option, the second string takes the second option, and so on; after the last option has been used, the generator starts over from the first. 'Decreasing' works like 'Increasing' but starts at the last available option and works backwards. Here is a sample of the different behaviors, using the same regular expression to generate 5 strings:

Random:
>a|b|c|d
b d a c d

Increasing:
>a|b|c|d
a b c d a 

Decreasing:
>a|b|c|d
d c b a d
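
The three strategies can be sketched for a single choice point as follows. This is an illustrative Python model, not the generator's actual implementation: the function name choose and its parameters are hypothetical, and the sketch makes no claim about how the generator combines several choice points within one expression.

```python
import random
from itertools import cycle

def choose(options, strategy, count, seed=None):
    """Pick `count` items from `options` following the named strategy."""
    if strategy == "Random":
        rng = random.Random(seed)
        return [rng.choice(options) for _ in range(count)]
    if strategy == "Increasing":
        source = cycle(options)            # first option first, then wrap around
    elif strategy == "Decreasing":
        source = cycle(reversed(options))  # last option first, then wrap around
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [next(source) for _ in range(count)]

options = ["a", "b", "c", "d"]  # the alternatives of a|b|c|d
print(" ".join(choose(options, "Increasing", 5)))  # a b c d a
print(" ".join(choose(options, "Decreasing", 5)))  # d c b a d
print(" ".join(choose(options, "Random", 5)))      # varies from run to run
```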

The second option on the preference page, 'Upper bound for infinite closures', limits the number of repetitions produced by unbounded quantifiers such as '*' or '+', since they could otherwise generate arbitrarily long strings. For example, with an upper bound of 3, the longest string that can be generated from the regular expression ab* is abbb.
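
As a rough illustration of the bound, the sketch below lists every string that a closure can yield once its repetition count is capped. It is a Python model under stated assumptions: expand_closure is a hypothetical helper, and the behavior shown for '+' (same cap, minimum of one repetition) is inferred by analogy with the ab* example rather than confirmed by the text.

```python
def expand_closure(prefix, unit, minimum, upper_bound):
    """List `prefix` followed by `unit` repeated minimum..upper_bound times."""
    return [prefix + unit * n for n in range(minimum, upper_bound + 1)]

print(expand_closure("a", "b", 0, 3))  # ab* with bound 3: ['a', 'ab', 'abb', 'abbb']
print(expand_closure("a", "b", 1, 3))  # ab+ with bound 3: ['ab', 'abb', 'abbb']
```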

The third option on the preference page, 'Alphabet character class', allows the user to set the default alphabet. The alphabet comes into play when using the any-character operator '.' and when using negated character classes such as [^abc]. In ordinary regular expression usage, '.' represents any character. However, when regular expressions are used to generate strings, not all characters are necessarily desirable (for example, space or punctuation characters, or characters from other alphabets that cannot be displayed properly). You can therefore either specify a character class (without the []), from which '.' will draw its characters, or leave this field blank, in which case all Unicode characters are used as the alphabet.
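
One way to picture the alphabet setting is as the pool of characters from which '.' and negated classes draw. The Python sketch below is an assumption-laden model, not the generator's code: expand_class and negate_class are hypothetical helpers, the alphabet value a-f is an invented example, and escapes inside the class are not handled.

```python
def expand_class(spec):
    """Expand a class body such as 'a-c0-2_' into a sorted list of characters."""
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # A range such as a-f: include both endpoints.
            chars.update(chr(c) for c in range(ord(spec[i]), ord(spec[i + 2]) + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return sorted(chars)

def negate_class(spec, alphabet_spec):
    """Characters of the alphabet that are not in the class `spec`."""
    excluded = set(expand_class(spec))
    return [c for c in expand_class(alphabet_spec) if c not in excluded]

ALPHABET = "a-f"  # hypothetical preference value, written without the []
print(expand_class(ALPHABET))         # candidates for '.'
print(negate_class("abc", ALPHABET))  # candidates for [^abc]
```

With this alphabet, '.' would be limited to the six letters a through f, and [^abc] to d, e, and f.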

Finally, when regular expressions are used as a source of data, the message batch generator requires that the total number of files be limited. That is, on the final page of the wizard, you must select a maximum number of files to create instead of opting to use the entire data source.


Reference

The generator supports the following special operators for generating sample strings:

Expression Description
. any character
() groups the expressions inside the parentheses
? 0 or 1 of the preceding expression
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
{n} exactly n of the preceding expression
{n,} n or more of the preceding expression
{n,m} between n and m of the preceding expression (n must not be greater than m)
[xyz] any character inside the brackets
[a-n] any character in the range
[^a-n123] any character not in the range, and not 1, 2, or 3
\ treat whatever comes next as an ordinary character, and not as a special operator