N.B. This page is not yet complete, but it is still informative.

Regular expressions (regexp for short) are a powerful tool that programmers use to search and parse text in flexible ways. Wikipedia has an article describing their general use, history, etc., in more depth. For an exhaustive guide, check out www.regular-expressions.info.

This is intended as an introductory guide for non-programmers who may have the opportunity to make use of regular expression syntax in text-searching applications. One such application is the Unix command-line utility grep (the name comes from "global regular expression print"). Be aware that the syntax used here is actually PCRE, for "Perl Compatible Regular Expressions", which is slightly newer and more widespread (especially in web applications) than the POSIX syntax that grep uses by default. The two differ most notably in the way character classes are indicated.

Here's an example of how a simple regexp can be used to filter out unwanted data in searches: let's say you want to find pages which contain the word "car", but not "cartoon", "incarnation", etc.

\bcars?\b

This will match car (or cars) only as a discrete word. Some big, powerful search engines such as Google have their own (perhaps imperfect) ways of making such distinctions automatically, but they do not allow for the flexibility of regular expressions.

\b indicates a word boundary. What is a word boundary? It is the position between a word character (a letter, digit or underscore) and anything else, such as whitespace, punctuation, or the start or end of the text. So you will also get hits for "car-boat" (car-boats have been known to make occasional appearances in old James Bond movies). If you left the second \b off, you'd also get carping, but still not incarcerate. The ? indicates that the previous character may or may not be present.
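To try this out, here is a minimal sketch in Python, whose built-in re module treats \b and ? the same way as described here. The word list and the variable names are made up for this illustration.

```python
import re

# Sample words invented for this illustration.
words = ["car", "cars", "cartoon", "incarnation",
         "car-boat", "carping", "incarcerate"]

# \b on both sides: "car" or "cars" must stand alone as a word.
whole_word = re.compile(r"\bcars?\b")

# Second \b left off: anything that merely starts with "car" also counts.
prefix_only = re.compile(r"\bcars?")

whole_hits = [w for w in words if whole_word.search(w)]
prefix_hits = [w for w in words if prefix_only.search(w)]

print(whole_hits)
print(prefix_hits)
```

The first list keeps car-boat because the hyphen is not a word character, so there is a boundary right after "car". The second list additionally picks up cartoon and carping, while incarnation and incarcerate are still rejected by the leading \b.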

? can be useful if you are uncertain of how to spell a proper noun, or suspect a word might be misspelled in the data you are searching. For example:

Joh?n

Would match Jon or John. A couple of other useful, related special characters used in regexp syntax are + and *. + indicates the previous character must occur at least once and may be repeated any number of times. For example:

Svens+on

Would match Svenson and Svensson (and "Svensssson"). It will not, however, match Svenon, with no 's'. For that you need *:

ka*boo+m!

Will match kaaaaboom! and kboooom!, because * allows the previous character to occur any number of times, including not at all. *, + and ? are called quantifiers. Quantifiers can also be applied to groups of characters if you enclose them in parentheses:

Sven(gali)?\b

Will match Sven and Svengali but not "Svensson". Another thing you can do here is to use |, which means "or":

\bsail(ed|ing)?\b

Matches sail, sailing and sailed, but not "sailboat". | can be used outside of parentheses:

Sven|Ralph

Matches, you guessed it, Sven or Ralph. So here's a way you could find "Svengali Ralph Svensson" when you expect both the nickname and the first name to be present, but are not sure what order they might be in:

(Svengali|Ralph) (Svengali|Ralph) Svensson

Also matches Ralph Svengali Svensson (and, since each slot accepts either name, Ralph Ralph Svensson as well). There is a slightly easier way to do that:

((Svengali|Ralph) ){2}

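{2} means match the previous character or group exactly twice. Here the group is a name followed by a space, so this stands in for the first two parts of the longer pattern; Svensson would still follow. Notice, this is not the same as: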

(Svengali|Ralph){2}

Which will match SvengaliRalph, but will not match Svengali Ralph. {2,5} will match anything from two to five repetitions. {2,} matches at least (as opposed to exactly) twice.

It's actually better not to use plain spaces in regexps, because a plain space does not match other characters which appear as whitespace, such as tabs and "newlines" (line feeds). This matters especially in web documents, which display newlines as spaces. Another issue with web documents is that a single space, like the one in the above regexp, will not match if there are actually several spaces in the source. Because browsers flatten runs of whitespace into a single space, there is less concern with it when documents are prepared. For example, if you look at the "page source" for this, you will notice there are two newlines and a number of spaces and tabs between these two words: here there.

Applications like search engines will probably account for this, but if you can use regexps and you want to be certain, use \s+ to indicate whitespace. \s matches any kind of whitespace; + again means one or more.

((Svengali|Ralph)\s+){2}
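The patterns in this guide can be checked with a short script. The following is a minimal sketch in Python; its re module is not PCRE itself, but it handles the quantifiers, groups, | and \s used here in the same way. The table layout and variable names are only for illustration, and the sample text containing a tab and a newline is invented.

```python
import re

# Each entry: (pattern, sample text, whether a match is expected).
checks = [
    # ? means zero or one of the previous character.
    (r"Joh?n", "Jon", True),
    (r"Joh?n", "John", True),
    # + means one or more, * means zero or more.
    (r"Svens+on", "Svensssson", True),
    (r"Svens+on", "Svenon", False),
    (r"ka*boo+m!", "kboooom!", True),
    # Quantifiers and | applied to a group in parentheses.
    (r"\bsail(ed|ing)?\b", "sailing", True),
    (r"\bsail(ed|ing)?\b", "sailboat", False),
    # {2} repeats the whole group, including the space inside it.
    (r"((Svengali|Ralph) ){2}", "Ralph Svengali Svensson", True),
    (r"(Svengali|Ralph){2}", "SvengaliRalph", True),
    (r"(Svengali|Ralph){2}", "Svengali Ralph", False),
    # Invented sample: a tab and a newline separate the first two names.
    (r"((Svengali|Ralph) ){2}", "Svengali\t\nRalph Svensson", False),
    (r"((Svengali|Ralph)\s+){2}", "Svengali\t\nRalph Svensson", True),
]

# re.search looks for the pattern anywhere in the text.
results = [re.search(pattern, text) is not None
           for pattern, text, _ in checks]

for (pattern, text, expected), found in zip(checks, results):
    print(f"{pattern!r:32} {text!r:32} match={found} (expected {expected})")
```

The last two rows show the point about whitespace: the version with a plain space fails on the tab and newline, while the \s+ version accepts them.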