Introducing Regular Expressions (Part 1) - Getting Started
It's usually considered a compliment to say that someone is a regular guy, but what about a regular expression? Yep, they're good too, way cool. OK, they're not really cool, but they are very useful! In a nutshell, a regular expression (sometimes called regexp or regex) is a template for identifying and manipulating text. While this description is accurate, it does not hint at the power and flexibility of regular expressions, which are in many ways a programming language in their own right.
Most of us have at one time or another used wildcards when working with file names, and these provide a hint of what regular expressions are all about. In filename patterns, the character * represents any sequence of 0 or more characters and the character ? represents any single character. Therefore, the template S*.DOC matches the file names SALES.DOC, STEVEN.DOC, S22.DOC, and so on, while the template B?G.TXT matches BIG.TXT, BOG.TXT, and BAG.TXT, among others (but not BRAG.TXT, since the ? character matches only a single character). Although the * and ? characters mean something different in regular expressions than they do in filename patterns, wildcards give a faint hint of what regular expressions can do.
This is the first in a 2-article series about regular expressions. Today I cover regular expressions themselves: their syntax and structure. In the second article I will provide some information on the various .NET Framework components that put the power of regular expressions to work.
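The filename patterns described above can be tried out directly. The following is a minimal sketch in Python, used here purely for illustration since the article itself targets .NET. It relies on the standard-library function fnmatchcase, and the list of names is simply the set of example file names from the text.

```python
from fnmatch import fnmatchcase

# Example file names taken from the text above.
names = ["SALES.DOC", "STEVEN.DOC", "S22.DOC",
         "BIG.TXT", "BOG.TXT", "BAG.TXT", "BRAG.TXT"]

# In a filename pattern, * stands for any run of 0 or more characters.
doc_matches = [n for n in names if fnmatchcase(n, "S*.DOC")]

# In a filename pattern, ? stands for exactly one character.
txt_matches = [n for n in names if fnmatchcase(n, "B?G.TXT")]

print(doc_matches)  # ['SALES.DOC', 'STEVEN.DOC', 'S22.DOC']
print(txt_matches)  # ['BIG.TXT', 'BOG.TXT', 'BAG.TXT']
```

BRAG.TXT is left out of the second list because two characters sit between the B and the G.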
Regular expressions are certainly not new with .NET. They have been around since the 1950s and have been supported on UNIX platforms for decades. Windows was slow to incorporate support for regular expressions, but with the introduction of .NET, Microsoft has finally provided complete and robust regular expression support for developers.
Regular Expression Syntax
A regular expression is a string of one or more characters. Each character in the expression is one of two types:
An ordinary character, such as the letters of the alphabet or the digits 0-9, that matches itself.
A metacharacter that has a special meaning. The metacharacters that control repetition, such as * and +, are often called quantifiers.
For example, the regular expression "so" contains two ordinary characters and will match the string "so" and nothing else. The * (asterisk) character is a metacharacter and, unlike in filename patterns, means "match the previous character 0 or more times." Thus, the regular expression "so*" will match "s", "so", "soo", "sooo", and so on. (The * applies only to the o, not the s.) Table 1 lists the metacharacters that can be used in regular expressions. I have omitted a few of the more obscure ones; you can find details in the Visual Studio documentation.
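As a quick check of the "so*" example, here is a small sketch using Python's re module (an assumption made for illustration; the .NET classes are the subject of the second article). The last two candidate strings are made up to show cases that do not match.

```python
import re

# 's' followed by zero or more 'o' characters.
pattern = re.compile(r"so*")

candidates = ["s", "so", "soo", "sooo", "o", "sos"]

# fullmatch succeeds only if the entire string fits the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

The strings "s" through "sooo" match in full, while "o" (no leading s) and "sos" (an extra character after the run of o's) do not.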
Table 1. Regular expression metacharacters.
Metacharacter   Description
\               Indicates that the next character has special meaning (as will be discussed soon).
^               Matches the start of the input string or, for multi-line input strings, the start of a line. Thus, "^Microsoft" matches the text "Microsoft" only at the start of the string being searched.
$               Matches the end of the input string or, for multi-line input strings, the end of a line. Thus, "Microsoft$" matches the text "Microsoft" only at the end of the string being searched.
*               Matches the preceding character or subexpression 0 or more times.
+               Matches the preceding character or subexpression 1 or more times.
?               Matches the preceding character or subexpression 0 or 1 times.
{n}             Matches the preceding character or subexpression exactly n times. n must be a non-negative integer.
{n,}            Matches the preceding character or subexpression n or more times. n must be a non-negative integer.
{n,m}           Matches the preceding character or subexpression between n and m times, inclusive. n and m must be non-negative integers with n<=m.
. (period)      Matches any single character except a newline (\n). (This is just like the ? in filename patterns.)
[]              Used to enclose character classes (as described in the text).
()              Grouping. Use to define a part of a regular expression and then apply an operator, such as a repetition operator, to the entire group.
|               Or. When placed between two regular expressions, the resulting compound regular expression will match either of the constituent regular expressions.
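Several of the metacharacters in Table 1 can be exercised with a short sketch. It uses Python's re module as a stand-in for illustration; the basic metacharacters shown here behave the same way in .NET as far as I know, but that is an assumption worth verifying. The helper name found and the sample strings other than "Microsoft" are hypothetical.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

checks = {
    # ^ anchors the match to the start of the string.
    "start_hit": found(r"^Microsoft", "Microsoft Windows"),
    "start_miss": found(r"^Microsoft", "I use Microsoft"),
    # $ anchors the match to the end of the string.
    "end_hit": found(r"Microsoft$", "I use Microsoft"),
    "end_miss": found(r"Microsoft$", "Microsoft Windows"),
    # + requires at least one occurrence, so "s" alone fails.
    "plus_miss": found(r"^so+$", "s"),
    # ? makes the preceding character optional.
    "optional_hit": found(r"^colou?r$", "color"),
    # . matches any single character except a newline.
    "dot_hit": found(r"^b.g$", "bag"),
    "dot_newline_miss": found(r"^b.g$", "b\ng"),
    # | chooses between alternatives; () groups them.
    "alternation_hit": found(r"^(cat|dog)$", "dog"),
}

for name, outcome in checks.items():
    print(name, outcome)
```

Each key ending in _hit comes out True and each key ending in _miss comes out False.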
Table 2 shows a few examples. These are fairly simple regular expressions, but they are a good place to start.
Table 2. Some examples.
Regular Expression  Examples that match         Examples that do not match
be{2}t              beet                        bet, beeet
be{2,3}t            beet, beeet                 bet, beeeet
be{2,}t             beet, beeet, beeeet, etc.   bet
^Net is great!      Net is great!               I like Net
(be{2}t)|(bo{2}t)   beet, boot                  Anything else
(go){2,3}           gogo, gogogo                go, gogogogo
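The rows of Table 2 can be checked mechanically. This is a sketch under two assumptions: it uses Python's re module rather than .NET, and it reads the table as asking whether the whole string fits the pattern, which is what fullmatch tests. The function name agrees and the string "bout" (standing in for "anything else") are made up for the example.

```python
import re

# (pattern, strings that should match, strings that should not match)
table2 = [
    (r"be{2}t", ["beet"], ["bet", "beeet"]),
    (r"be{2,3}t", ["beet", "beeet"], ["bet", "beeeet"]),
    (r"be{2,}t", ["beet", "beeet", "beeeet"], ["bet"]),
    (r"(be{2}t)|(bo{2}t)", ["beet", "boot"], ["bet", "bout"]),
    (r"(go){2,3}", ["gogo", "gogogo"], ["go", "gogogogo"]),
]

def agrees(pattern, should_match, should_not_match):
    """True if the pattern accepts and rejects exactly as the row claims."""
    hits_ok = all(re.fullmatch(pattern, s) for s in should_match)
    misses_ok = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return hits_ok and misses_ok

results = [agrees(*row) for row in table2]
print(results)  # [True, True, True, True, True]
```

The whole-string reading matters for the last row: a search for (go){2,3} anywhere inside "gogogogo" would succeed, because that string contains "gogogo".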
Carriage Returns and Other Odd Characters
Several of the characters that regular expressions can deal with may not be familiar to some readers. They date from the days when the only hard-copy output device was a teletype, an electric typewriter-like contraption. These control characters reflect the way teletypes worked:
Newline (hexadecimal 0A) advances the paper one line. Sometimes called a linefeed.
Carriage return (hexadecimal 0D) moves the print head to the left edge of the paper.
Formfeed (hexadecimal 0C) advances the paper to the top of the next page.
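The hexadecimal values listed above are easy to confirm. The sketch below uses Python for illustration only; the escape sequences \n, \r, and \f are written the same way in many languages and regular expression dialects, but check the documentation for the one you use.

```python
import re

# The three control characters, written as escape sequences.
control_chars = {"newline": "\n", "carriage return": "\r", "formfeed": "\f"}

# Convert each character's code to two-digit hexadecimal.
codes = {name: format(ord(ch), "02X") for name, ch in control_chars.items()}
print(codes)  # {'newline': '0A', 'carriage return': '0D', 'formfeed': '0C'}

# The same escapes can appear inside a regular expression.
# Here the pattern matches a carriage return followed by a newline.
crlf_found = re.search(r"\r\n", "first line\r\nsecond line") is not None
print(crlf_found)  # True
```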
Linefeed (LF) and carriage return (CR) are still used, while formfeed is pretty rare these days. Some confusion results from the fact that different platforms have different ways of marking the end of a line:
LF alone: Unix and Unix-derived platforms including Mac OS X.