.NET 1.1+
Regular Expression Character Classes
The fourth part of the Regular Expressions in .NET tutorial continues to look at the characters used within a regular expression. This article describes character classes, which allow the creation of patterns that are not restricted to matching only literal characters.
Character Ranges
Character groups become difficult to understand when they contain too many acceptable characters. For example, if you wanted to be able to match any lower case letter, the character group, including the brackets, would be 28 characters long. In situations where the characters to match are contiguous, you can use a character range instead.
With a character range you specify the first and last acceptable characters within the brackets, separating the two with a hyphen. For example, to match any lower case letter you could use "[a-z]". For numeric digits, you would use "[0-9]".
The following example extracts telephone numbers from a source string using character ranges. The numbers must be formatted into a group of five digits, followed by a space and six further digits.
string input = "Bob: 01234 567890, Mel: 01234 987654, Sue: 01432 123001";
string pattern = "[0-9][0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9]";
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}
/* OUTPUT
Matched '01234 567890' at index 5
Matched '01234 987654' at index 24
Matched '01432 123001' at index 43
*/
NB: As with character groups, character ranges can be negated by placing the caret symbol immediately after the opening bracket. To match any non-numeric character you could use "[^0-9]".
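As a rough illustration of a negated range, the sketch below strips everything except digits from a string by replacing each match of "[^0-9]" with an empty string. The NegatedRangeDemo class and DigitsOnly helper are hypothetical names used only for this example, and Regex.Replace is used here purely for convenience.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedRangeDemo
{
    // Hypothetical helper: removes every character that is NOT in the range 0-9.
    public static string DigitsOnly(string input)
    {
        return Regex.Replace(input, "[^0-9]", "");
    }

    public static void Main()
    {
        // The letters, colon and spaces all match "[^0-9]" and are removed.
        Console.WriteLine(DigitsOnly("Bob: 01234 567890")); // 01234567890
    }
}
```

Note that the caret only negates when it is the first character inside the brackets.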
Combining Character Classes
Character ranges can be combined with character groups to match a wider range of symbols. To combine them, you include all possible options within the brackets. For example, to match any alphanumeric character you can combine the lower and upper case letter ranges with a digit range as "[a-zA-Z0-9]". If you also wanted to match a full stop or comma, you could add these as individual characters with "[a-zA-Z0-9.,]".
To demonstrate, try the code below. This extracts the prices from a list of optional extras for a car. The regular expression used is somewhat over-complicated. It could be both simplified and enhanced using techniques that we have not yet seen in the tutorial.
string input = @"Option List

Metallic Paint        250.00
Alloy Wheels        1,050.00
Rear Spoiler          325.00
Leather Interior    2,350.00
Tinted Windows         99.00";
// "[.]" matches a literal full stop; a bare "." would match any character.
string pattern = "[0-9 ][, ][0-9 ][0-9][0-9][.][0-9][0-9]";
foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}
/* OUTPUT (indexes assume the source string uses Windows CR LF line endings)
Matched '  250.00' at index 35
Matched '1,050.00' at index 65
Matched '  325.00' at index 95
Matched '2,350.00' at index 125
Matched '   99.00' at index 155
*/
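The combined class "[a-zA-Z0-9.,]" described earlier can also be tried in isolation. The following is a minimal sketch that counts how many characters of a sample string each class matches; the CombinedClassDemo class, the CountMatches helper and the sample text are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedClassDemo
{
    // Hypothetical helper: counts the matches of a pattern in the input.
    // Each class below matches exactly one character, so this counts characters.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "Total: 1,050.00!";

        // Letters and digits only: 5 letters + 6 digits.
        Console.WriteLine(CountMatches(input, "[a-zA-Z0-9]"));   // 11

        // Adding the full stop and comma as individual characters.
        Console.WriteLine(CountMatches(input, "[a-zA-Z0-9.,]")); // 13
    }
}
```

Inside the brackets the full stop is treated as a literal character, so it does not need to be escaped there.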
Shorthand Character Classes
Some common character classes are used so often that the regular expression language includes shorthand versions. These can make the patterns easier to read and comprehend quickly. There are six key shorthand options available. They are:
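- \d. Matches any numeric digit. For the digits 0 to 9 this code is shorthand for "[0-9]". By default .NET also matches decimal digits from other Unicode scripts; specify RegexOptions.ECMAScript to restrict it to "[0-9]" exactly.
- \D. The negated version of \d. Matches any non-numeric character. With the same caveat, this code is shorthand for "[^0-9]".
- \w. Matches any word character. A word character is any letter, numeric digit or underscore character. For unaccented English text this code is shorthand for "[a-zA-Z0-9_]". By default .NET also matches letters and digits from other Unicode scripts, such as "é", unless RegexOptions.ECMAScript is specified.
- \W. The negated version of \w. Matches any non-word character. With the same caveat, this code is shorthand for "[^a-zA-Z0-9_]".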
- \s. Matches any white space character, including spaces, tabs, carriage returns and line feeds.
- \S. The negated version of \s. Matches any non-white space character.
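The six shorthand codes can be compared side by side with a small sketch like the one below, which counts how many characters of a sample string each code matches. The ShorthandDemo class, the Count helper and the sample text are hypothetical. The last two lines illustrate a point that is easy to overlook: by default \d in .NET matches any Unicode decimal digit, here the Arabic-Indic digit three (U+0663), whereas RegexOptions.ECMAScript limits it to 0-9.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // Hypothetical helper: counts the matches of a single-character pattern.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Nine characters: I d _ 7 space o k tab !
        string input = "Id_7 ok\t!";

        Console.WriteLine(Count(input, @"\d")); // 1 (the digit 7)
        Console.WriteLine(Count(input, @"\D")); // 8
        Console.WriteLine(Count(input, @"\w")); // 6 (I, d, _, 7, o, k)
        Console.WriteLine(Count(input, @"\W")); // 3 (space, tab, !)
        Console.WriteLine(Count(input, @"\s")); // 2 (space, tab)
        Console.WriteLine(Count(input, @"\S")); // 7

        // U+0663 is a decimal digit in Unicode but is not in the range 0-9.
        Console.WriteLine(Regex.IsMatch("\u0663", @"\d"));                          // True
        Console.WriteLine(Regex.IsMatch("\u0663", @"\d", RegexOptions.ECMAScript)); // False
    }
}
```

Each shorthand code and its negated version together account for every character, which is why each pair of counts adds up to nine.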
The final sample code recreates the telephone number example from earlier in the article. This time the code uses shorthand instead of character ranges.
string input = "Bob: 01234 567890, Mel: 01234 987654, Sue: 01432 123001";
foreach (Match match in Regex.Matches(input, @"\d\d\d\d\d \d\d\d\d\d\d"))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}
/* OUTPUT
Matched '01234 567890' at index 5
Matched '01234 987654' at index 24
Matched '01432 123001' at index 43
*/
13 September 2015