BlackWasp™

Regular Expressions
.NET 1.1+

Regular Expression Character Classes

The fourth part of the Regular Expressions in .NET tutorial continues to look at the characters used within a regular expression. This article describes character classes, which allow the creation of patterns that are not restricted to matching only literal characters.

Character Ranges

Character groups become difficult to understand when they contain too many acceptable characters. For example, if you wanted to be able to match any lower case letter, the character group, including the brackets, would be 28 characters long. In situations where the characters to match are contiguous, you can use a character range instead.

With a character range you specify the first and last acceptable characters within the brackets, separating the two with a hyphen. For example, to match any lower case letter you could use "[a-z]". To match any numeric digit, you would use "[0-9]".

The following example extracts telephone numbers from a source string using character ranges. The numbers must be formatted into a group of five digits, followed by a space and six further digits.

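using System;
using System.Text.RegularExpressions;

string input = "Bob: 01234 567890, Mel: 01234 987654, Sue: 01432 123001";
// Five digits, a space, then six digits, with each digit written as a [0-9] range.
string pattern = "[0-9][0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9]";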

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}

/* OUTPUT
  
Matched '01234 567890' at index 5
Matched '01234 987654' at index 24
Matched '01432 123001' at index 43
            
*/

NB: As with character groups, character ranges can be negated using the caret symbol. To match any non-numeric character you could use "[^0-9]".
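The negated range from the note above can be tried with a short sketch such as the following. The class and method names (NegatedRangeDemo, NonDigits) are hypothetical and chosen only for this illustration; the code simply collects every character that "[^0-9]" matches.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NegatedRangeDemo
{
    // Hypothetical helper: returns, in order, every character of the input
    // that is matched by the negated range [^0-9] (anything except 0 to 9).
    public static string NonDigits(string input)
    {
        StringBuilder result = new StringBuilder();

        foreach (Match match in Regex.Matches(input, "[^0-9]"))
        {
            result.Append(match.Value);
        }

        return result.ToString();
    }
}
```

Under these assumptions, calling NegatedRangeDemo.NonDigits("A1-B22") should return "A-B", because the three digits are the only characters that the negated range rejects.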

Combining Character Classes

Character ranges can be combined with character groups to match a wider range of symbols. To combine them, you include all possible options within the brackets. For example, to match any alphanumeric character you can combine the lower and upper case letter ranges with a digit range as "[a-zA-Z0-9]". If you also want to match a full stop or comma, you can add these as individual characters with "[a-zA-Z0-9.,]".

To demonstrate, try the code below. This extracts the prices from a list of optional extras for a car. The regular expression used is somewhat over-complicated. It could be both simplified and enhanced using techniques that we have not yet seen in the tutorial.

using System;
using System.Text.RegularExpressions;

string input = @"Option List

Metallic Paint        250.00
Alloy Wheels        1,050.00
Rear Spoiler          325.00
Leather Interior    2,350.00
Tinted Windows         99.00";

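// The full stop is placed in its own character group so that it matches only a
// literal "." and does not act as the "any character" wildcard.
string pattern = "[0-9 ][, ][0-9 ][0-9][0-9][.][0-9][0-9]";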

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}

/* OUTPUT
  
Matched '  250.00' at index 35
Matched '1,050.00' at index 65
Matched '  325.00' at index 95
Matched '2,350.00' at index 125
Matched '   99.00' at index 155
            
*/
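To see a combined class in isolation, the sketch below applies "[a-zA-Z0-9.,]" and its negated form to the same text. It is only an illustration: the class name CombinedClassDemo and the methods Accepted and Rejected are hypothetical names that do not appear elsewhere in the tutorial.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CombinedClassDemo
{
    // Two letter ranges, a digit range and two individual characters in one class.
    private const string AcceptedPattern = "[a-zA-Z0-9.,]";

    // The same class negated with a caret.
    private const string RejectedPattern = "[^a-zA-Z0-9.,]";

    // Hypothetical helper: the characters of the input that the class matches.
    public static string Accepted(string input)
    {
        return JoinMatches(input, AcceptedPattern);
    }

    // Hypothetical helper: the characters of the input that the class does not match.
    public static string Rejected(string input)
    {
        return JoinMatches(input, RejectedPattern);
    }

    // Concatenates every single-character match in the order it was found.
    private static string JoinMatches(string input, string pattern)
    {
        StringBuilder result = new StringBuilder();

        foreach (Match match in Regex.Matches(input, pattern))
        {
            result.Append(match.Value);
        }

        return result.ToString();
    }
}
```

With the input "Hi, Bob. #42!", Accepted should return "Hi,Bob.42", and Rejected should return the two spaces followed by "#!". Within the brackets the full stop is treated as a literal character, so it needs no special handling.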

Shorthand Character Classes

Some common character classes are used so often that the regular expression language includes shorthand versions. These can make the patterns easier to read and comprehend quickly. There are six key shorthand options available. They are:

  • \d. Matches any numeric digit. For ASCII text this is equivalent to "[0-9]". By default, .NET also matches decimal digits from other Unicode scripts.
  • \D. The negated version of \d. Matches any non-numeric character. For ASCII text this is equivalent to "[^0-9]".
  • \w. Matches any word character. A word character is any letter, numeric digit or underscore character. For ASCII text this is equivalent to "[a-zA-Z0-9_]". By default, .NET also matches Unicode letters and digits, such as accented characters.
  • \W. The negated version of \w. Matches any non-word character. For ASCII text this is equivalent to "[^a-zA-Z0-9_]".
  • \s. Matches any white space character, including spaces, tabs, carriage returns and line feeds.
  • \S. The negated version of \s. Matches any non-white space character.
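The six codes listed above can be compared side by side with a small sketch like the one below. The class ShorthandDemo and its methods are hypothetical names used only here. The second method also shows the Unicode behaviour of \d, using the RegexOptions.ECMAScript option to restrict it to the digits 0 to 9.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // Hypothetical helper: counts the matches of a pattern in the input.
    // With a single-character pattern this is the number of matching characters.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    // Hypothetical helper: reports whether \d matches anywhere in the text
    // when the regular expression is run with the given options.
    public static bool ContainsDigit(string text, RegexOptions options)
    {
        return Regex.IsMatch(text, @"\d", options);
    }
}
```

For the six-character input "a1 _\t!" (a letter, a digit, a space, an underscore, a tab and an exclamation mark), Count should give 1 for @"\d", 5 for @"\D", 3 for @"\w", 3 for @"\W", 2 for @"\s" and 4 for @"\S". ContainsDigit("\u0663", RegexOptions.None) should return true, because the Arabic-Indic digit three is a Unicode decimal digit, whereas the same call with RegexOptions.ECMAScript should return false.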

The final sample code recreates the telephone number example from earlier in the article. This time the code uses shorthand instead of character ranges.

using System;
using System.Text.RegularExpressions;

string input = "Bob: 01234 567890, Mel: 01234 987654, Sue: 01432 123001";

foreach (Match match in Regex.Matches(input, @"\d\d\d\d\d \d\d\d\d\d\d"))
{
    Console.WriteLine("Matched '{0}' at index {1}", match.Value, match.Index);
}

/* OUTPUT
  
Matched '01234 567890' at index 5
Matched '01234 987654' at index 24
Matched '01432 123001' at index 43
            
*/
13 September 2015