BlackWaspTM

Regular Expressions
.NET 1.1+

Regular Expression Grouping

The eighth part of the Regular Expressions in .NET tutorial examines grouping constructs and their use in the .NET regular expressions engine. Grouping allows a regular expression to include multiple subexpressions.

Grouping Constructs

In an earlier article in the regular expressions tutorial, we matched groups of characters by surrounding them with parentheses. This allows you to combine a sequence of literals and pattern characters with a quantifier to find repeating or optional matches. In this article we'll see some further features of the grouping constructs and their use with the .NET regular expressions engine.

To begin, let's recap with an example program. The following code finds the anchor tags within some HTML code:

string input = "For more information use the "
             + "<a href='http://www.blackwasp.co.uk/Contact.aspx'>contact form</a> "
             + "or check the list of "
             + "<a href='http://www.blackwasp.co.uk/FAQ.aspx'>frequently "
             + "asked questions</a>.";

string pattern = "(<a href=')(.*?)('>)(.*?)(</a>)";

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Matched '{0}'", match.Value);
}

/* OUTPUT

Matched '<a href='http://www.blackwasp.co.uk/Contact.aspx'>contact form</a>'
Matched '<a href='http://www.blackwasp.co.uk/FAQ.aspx'>frequently asked questions</a>'

*/

The regular expression used in the above code is rather naive, as it only finds anchors that are formatted exactly like those in the input string, with a single-quoted href as the only attribute. It uses five grouped subexpressions, each within parentheses, to perform the pattern matching. The subexpressions are:

Subexpression   Purpose
(<a href=')     Finds the literal text, "<a href='", identifying the starting part of an anchor tag, up to the starting position of the target URL.
(.*?)           Matches a series of consecutive characters in a lazy manner. This group matches the URL defined within an anchor.
('>)            Finds the two literal characters that close the anchor's opening tag.
(.*?)           Matches another series of characters in a lazy manner. This group finds the text between the anchor's opening and closing tags. This is the text that would be displayed as a hyperlink in a web browser.
(</a>)          Matches the end tag for the anchor.
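
The limitation mentioned above is easy to demonstrate. The following sketch tests the same pattern against one anchor in the expected format and two that differ slightly. The NaivePatternDemo class, the IsAnchor method and the sample strings are hypothetical and are not part of the original example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaivePatternDemo
{
    // The pattern from the example above.
    public const string Pattern = "(<a href=')(.*?)('>)(.*?)(</a>)";

    // Returns true if the HTML contains an anchor that the pattern can match.
    public static bool IsAnchor(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }

    public static void Main()
    {
        // Single-quoted href as the only attribute: matches.
        Console.WriteLine(IsAnchor("<a href='Contact.aspx'>contact form</a>"));

        // Double-quoted href: no match.
        Console.WriteLine(IsAnchor("<a href=\"Contact.aspx\">contact form</a>"));

        // An extra attribute before href: no match.
        Console.WriteLine(IsAnchor("<a class='nav' href='Contact.aspx'>contact form</a>"));
    }
}

/* OUTPUT

True
False
False

*/
```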

In the sample, grouping constructs are used purely to match the overall pattern correctly. However, they have many other uses. For example, you can extract the text matched by any of the subexpressions using .NET Framework classes. You can also use the text matched by one subexpression within another, or search and replace using the groups, both of which we'll see in future articles.

Obtaining Captured Groups

When you search for a regular expression that contains groups, the text matched by each group's subexpression can be obtained individually. The results are held in a collection of Group objects in the Groups property of the Match object. The Groups collection always contains at least one item, at index zero, which holds the full match. Any matched subexpressions follow from index one. They are numbered in the order in which the groups' opening parentheses appear in the pattern, so nested groups are included.
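
As a small illustration of this ordering, the sketch below lists every item in the Groups collection for a pattern in which two groups are nested inside another. The GroupOrderDemo class, the GetGroupValues method, the pattern and the input are all hypothetical and are not taken from the examples in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupOrderDemo
{
    // Returns the text of every group in the first match, in index order.
    public static string[] GetGroupValues(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        string[] values = new string[match.Groups.Count];

        for (int i = 0; i < match.Groups.Count; i++)
        {
            values[i] = match.Groups[i].Value;
        }

        return values;
    }

    public static void Main()
    {
        // An outer group containing two nested groups, followed by a fourth group.
        string pattern = @"((\d+)-(\d+)) (\w+)";
        string[] values = GetGroupValues("Order 123-456 shipped", pattern);

        for (int i = 0; i < values.Length; i++)
        {
            Console.WriteLine("Group {0}: '{1}'", i, values[i]);
        }
    }
}

/* OUTPUT

Group 0: '123-456 shipped'
Group 1: '123-456'
Group 2: '123'
Group 3: '456'
Group 4: 'shipped'

*/
```

The outer group opens first, so it takes index one. The two groups nested within it take indexes two and three, ahead of the final group.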

The following code performs the same matching as the first example. This time the output includes the text of the full match followed by two of the matched subexpressions. The first is the URL from the anchor, found at index 2. The second, at index 4, is the text that would be displayed as a hyperlink when the anchor is rendered in a web browser.

string input = "For more information use the "
             + "<a href='http://www.blackwasp.co.uk/Contact.aspx'>contact form</a> "
             + "or check the list of "
             + "<a href='http://www.blackwasp.co.uk/FAQ.aspx'>frequently "
             + "asked questions</a>.";

string pattern = "(<a href=')(.*?)('>)(.*?)(</a>)";

foreach (Match match in Regex.Matches(input, pattern))
{
    Console.WriteLine("Match: '{0}'", match.Value);
    Console.WriteLine("URL: '{0}'", match.Groups[2].Value);
    Console.WriteLine("Text: '{0}'", match.Groups[4].Value);
    Console.WriteLine();
}

/* OUTPUT

Match: '<a href='http://www.blackwasp.co.uk/Contact.aspx'>contact form</a>'
URL: 'http://www.blackwasp.co.uk/Contact.aspx'
Text: 'contact form'

Match: '<a href='http://www.blackwasp.co.uk/FAQ.aspx'>frequently asked questions</a>'
URL: 'http://www.blackwasp.co.uk/FAQ.aspx'
Text: 'frequently asked questions'

*/
10 October 2015