Nested Constructs in Regular Expressions
May 15th, 2003 by gregr
Today I was looking for a way to do nested expression matching with regular expressions, and pretty much came up empty. Then after a trip to the bookstore to pick up Mastering Regular Expressions by Jeffrey Friedl, I finally found it.
Interestingly, even now that I know what to search for :-), I can’t find a single reference to this on the net or on MSDN.
With the .NET regular expression evaluator, there are (?<DEPTH>) and (?<-DEPTH>) constructs that you can use to match nested expressions; for example, if you want to find matching parentheses, or matching HTML tags. Here’s a “simple” example that will match nested <div> tags:
<div> (?><div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))</div>
Which will match the part in red below:
<div>this is some <div>red</div>text</div></div></div>
This is pretty cool, I’ve got to say. I really can’t do this justice; if you’re interested, I recommend you pick up the book!
This entry was posted on Thursday, May 15th, 2003 at 5:45 pm and is filed under Uncategorized. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

May 15th, 2003 at 6:51 pm
The only mention I found of balancing groups on MSDN is the brief one at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpcongroupingconstructs.asp
The better discussion is, as you mention, in Mastering Regex. The .NET chapter is available online actually, which is good because I’m fairly certain there are more than a few first editions out there on people’s bookshelves.
http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf
http://www.oreilly.com/catalog/regex2/chapter/index.html.
Good stuff, Greg.
May 15th, 2003 at 7:33 pm
Actually, that MSDN page does not cover the DEPTH construct, which was the whole point of what I was trying to do.
But I wish I knew that chapter was online – would have saved me an hour or so today going to and from the book store!
May 15th, 2003 at 7:41 pm
Mmm…I stand corrected; it does talk about it. :-)
May 18th, 2003 at 10:55 am
http://weitz.de/regex-coach
May 18th, 2003 at 3:16 pm
Unless regex-coach uses the .NET regular expression parser, it wouldn’t be of any use – the particular feature I was talking about is a MS extension.
November 18th, 2003 at 10:30 pm
Greg,
Expresso (www.ultrapico.com) does use .NET regular expressions, so you might want to give it a try. Thanks for the reference on balancing groups and Friedl’s book. There is no “DEPTH” construct, per se. That is simply the name that Friedl chose for his capture group, any other name works as well. The key concept is “balancing groups”.
December 20th, 2004 at 8:58 am
Any idea if this can be used to easily get the data between non nested items. For example, if I wanted to get all the data between a <div>
and </div>
tag and I know there is no nested <div>
tags… I tried this but it doesn’t work.
June 5th, 2005 at 11:08 pm
Thanks for the post- this is working like a champ! Exactly what I needed. I am having trouble getting this to work with multiple lines of html though. If there are any newlines between the tags, this seems to break.. am I doing something wrong? Thanks, Paul
February 4th, 2007 at 12:08 am
Note that you can do this even without the .NET-exclusive balancing group feature, as long as you know in advance the maximum levels of nesting you need to support. See http://badassery.blogspot.com/2006/03/regex-recursion-without-balancing.html for details.
December 13th, 2007 at 7:52 am
(Moderator, please remove my previous comment)
The code as posted here (and everywhere I’ve seen it) for matching balancing constructs does not work.
The problem lies in the meat of the pattern:
(?>
<div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))
)
let’s say input is: “Beforedivs<div>.stuff</div>more stuff</div>Afterdivs”
The pattern matches:<div>.stuff</div>more stuff</div>
That’s not a balanced set of divs. Here’s why: the <div> is matched and the DEPTH stack is pushed. The first </div> is matched and the stack is popped. Now the (?(DEPTH)(?!) comes in and tests if there’s anything on the stack and there isn’t so the pattern continues greedily. The second </div> is finally matched, the EMTPY stack is popped and again the test sees nothing on the stack so the pattern does not fail. The pattern is forced to backtrack and allow the last </div> in the pattern to match, and we get the erroneous result. The pattern does not account for unmatched </div>’s.
The good news is that solution is easy. Make it:
(?>
<div (?) | </div (?) | .? )*? (?(DEPTH)(?!)
)
I made the whole parentheses subexpression NON-GREEDY with the ? mark. Now the pattern stops after every <div>,</div>, or single ‘anything’ character (.?) matched and sees if it can match that last </div>, and if so it stops. This new pattern now matches:
<div>stuff</div>
sZweaver2112@gmail.com (86 the z)
December 14th, 2007 at 1:00 pm
Just to follow up what I said yesterday, having read the post at an msdn blog on the subject:It is best to think of a Group as a Stack of captures. Where the top of the stack is the last capture made. (?\)) Matches to “)” and pops a capture off of the Open group’s capture stack. This match can only be successful if and only if the Open group’s capture stack is not empty. This is a fancy way of saying that for every match of this group there must be a match of the group Open.So a “pop” *should* fail on an empty stack and make the pattern fail, but in my recent experience this was just not the case.
sZweaver@gmail.com (86 the z)
January 25th, 2008 at 4:14 pm
This post covers some interesting uses for balancing groups: Fun With .NET’s Regex Balancing Groups