Here's some quick code to parse individual tags and text elements out of an HTML string. It might be handy for some people, and it's also a good example of some more advanced RegExp features. Note that you could also do this by parsing the string as XML and traversing it with E4X.
First, let's look at pulling out all of the tags:
var tags:Array = htmlText.match(/<[^<]+?>/g);
This code simply returns an array of substrings from the htmlText that match a simple regular expression. The regular expression matches any text that:
- < starts with <
- [^<]+? followed by one or more (+) characters that are not < ([^<]). This is a lazy, or non-greedy, repeat (?), which means it will match as few characters as possible before matching the next element in the pattern.
- > ends with >.
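To see the pattern at work, here is a minimal sketch written as a Flash CS3 timeline script. The sample string is made up for illustration and is not the one used in the demo.

```actionscript
// Hypothetical sample input (an assumption, not taken from the original demo).
var htmlText:String = "Some <b>HTML</b> <i>and</i> text";

// "<", then as few non-"<" characters as possible, then ">".
var tags:Array = htmlText.match(/<[^<]+?>/g);

// Print each tag on its own line.
for each (var tag:String in tags) {
    trace(tag);
}
// Output:
// <b>
// </b>
// <i>
// </i>
```

The g (global) flag is what makes match() return every tag rather than only the first one.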
Next, let's pull out the individual text elements:
var text:Array = input.htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);
This time the RegExp pattern is a bit more complex, incorporating a positive lookbehind and a positive lookahead (collectively known as lookarounds). A lookaround lets you require something before or after your main pattern without including it in the result.
- (?<=^|>) start with a positive lookbehind to match (but not return) either the beginning of the string or the end of a tag (^|>).
- [^><]+? followed by a lazy search for one or more characters that are neither > nor <.
- (?=<|$) finish by using a lookahead to match (but not return) the beginning of the next tag, or the end of the string (<|$).
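The three parts of the pattern can be tried out with a short sketch like the one below, again written as a timeline script. The sample string is an assumption for illustration, and a plain String stands in for the input.htmlText used in the post.

```actionscript
// Hypothetical sample input; the post reads from input.htmlText instead.
var htmlText:String = "Some <b>HTML</b> <i>and</i> text";

// Lookbehind: start of string or ">". Body: characters that are not ">" or "<".
// Lookahead: "<" or end of string.
var text:Array = htmlText.match(/(?<=^|>)[^><]+?(?=<|$)/g);

// Brackets make whitespace-only entries visible in the output panel.
for each (var piece:String in text) {
    trace("[" + piece + "]");
}
// Output:
// [Some ]
// [HTML]
// [ ]
// [and]
// [ text]
```

Because neither lookaround consumes characters, the tags themselves never show up in the results, while a run of whitespace between two adjacent tags is returned as its own entry.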
Here's a simple demo of the code in action:
Note: The entry in the text list that looks empty is actually a space that sits between the tags after "HTML" and the tags before "and".
You can download the Flash CS3 FLA for the above example by clicking here.
Comments (5)
Thanks! I've been looking for a quick way to do this. Nice job!
Posted by: Todd Perkins at March 13, 2008 02:31 PM
URL: http://www.chadandtoddcast.com
Great job! I was always doing this using while(htmlText.indexOf())..that was a real mess :-)
Best regards.
Posted by: dogeroski at March 14, 2008 04:55 AM
URL: http://www.deluxe.pl
Thank you Grant! I was looking for exactly that a while ago, but hadn't had time to go further into regex! Very useful, for example to get raw text from the Rich Text Editor of Flex...
I found this tool that simplifies editing regex patterns; maybe you already know about it:
Expresso by Ultrapico.
Best regards.
Posted by: Cedric M. (aka maddec) at March 18, 2008 12:11 PM
URL: http://analogdesign.ch
Hey Grant! Thanks for the Code, helps me a lot to understand RegExp better. Greets, Daniel
Posted by: Daniel B. at March 19, 2008 02:34 AM
URL: http://www.dbnetworx.de
Thanks.
I need a regex refresher. When I start to think of all the expressions my head hurts.
There is a tool called regexdesigner, written in VB.NET, for building regex strings to parse or search for text. It may be useful for some people.
Posted by: Eric at March 25, 2008 11:56 AM