While developing the Spelling Plus Library, and more recently while adding multilingual support to it, I discovered two serious bugs in how the regular expression implementation in ActionScript handles accented characters.
First, RegExp in AS3 does not include accented characters in the word character class. For example, the pattern /\w+/ (match one or more word characters) matches "r" and "sume" in "résume", when it should match the full string. UPDATE: Arthur has pointed out in the comments that this is correct according to the ECMAScript and POSIX regex specifications: \w is intended to match just the set [a-zA-Z0-9_], which it does in AS3. With that understood, it would be nice to have support for Unicode property sets (which allow you to match word characters in any language, among other things), but I can understand that this may have an unacceptable impact on the size of the Flash Player.
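To make the difference concrete, here is a minimal sketch in TypeScript rather than AS3, since both follow the ECMAScript definition of \w. It compares plain \w+ with a Unicode property escape (available in newer ECMAScript engines behind the "u" flag) and with a hand-written Latin-1 range as a fallback. The helper name allMatches and the choice of fallback range are my own assumptions for illustration, and none of this shows what the Flash Player itself supports.

```typescript
// "résume" with a precomposed é (U+00E9), as in the example above.
const sample: string = "r\u00e9sume";

// Hypothetical helper: returns every match of a global pattern (empty array if none).
function allMatches(pattern: RegExp, text: string): string[] {
  const found = text.match(pattern);
  return found === null ? [] : found.slice();
}

// ECMAScript \w is exactly [A-Za-z0-9_], so the accented letter splits the word.
const asciiWords = allMatches(/\w+/g, sample); // ["r", "sume"]

// \p{L} matches a letter in any script; it requires the "u" flag.
// Built from a string here so the pattern is only parsed at run time.
const unicodeWords = allMatches(new RegExp("\\p{L}+", "gu"), sample); // ["résume"]

// Fallback without property escapes: add the Latin-1 letters À-ÿ explicitly
// (skipping × and ÷). Assumption: Latin-1 coverage is enough for the text at hand.
const latin1Words = allMatches(
  /[A-Za-z0-9_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+/g,
  sample
); // ["résume"]
```

The explicit range is the kind of longer pattern you end up writing when the engine offers no property escapes; it only covers the characters you list, so other scripts need their own ranges.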
Secondly, there is a somewhat obscure problem with how the Flash Player matches accented characters against \S. Specifically, it appears to miscount accented characters when matching them against \S, which produces inconsistent matches. This is not the case with the negated whitespace character set [^\s], although the two should behave identically in a regular expression. This issue is pretty weird, so I'll give a few examples:
- the pattern /\S+/ (one or more not-whitespace chars) will match the full string of "é aé", when it should match "é" and "aé" separately.
- the same pattern /\S+/ will match "aé" and "bé" correctly for the string "aé bé".
- the pattern /\S{2,}/ (two or more not-whitespace chars) will match the full string "aé bcé" when it should match "aé" and "bcé".
- the same pattern /\S{2,}/ will only match "bcé" for the string "éa bcé", when it should match "éa" and "bcé".
All of the above work properly if you substitute [^\s] for \S.
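The four cases above can be turned into a small check that runs \S and [^\s] side by side and reports whether they agree. This is a sketch in TypeScript, not AS3; the names ClassCheck, matchesOf and compareClasses are hypothetical. In a conforming ECMAScript engine the two forms agree on every case, so a port of this harness to AS3 is one way to reproduce the mismatch described here.

```typescript
// Result of running both spellings of "not whitespace" over one input.
interface ClassCheck {
  text: string;
  quantifier: string;
  shorthand: string[]; // matches for \S
  negated: string[];   // matches for [^\s]
  agree: boolean;
}

// Hypothetical helper: all matches of a pattern source, applied globally.
function matchesOf(source: string, text: string): string[] {
  const found = text.match(new RegExp(source, "g"));
  return found === null ? [] : found.slice();
}

function compareClasses(text: string, quantifier: string): ClassCheck {
  const shorthand = matchesOf("\\S" + quantifier, text);
  const negated = matchesOf("[^\\s]" + quantifier, text);
  const agree =
    shorthand.length === negated.length &&
    shorthand.every((m, i) => m === negated[i]);
  return { text, quantifier, shorthand, negated, agree };
}

// The four inputs from the list above, with é written as U+00E9.
const checks: ClassCheck[] = [
  compareClasses("\u00e9 a\u00e9", "+"),      // expected: "é", "aé"
  compareClasses("a\u00e9 b\u00e9", "+"),     // expected: "aé", "bé"
  compareClasses("a\u00e9 bc\u00e9", "{2,}"), // expected: "aé", "bcé"
  compareClasses("\u00e9a bc\u00e9", "{2,}"), // expected: "éa", "bcé"
];

// True when \S and [^\s] produced the same matches for every input.
const allAgree: boolean = checks.every((c) => c.agree);
```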
Hopefully this is helpful for other people working with RegExp, especially with languages other than English. It is quite frustrating to work around - I ended up writing a specialized character lexer instead of using RegExp in SPL.
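For anyone curious what the lexer approach looks like, here is a rough sketch of the idea in TypeScript. It is not the SPL code: the names WHITESPACE, Token and lexRuns are made up for this example, and the whitespace list is an assumption covering only a subset of what \s matches. It walks the string one character at a time and collects runs of non-whitespace, which sidesteps \S entirely.

```typescript
// Assumed whitespace set: space, tab, newline, carriage return,
// form feed, vertical tab and no-break space.
const WHITESPACE: string = " \t\n\r\f\v\u00a0";

function isWhitespace(ch: string): boolean {
  return WHITESPACE.indexOf(ch) !== -1;
}

interface Token {
  text: string;  // the run of non-whitespace characters
  start: number; // index of the run's first character in the input
}

// Collects runs of non-whitespace characters that are at least minLength long,
// roughly what /\S{minLength,}/g is meant to return.
function lexRuns(text: string, minLength: number = 1): Token[] {
  const tokens: Token[] = [];
  let start = -1; // -1 means "not currently inside a run"
  for (let i = 0; i <= text.length; i++) {
    const atBoundary = i === text.length || isWhitespace(text.charAt(i));
    if (!atBoundary) {
      if (start === -1) start = i; // a new run begins here
      continue;
    }
    if (start !== -1) {
      // A run just ended; keep it only if it is long enough.
      if (i - start >= minLength) {
        tokens.push({ text: text.substring(start, i), start });
      }
      start = -1;
    }
  }
  return tokens;
}

// The string that tripped up /\S{2,}/ above, with é written as U+00E9.
const runs: Token[] = lexRuns("\u00e9a bc\u00e9", 2); // "éa" at 0, "bcé" at 3
```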
Know of any other RegExp bugs in AS3? Share them in the comments.
Comments (9)
A wild guess as to the problem with \S matching too many characters: it has a problem with some cases of multi-byte character runs, which wouldn't be very surprising since regexps suck at non-ASCII on all systems I've used them in.
The regexp engine in Firefox seems to handle all the \S+ cases (although it has the same basic problem of \w not matching accented characters).
Posted by: Theo at April 30, 2008 02:03 AM
URL: http://blog.iconara.net
Theo,
Yes, this was my thought too. It's not counting the multi-byte character correctly in this case for some reason. Matching the trailing space is a little strange as well, but is likely related to the same problem. My guess would be the counting problem causes it to skip trying to match the space character completely.
Posted by: Grant Skinner at April 30, 2008 08:50 AM
URL: http://gskinner.com/blog/
Hi Grant.
Regarding accented letters: while this is a bit counterintuitive, it's actually part of the ECMA-262 spec [1]. The \w character class is just a shortcut for the a-z, A-Z and 0-9 ranges plus "_", which does not include accented letters.
Cheers
Arthur Debert
[1] The spec: http://www.ecma-international.org/cgi-bin/counters/unicounter.pl?name=Ecma-262&deliver=http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf
[2] The POSIX regex spec: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html
Posted by: Arthur Debert at April 30, 2008 08:52 AM
URL: http://www.stimuli.com.br/
Arthur,
Right you are - my bad. I guess the problem then is that AS3's RegExp implementation does not include support for any extended character classes (e.g. Unicode property sets), though I can understand that this may be due to file size implications in the player.
I'll update the article to reflect this.
Posted by: Grant Skinner at April 30, 2008 09:15 AM
URL: http://gskinner.com/blog/
I pointed out this bug a year ago, but no one listened to me. I hope they listen to you now!
Posted by: Marcos Neves at April 30, 2008 09:55 AM
I live in México, and I have known about this bug since the first Flex SDK came out. Today it is common practice here to use more complicated RegExp patterns to do anything with Spanish text.
Posted by: Quantium at April 30, 2008 11:03 AM
URL: http://www.quantium.com.mx
I think I've found another regex bug:
Any idea why
/^(.*)-(.*)$/
doesn't find
aaaa - bbbb
Posted by: Nikos Katsikanis at October 9, 2008 02:29 AM
URL: http://www.ecommercetotal.co.uk
Great list, it helps clear up much of the .htaccess mystery and confusion that comes from creating such files.
Posted by: clearance london at November 28, 2008 09:41 AM
URL: http://www.wecleareverything.co.uk
Nikos - I just tested that pattern in RegExr, and it seems to work fine for me.
Posted by: Grant Skinner at December 5, 2008 10:13 AM
URL: http://gskinner.com/blog/