Supplementary plane error due to a character I'm not actually using explicitly in my regex.
I have the following regex in my grammar file:
[\t\n\r\u0020-\uD7FF\uE000\uFFFD\u10000-\u10FFFF]
And I get the following error due to the regex above:
Error: You have Unicode Supplementary Plane content in a regex set: JavaScript has severe problems with Supplementary Plane content, particularly in regexes, so you are kindly required to get rid of this stuff. Sorry! (Offending UCS-2 code which triggered this: 0xd800)
I know the regex is the source of the error because if I remove it, then everything is fine. Specifically, the problem is \u0020-\uD7FF.
Looking at the code in regexp-lexer.js, I’ve deduced that the problem occurs when jison-gho computes an inverted character set while trying to optimize the regular expression. When it computes the inverted set, one of the range boundaries is 0xD7FF + 1 (that is, 0xD800, the value reported in the error message), and the error is triggered.
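To make the boundary arithmetic concrete, here is a minimal sketch of inverting a set of ranges. It is not the code in regexp-lexer.js: `invertRanges` is a hypothetical helper, and it assumes the set is a list of inclusive ranges over 0x0000..0xFFFF. It models only the BMP part of the set as written above and leaves out the trailing \u10000-\u10FFFF piece.

```javascript
// Hypothetical helper, NOT jison-gho's actual implementation: returns the
// complement of a list of inclusive [lo, hi] ranges within 0x0000..max.
function invertRanges(ranges, max = 0xFFFF) {
  const sorted = ranges.map(([lo, hi]) => [lo, hi]).sort((a, b) => a[0] - b[0]);
  const inverted = [];
  let next = 0; // first value not yet covered by any input range
  for (const [lo, hi] of sorted) {
    if (lo > next) inverted.push([next, lo - 1]); // gap before this range
    next = Math.max(next, hi + 1);
  }
  if (next <= max) inverted.push([next, max]); // tail after the last range
  return inverted;
}

// BMP part of the set as written: \t \n \r \u0020-\uD7FF \uE000 \uFFFD
const bmpSet = [
  [0x09, 0x0A], [0x0D, 0x0D], [0x20, 0xD7FF], [0xE000, 0xE000], [0xFFFD, 0xFFFD],
];

const hex = (n) => "0x" + n.toString(16).toUpperCase().padStart(4, "0");
const inverted = invertRanges(bmpSet);
console.log(inverted.map(([lo, hi]) => `${hex(lo)}-${hex(hi)}`).join(" "));
// One of the resulting ranges is 0xD800-0xDFFF: its lower bound is 0xD7FF + 1.
```

Under these assumptions the complement contains a range starting at 0xD800 even though the grammar never mentions that value, which matches the code reported in the error.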
I can understand complaining about a user-written regular expression that goes into the supplementary plane, but here we’re talking about a regular expression that is computed behind the scenes. Should there even be an error raised on the inverted set which is computed internally?
By the way @lddubeau: do note that you seem to use Supplementary Plane Unicode though, due to this bit of regex: \u10000-\u10FFFF, which then should be written as \u{10000}-\u{10FFFF} (see also: https://rainsoft.io/what-every-javascript-developer-should-know-about-unicode/ ).
However, this is currently a moot point as JISON doesn’t support Astral Plane Unicode Codepoints (a.k.a. Supplementary Plane Characters, i.e. anything above U+FFFF). I’m looking into supporting ES2015 regex /u flag though, but that’s future noise.
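For reference, this is how plain JavaScript regexes (not JISON grammar rules) treat the two spellings. It is a small sketch using only standard ES2015 behaviour; the variable names are illustrative.

```javascript
// Without the u flag, \u consumes exactly four hex digits, so \u10000 is
// U+1000 followed by the literal character "0".
const legacy = /^\u10000$/;
console.log(legacy.test("\u1000" + "0")); // true
console.log(legacy.test("\u{10000}"));    // false

// With the u flag (ES2015), \u{...} names a full code point.
const astral = /^[\u{10000}-\u{10FFFF}]$/u;
console.log(astral.test("\u{1F600}")); // true
console.log(astral.test("A"));         // false

// An astral character is stored as two UTF-16 code units (a surrogate pair),
// which is where values such as 0xD800 come from.
const s = "\u{10000}";
console.log(s.length);                     // 2
console.log(s.charCodeAt(0).toString(16)); // d800
console.log(s.charCodeAt(1).toString(16)); // dc00
```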
For now I’ll first create a new patch release to mark the fixing of all the other bugs your issue report helped uncover! 👍
FYI: this will take a while to fix; I’ve got little spare time ATM and this needs some internal rework to work correctly for the entire Unicode range.