Full Unicode support, namely for codepoints outside the BMP
Issue type
- Bug Report: yes
- Feature Request: kinda
- Question: no
- Not an issue: no
Prerequisites
- Can you reproduce the issue?: yes
- Did you search the repository issues?: yes
- Did you check the forums?: yes
- Did you perform a web search (google, yahoo, etc)?: yes
Description
JavaScript is, without some custom boilerplate, unable to properly deal with Unicode characters/codepoints outside the BMP, i.e., ones whose encoding requires more than 16 bits.
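As a minimal sketch of what that limitation looks like in plain JavaScript (no PEG.js involved; the variable names are illustrative only): strings are sequences of 16-bit code units, so a character such as U+1D400 is stored as a surrogate pair, and only the code-point-aware APIs added in ES6 treat it as one character.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP.
const boldA = "\u{1D400}";

// length and charCodeAt count UTF-16 code units, not code points.
const unitCount = boldA.length;
const firstUnit = boldA.charCodeAt(0).toString(16);  // high surrogate
const secondUnit = boldA.charCodeAt(1).toString(16); // low surrogate

// Code-point-aware APIs see a single character.
const codePoint = boldA.codePointAt(0).toString(16);
const pointCount = Array.from(boldA).length;

console.log({ unitCount, firstUnit, secondUnit, codePoint, pointCount });
// { unitCount: 2, firstUnit: 'd835', secondUnit: 'dc00', codePoint: '1d400', pointCount: 1 }
```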
This limitation seems to carry over to PEG.js, as shown in the example below.
In particular, I’d like to be able to specify ranges such as [\u1D400-\u1D419] (which presently turns into [ᵀ0-ᵁ9]) or equivalently [𝐀-𝐙] (which throws an “Invalid character range” error). Using the newish ES6 notation [\u{1D400}-\u{1D419}] results in the following error: SyntaxError: Expected "!", "$", "&", "(", ".", character class, comment, end of line, identifier, literal, or whitespace but "[" found.
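The [ᵀ0-ᵁ9] result is consistent with how the classic four-digit escape behaves in JavaScript itself. The sketch below uses plain JavaScript regular expressions, not PEG.js. It assumes PEG.js reads the escape the same way, which the class printed in the error suggests but which is not verified here.

```javascript
// The classic \uXXXX escape takes exactly four hex digits, so "\u1D400"
// is U+1D40 (ᵀ) followed by a literal "0", not U+1D400.
const misread = "\u1D400";
const isTwoChars = misread === "\u1D40" + "0";

// Without the u flag the class means: ᵀ, the range "0" to ᵁ, and "9".
const legacyClass = /^[\u1D400-\u1D419]+$/;
const legacyMatchesAscii = legacyClass.test("ABC");
const legacyMatchesBold = legacyClass.test("\u{1D400}\u{1D401}\u{1D402}");

// With the u flag, \u{...} escapes name whole code points.
const unicodeClass = /^[\u{1D400}-\u{1D419}]+$/u;
const unicodeMatchesBold = unicodeClass.test("\u{1D400}\u{1D401}\u{1D402}");
const unicodeMatchesAscii = unicodeClass.test("ABC");

console.log({ isTwoChars, legacyMatchesAscii, legacyMatchesBold });
// { isTwoChars: true, legacyMatchesAscii: true, legacyMatchesBold: false }
console.log({ unicodeMatchesBold, unicodeMatchesAscii });
// { unicodeMatchesBold: true, unicodeMatchesAscii: false }
```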
Might there be a way to make this work that does not require changes to PEG.js?
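One possible workaround, sketched here in plain JavaScript rather than tested against PEG.js, is to spell the range out as UTF-16 surrogate pairs. The helper toSurrogates is a hypothetical name introduced for illustration. Both ends of this particular range share the same high surrogate, so the whole range reduces to one fixed code unit followed by a small low-surrogate range.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// (toSurrogates is an illustrative helper, not part of any library.)
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();
const [hiA, loA] = toSurrogates(0x1d400);
const [hiZ, loZ] = toSurrogates(0x1d419);
console.log(hex(hiA), hex(loA)); // D835 DC00
console.log(hex(hiZ), hex(loZ)); // D835 DC19

// Same high surrogate at both ends: match it literally, then a low-surrogate range.
const mathUpper = /^(?:\uD835[\uDC00-\uDC19])+$/;
const matchesBold = mathUpper.test("\u{1D400}\u{1D401}\u{1D402}");
const matchesAscii = mathUpper.test("ABC");
console.log(matchesBold, matchesAscii); // true false
```

If PEG.js matches its input one code unit at a time, as the behaviour reported in this issue suggests, a grammar rule of the same shape might work without changes to PEG.js. This is an assumption and has not been run against 0.10.0:

MathUpper = ("\uD835" [\uDC00-\uDC19])+

A range that spans several high surrogates would need one such alternative per high surrogate.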
Steps to Reproduce
- Generate a parser from the grammar given below.
- Use it to try to parse something ostensibly-conforming.
Example code:
This grammar:
//MathUpper = [𝐀-𝐙]+
MathUpperEscaped = [\u1D400-\u1D419]+
Expected behavior:
The parser generated from the given grammar successfully parses, for example, “𝐀𝐁𝐂”.
Actual behavior:
A parse error: Line 1, column 1: Expected [ᵀ0-ᵁ9] but " (Or, when uncommenting the other rule, an “Invalid character range” error.)
Software
- PEG.js: 0.10.0
- Node.js: Not applicable.
- NPM or Yarn: Not applicable.
- Browser: All browsers I’ve tested.
- OS: macOS Mojave.
- Editor: All editors I’ve tested.
Issue Analytics
- Created 5 years ago
- Reactions: 7
- Comments: 15 (3 by maintainers)

@StoneCypher I love the fire in your heart! But why give the current maintainer a hard time? No one is owed anything. Why not maintain your own fork?
It seems the . (dot character) expression also needs Unicode mode. Compare:
Output:
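The same distinction can be illustrated with plain JavaScript regular expressions. This is only a sketch of the general behaviour, not the snippet from the comment above, and it makes no claim about how PEG.js compiles the dot expression.

```javascript
const boldA = "\u{1D400}"; // one code point, two UTF-16 code units

// Without the u flag, "." matches a single code unit.
const dotLegacy = /^.$/.test(boldA);
const twoDotsLegacy = /^..$/.test(boldA);

// With the u flag, "." matches a whole code point.
const dotUnicode = /^.$/u.test(boldA);

console.log({ dotLegacy, twoDotsLegacy, dotUnicode });
// { dotLegacy: false, twoDotsLegacy: true, dotUnicode: true }
```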