Full Unicode support, namely for codepoints outside the BMP

Issue type

Bug Report: yes
Feature Request: kinda
Question: no
Not an issue: no

Prerequisites

Can you reproduce the issue?: yes
Did you search the repository issues?: yes
Did you check the forums?: yes
Did you perform a web search (google, yahoo, etc)?: yes

Description

JavaScript is, without some custom boilerplate, unable to properly deal with Unicode characters/codepoints outside the BMP, i.e., ones whose UTF-16 encoding requires more than 16 bits (a surrogate pair).

This limitation seems to carry over to PEG.js, as shown in the example below.

In particular, I’d like to be able to specify ranges such as [\u1D400-\u1D419] (which presently turns into [ᵀ0-ᵁ9]) or, equivalently, [𝐀-𝐙] (which throws an “Invalid character range” error). Using the newish ES6 notation [\u{1D400}-\u{1D419}] instead results in the following error: SyntaxError: Expected "!", "$", "&", "(", ".", character class, comment, end of line, identifier, literal, or whitespace but "[" found.
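
For illustration, the [ᵀ0-ᵁ9] result is consistent with how the classic four-digit \uXXXX escape behaves in plain JavaScript: only four hex digits are consumed, and the fifth is left over as a literal character. The sketch below shows that in JavaScript itself (not in PEG.js), and also computes the UTF-16 surrogate pairs for the two range endpoints. The helper name toSurrogatePair is made up for this example.

```javascript
// The classic \uXXXX escape consumes exactly four hex digits,
// so '\u1D400' is U+1D40 (ᵀ) followed by the literal character '0'.
const classic = '\u1D400';
const classicUnits = Array.from(classic, (ch) => ch.codePointAt(0).toString(16));

// The ES6 \u{...} escape takes the whole code point, which JavaScript
// stores as two UTF-16 code units (a surrogate pair).
const es6 = '\u{1D400}';

// Hypothetical helper: split a code point above U+FFFF into its
// high and low UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);
  const low = 0xdc00 + (offset & 0x3ff);
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();
const rangeStart = toSurrogatePair(0x1d400).map(hex);
const rangeEnd = toSurrogatePair(0x1d419).map(hex);

console.log(classicUnits);         // code points of the classic-escape string
console.log(es6.length);           // number of UTF-16 code units
console.log(rangeStart, rangeEnd); // surrogate pairs of the range endpoints
```

Both endpoints share the same high surrogate, which is what makes a code-unit-level workaround possible for this particular range.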

Might there be a way to make this work that does not require changes to PEG.js?
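
One possible workaround, sketched here under the assumption that the generated parser compares individual UTF-16 code units, is to spell the range out as surrogate pairs: 𝐀 through 𝐙 (U+1D400 to U+1D419) all share the high surrogate \uD835, with low surrogates \uDC00 through \uDC19. The grammar string below is untested against PEG.js 0.10.0 and is only a suggestion; the regular expression without the u flag stands in for the same code-unit matching so the idea can be checked without PEG.js installed.

```javascript
// Hypothetical workaround grammar (an assumption, not verified against PEG.js 0.10.0):
// match the shared high surrogate, then a range over the low surrogate.
// It would be passed to require('pegjs').generate(grammar) as usual.
const grammar = String.raw`MathUpper = $("\uD835" [\uDC00-\uDC19])+`;

// A regex WITHOUT the u flag also works on UTF-16 code units,
// so it mirrors what the grammar above is meant to do.
const codeUnitPattern = /^(?:\uD835[\uDC00-\uDC19])+$/;

const upper = '\u{1D400}\u{1D401}\u{1D402}'; // 𝐀𝐁𝐂
const lower = '\u{1D41A}\u{1D41B}\u{1D41C}'; // 𝐚𝐛𝐜, just past the range

const matchesUpper = codeUnitPattern.test(upper);
const matchesLower = codeUnitPattern.test(lower);
const matchesAscii = codeUnitPattern.test('ABC');

console.log(grammar);
console.log(matchesUpper, matchesLower, matchesAscii);
```

This only stays compact because the whole range shares one high surrogate; a range spanning several high surrogates would need one alternative per high surrogate.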

Steps to Reproduce

Generate a parser from the grammar given below.
Use it to try to parse something ostensibly-conforming.

Example code:

This grammar:

//MathUpper = [𝐀-𝐙]+
MathUpperEscaped = [\u1D400-\u1D419]+

Expected behavior:

The parser generated from the given grammar successfully parses, for example, “𝐀𝐁𝐂”.

Actual behavior:

A parse error: Line 1, column 1: Expected [ᵀ0-ᵁ9] but " (Or, when uncommenting the other rule, an “Invalid character range” error.)

Software

PEG.js: 0.10.0
Node.js: Not applicable.
NPM or Yarn: Not applicable.
Browser: All browsers I’ve tested.
OS: macOS Mojave.
Editor: All editors I’ve tested.

Top GitHub Comments

2 reactions

STRd6 commented, Feb 2, 2020

@StoneCypher I love the fire in your heart! But why give the current maintainer a hard time? No one is owed anything. Why not maintain your own fork?

1 reaction

vsemozhetbyt commented, Feb 18, 2020

It seems the . (dot character) expression also needs Unicode mode. Compare:

const string = '-🐎-👱-';

const symbols = (string.match(/./gu));
console.log(JSON.stringify(symbols, null, '  '));

const pegResult = require('pegjs/')
                 .generate('root = .+')
                 .parse(string);
console.log(JSON.stringify(pegResult, null, '  '));

Output:

[
  "-",
  "🐎",
  "-",
  "👱",
  "-"
]

[
  "-",
  "\ud83d",
  "\udc0e",
  "-",
  "\ud83d",
  "\udc71",
  "-"
]
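
Until the dot expression is code-point aware, two stopgaps seem plausible; both are sketches, not PEG.js features. One is a grammar rule that tries a full surrogate pair before falling back to the dot (the grammar string below is an untested assumption). The other is to re-join the parser's output and split it again by code point. The regex without the u flag mirrors the grammar rule so the idea can be checked without PEG.js.

```javascript
const string = '-🐎-👱-';

// Hypothetical grammar rule (untested against PEG.js): prefer a
// high+low surrogate pair, otherwise accept any single code unit.
const grammar = String.raw`root = ($([\uD800-\uDBFF] [\uDC00-\uDFFF]) / .)+`;

// The same ordered choice expressed as a regex without the u flag.
const pairOrUnit = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;
const symbols = string.match(pairOrUnit);

// Post-processing alternative: take the code-unit array shown in the
// output above, join it, and let string iteration split it by code point.
const units = ['-', '\ud83d', '\udc0e', '-', '\ud83d', '\udc71', '-'];
const merged = Array.from(units.join(''));

console.log(JSON.stringify(symbols));
console.log(JSON.stringify(merged));
```

The post-processing route only helps when the result is a flat list of characters; for structured parse results the grammar-level rule is likely the more practical option.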
