Full Unicode support, namely for codepoints outside the BMP

Issue type

  • Bug Report: yes
  • Feature Request: kinda
  • Question: no
  • Not an issue: no

Prerequisites

  • Can you reproduce the issue?: yes
  • Did you search the repository issues?: yes
  • Did you check the forums?: yes
  • Did you perform a web search (google, yahoo, etc)?: yes

Description

JavaScript is, without some custom boilerplate, unable to properly deal with Unicode characters/codepoints outside the BMP, i.e., ones whose encoding requires more than 16 bits.
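To make the limitation concrete, here is a small sketch using only standard string methods. The variable names are illustrative and not from the issue. It shows that a single character outside the BMP occupies two UTF-16 code units in a JavaScript string:

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP.
const boldA = '\u{1D400}';

// String.prototype.length counts UTF-16 code units, not code points.
const codeUnitCount = boldA.length;

// Array.from iterates by code point, so the pair counts once.
const codePointCount = Array.from(boldA).length;

// The two code units are a high surrogate followed by a low surrogate.
const units = [boldA.charCodeAt(0), boldA.charCodeAt(1)]
  .map((n) => n.toString(16));

// codePointAt reassembles the pair into the original code point.
const codePoint = boldA.codePointAt(0).toString(16);

console.log(codeUnitCount, codePointCount, units, codePoint);
// 2 1 [ 'd835', 'dc00' ] 1d400
```

Any tool that walks a string one code unit at a time will therefore see the two halves as separate "characters".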

This limitation seems to carry over to PEG.js, as shown in the example below.

In particular, I’d like to be able to specify ranges such as [\u1D400-\u1D419] (which presently turns into [ᵀ0-ᵁ9], because \u consumes only four hex digits) or equivalently [𝐀-𝐙] (which throws an “Invalid character range” error). (And using the newish ES6 notation [\u{1D400}-\u{1D419}] results in the following error: SyntaxError: Expected "!", "$", "&", "(", ".", character class, comment, end of line, identifier, literal, or whitespace but "[" found.)

Might there be a way to make this work that does not require changes to PEG.js?
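One possible workaround that leaves PEG.js untouched is to spell the range out in terms of UTF-16 surrogates, since the generated parser appears to match the input one code unit at a time. The sketch below is an assumption-laden illustration, not something proposed in the issue: `toSurrogates`, `esc`, and `astralRangeExpr` are hypothetical helpers, and the simple form only covers ranges whose endpoints share the same high surrogate.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point is not outside the BMP');
  }
  const offset = codePoint - 0x10000;
  return { high: 0xD800 + (offset >> 10), low: 0xDC00 + (offset & 0x3FF) };
}

// Hypothetical helper: format a code unit as a four-digit \uXXXX escape.
const esc = (n) => '\\u' + n.toString(16).toUpperCase().padStart(4, '0');

// Hypothetical helper: build a PEG expression for an astral range.
// Only handles ranges that stay within one high surrogate.
function astralRangeExpr(first, last) {
  if (first > last) throw new RangeError('range is reversed');
  const a = toSurrogates(first);
  const b = toSurrogates(last);
  if (a.high !== b.high) {
    throw new RangeError('range spans several high surrogates; split it first');
  }
  return `"${esc(a.high)}" [${esc(a.low)}-${esc(b.low)}]`;
}

// $(...) makes the rule return the matched text instead of nested arrays.
const grammar = `MathUpper = $(${astralRangeExpr(0x1D400, 0x1D419)})+`;
console.log(grammar);
// MathUpper = $("\uD835" [\uDC00-\uDC19])+
```

Feeding the resulting grammar string to PEG.js's `generate` should, under these assumptions, give a parser that accepts “𝐀𝐁𝐂”. Ranges that cross a high-surrogate boundary would need to be split into several alternatives.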

Steps to Reproduce

  1. Generate a parser from the grammar given below.
  2. Use it to try to parse something ostensibly-conforming.

Example code:

This grammar:

//MathUpper = [𝐀-𝐙]+
MathUpperEscaped = [\u1D400-\u1D419]+

Expected behavior:

The parser generated from the given grammar successfully parses, for example, “𝐀𝐁𝐂”.

Actual behavior:

A parse error: Line 1, column 1: Expected [ᵀ0-ᵁ9] but " (Or, when uncommenting the other rule, an “Invalid character range” error.)

Software

  • PEG.js: 0.10.0
  • Node.js: Not applicable.
  • NPM or Yarn: Not applicable.
  • Browser: All browsers I’ve tested.
  • OS: macOS Mojave.
  • Editor: All editors I’ve tested.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 7
  • Comments: 15 (3 by maintainers)

Top GitHub Comments

2 reactions
STRd6 commented, Feb 2, 2020

@StoneCypher I love the fire in your heart! But why give the current maintainer a hard time? No one is owed anything. Why not maintain your own fork?

1 reaction
vsemozhetbyt commented, Feb 18, 2020

It seems the . (dot character) expression also needs Unicode mode. Compare:

const string = '-🐎-👱-';

const symbols = (string.match(/./gu));
console.log(JSON.stringify(symbols, null, '  '));

const pegResult = require('pegjs/')
                 .generate('root = .+')
                 .parse(string);
console.log(JSON.stringify(pegResult, null, '  '));

Output:

[
  "-",
  "🐎",
  "-",
  "👱",
  "-"
]
[
  "-",
  "\ud83d",
  "\udc0e",
  "-",
  "\ud83d",
  "\udc71",
  "-"
]
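
Until the dot expression is code-point aware, one stopgap is to merge the split halves after parsing. The following is a minimal sketch, assuming the parser returns a flat array of single code units as in the output above. `joinSurrogates` is a hypothetical helper, not part of PEG.js.

```javascript
// Hypothetical post-processing helper: merge each high surrogate with the
// low surrogate that follows it, leaving every other element untouched.
function joinSurrogates(units) {
  const isHigh = (s) => s.length === 1 && s >= '\uD800' && s <= '\uDBFF';
  const isLow = (s) => s.length === 1 && s >= '\uDC00' && s <= '\uDFFF';
  const out = [];
  for (let i = 0; i < units.length; i++) {
    if (isHigh(units[i]) && i + 1 < units.length && isLow(units[i + 1])) {
      out.push(units[i] + units[i + 1]); // reassemble the pair
      i++; // skip the low surrogate that was just consumed
    } else {
      out.push(units[i]);
    }
  }
  return out;
}

// Same shape as the PEG.js output shown above.
const pegLikeResult = ['-', '\ud83d', '\udc0e', '-', '\ud83d', '\udc71', '-'];
const merged = joinSurrogates(pegLikeResult);
console.log(JSON.stringify(merged));
// ["-","🐎","-","👱","-"]
```

This only repairs the result shape; error positions reported by the parser would still be counted in code units.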