Pyftsubset fails to include codepoints outside BMP

It seems pyftsubset cannot handle characters that fall outside Unicode’s Basic Multilingual Plane (BMP), i.e. code points above U+FFFF that need more than 16 bits (?), such as those in the Supplementary Multilingual Plane and the other supplementary planes.

For example, I have some text containing 🕂 (U+1F542, Cross Pommee) and 🛒 (U+1F6D2, Shopping Trolley). Feeding the text, and a font that does contain the offending characters/glyphs, to pyftsubset:

pyftsubset BigFont.ttf --text="ABçdé🕂🛒" --layout-features+="*" --output-file=SubsettedFont.ttf

gets me a glyph-subsetted font containing all the glyphs that map to characters in my source text (A, B, ç, d, é, …) except for 🕂 and 🛒, i.e. those that are not in the BMP. pyftsubset is silent about this: it does not complain, the font is generated all right, but the glyphs just won’t be there. You’ll notice this only when using the font…

Yet my source text is a valid UTF-8 string, which, as per pyftsubset --help, should parse all right:

--text=<text>
    Specify characters to include in the subset, as UTF-8 string.
--text-file=<path>
    Like --text but reads from a file. Newline character are not added the the subset.

I suppose pyftsubset is having difficulty with Unicode’s variable-length encoding. When we run the above command with the --verbose flag, we get the following (among other things) printed to the console:

Missing glyphs for requested Unicodes: ['U+DD42', 'U+DED2', 'U+D83D']

My guess: glyphs for those code points are indeed missing from the input font because they are not characters at all. U+D83D, U+DD42 and U+DED2 all fall in the UTF-16 surrogate range (U+D800 to U+DFFF). It looks as if pyftsubset splits the two characters (🕂🛒) into UTF-16 surrogate code units, one shared high surrogate and two low surrogates, and then treats each of the three as a code point of its own.
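
For what it’s worth, the three values in the verbose output match what you get when the two characters are split into UTF-16 surrogate pairs. A minimal Python 3 sketch of that arithmetic (the helper name to_surrogate_pair is made up for illustration and is not part of fontTools):

```python
def to_surrogate_pair(codepoint):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = codepoint - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


for ch in "\U0001F542\U0001F6D2":      # the two characters from the example text
    high, low = to_surrogate_pair(ord(ch))
    print("U+%04X -> U+%04X U+%04X" % (ord(ch), high, low))
# U+1F542 -> U+D83D U+DD42
# U+1F6D2 -> U+D83D U+DED2
```

Both characters share the high surrogate U+D83D, which would explain why only three distinct values are reported for two characters.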

Am I missing something? Or, is support for characters out of the BMP on the roadmap?

What can I do to circumvent this defect?
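
One possible way around it, assuming the problem is confined to how the --text string is decoded: pass the code points explicitly through pyftsubset’s --unicodes option, which takes hex values rather than a string. Below is a minimal Python 3 sketch that builds that argument. The helper name unicodes_arg is hypothetical, the file names are the ones from the command above, and I have not checked this against the affected fontTools version.

```python
import shlex


def unicodes_arg(text):
    """Return the distinct code points of `text` as 'U+XXXX,U+XXXX,...'."""
    codepoints = sorted(set(ord(ch) for ch in text))
    return ",".join("U+%04X" % cp for cp in codepoints)


# Same subset request as before, but with explicit code points instead of --text.
command = [
    "pyftsubset", "BigFont.ttf",
    "--unicodes=" + unicodes_arg("AB\u00e7d\u00e9\U0001F542\U0001F6D2"),
    "--layout-features+=*",
    "--output-file=SubsettedFont.ttf",
]

# Print a shell-safe version of the command; the list itself could also be
# handed to subprocess.run() unchanged.
print(" ".join(shlex.quote(arg) for arg in command))
```

Note that this relies on Python 3, where iterating over a string yields whole code points; on a narrow Python 2 build the same loop would yield surrogates again.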

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

1 reaction
anthrotype commented, Nov 29, 2016

Please let me know if #752 fixes the issue for you, thanks. If so, I’ll tag a new patch release.

1 reaction
adrientetar commented, Nov 29, 2016

BOM being bit-order-mark — the common Unicode encoding hack

The byte order mark is not a hack; it’s part of the Unicode standard. It is necessary to decode foreign UTF-16/UTF-32.
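
To illustrate the point about byte order, here is a small Python 3 sketch using only the standard codecs module. It assumes nothing about fontTools: it just shows that the same bytes decode to different text depending on the assumed byte order, and that a leading BOM lets the generic "utf-16" codec pick the right one.

```python
import codecs

text = "\U0001F542"                  # U+1F542, one of the characters from the issue
raw_be = text.encode("utf-16-be")    # big-endian code units, no BOM

# With a BOM in front, the generic "utf-16" codec detects the byte order
# and strips the mark before returning the text.
with_bom = codecs.BOM_UTF16_BE + raw_be
decoded = with_bom.decode("utf-16")

# Without a BOM the reader has to guess. Guessing little-endian here raises
# no error; it silently yields two unrelated BMP characters.
misread = raw_be.decode("utf-16-le")

print(decoded == text)
print(["U+%04X" % ord(ch) for ch in misread])
```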
