Pyftsubset fails to include codepoints outside BMP

It seems pyftsubset cannot handle characters that fall outside Unicode’s Basic Multilingual Plane (BMP), i.e. code points above U+FFFF, which need more than 16 bits, such as those in the Supplementary Multilingual Plane and other supplementary planes.

For example, I have some text containing 🕂 (U+1F542, Cross Pommee) and 🛒 (U+1F6D2, Shopping Trolley). Feeding the text, and a font that does contain the offending characters/glyphs, to pyftsubset:

pyftsubset BigFont.ttf --text="ABçdé🕂🛒" --layout-features+="*" --output-file=SubsettedFont.ttf

gets me a glyph-subsetted font containing all the glyphs that map to characters in my source text (A, B, ç, d, é, …) except for 🕂 and 🛒, i.e. those that are not in the BMP. pyftsubset is silent about this: it does not complain, the font is generated all right, the glyphs just won’t be there. You’ll notice this only when using the font…

Yet my source text is a valid UTF-8 string, which, as per pyftsubset --help, should parse all right:
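
Because the failure is silent, it can help to compare the subsetted font's character map against the text that was requested. The sketch below assumes the map is available as a dict keyed by integer code point, which is the shape fontTools' `TTFont.getBestCmap()` returns. The helper name `missing_codepoints` and the sample map are made up for illustration, and Python 3 is assumed.

```python
def missing_codepoints(text, cmap):
    """Return the code points of `text` that have no entry in `cmap`,
    sorted and formatted as 'U+XXXX' strings."""
    wanted = {ord(ch) for ch in text}
    return ["U+%04X" % cp for cp in sorted(wanted - set(cmap))]


# Hypothetical character map standing in for the subsetted font's cmap;
# with fontTools installed it could come from
# TTFont("SubsettedFont.ttf").getBestCmap() instead (assumption).
cmap = {0x41: "A", 0x42: "B", 0xE7: "ccedilla", 0x64: "d", 0xE9: "eacute"}

# The text from the command above, written with escapes to avoid
# any ambiguity about the source file's encoding.
requested = "AB\u00e7d\u00e9\U0001F542\U0001F6D2"
print(missing_codepoints(requested, cmap))
```

An empty list would mean every requested character made it into the subset.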

--text=<text>
    Specify characters to include in the subset, as UTF-8 string.
--text-file=<path>
    Like --text but reads from a file. Newline character are not added to the subset.

I suppose pyftsubset is having difficulty with Unicode’s variable-length encoding. When we run the above command with the --verbose flag, we get the following (among other output) printed to the console:

Missing glyphs for requested Unicodes: ['U+DD42', 'U+DED2', 'U+D83D']

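My guess: glyphs for those code points are indeed missing from the input font, because they are not characters at all. U+D83D, U+DD42 and U+DED2 are UTF-16 surrogate code units: in UTF-16, 🕂 is encoded as D83D DD42 and 🛒 as D83D DED2. So pyftsubset appears to treat each 16-bit code unit as a separate code point instead of combining each pair into one supplementary-plane code point (the shared high surrogate U+D83D is listed only once).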
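
The values in the verbose output can be checked against the UTF-16 surrogate pairs of the two requested characters. This is a minimal sketch of the standard UTF-16 arithmetic; the helper name `utf16_surrogates` is illustrative and not part of fontTools.

```python
def utf16_surrogates(codepoint):
    """Split a supplementary-plane code point into its UTF-16
    (high surrogate, low surrogate) pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


for ch in "\U0001F542\U0001F6D2":
    high, low = utf16_surrogates(ord(ch))
    print("U+%04X -> U+%04X U+%04X" % (ord(ch), high, low))
```

Both characters share the same high surrogate, which would explain why only three distinct values are reported for two characters.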

Am I missing something? Or, is support for characters out of the BMP on the roadmap?

What can I do to circumvent this defect?
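
One possible workaround, not confirmed anywhere in this thread, is to skip --text and pass the code points as hex numbers through pyftsubset's --unicodes option. The sketch below builds that argument from a string. It assumes Python 3, where `ord()` returns full code points, and the helper name `unicodes_argument` is hypothetical.

```python
def unicodes_argument(text):
    """Build a pyftsubset --unicodes argument listing every distinct
    code point in `text` as a comma-separated hex value."""
    codepoints = sorted({ord(ch) for ch in text})
    return "--unicodes=" + ",".join("%04X" % cp for cp in codepoints)


text = "AB\u00e7d\u00e9\U0001F542\U0001F6D2"
# Print a command line that could be pasted into a shell.
print("pyftsubset BigFont.ttf", unicodes_argument(text),
      "--output-file=SubsettedFont.ttf")
```

Whether this sidesteps the problem in the affected version is an assumption; the idea is only that hex numbers never pass through the text decoding that produced the surrogates.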

Issue Analytics

State:
Created 7 years ago
Comments: 15 (10 by maintainers)

Top GitHub Comments

anthrotype commented, Nov 29, 2016

Please let me know if #752 fixes the issue for you, thanks. If so, I’ll tag a new patch release.

adrientetar commented, Nov 29, 2016

BOM being bit-order-mark — the common Unicode encoding hack

The byte order mark is not a hack, it’s part of the Unicode standard. It is needed to tell the byte order of UTF-16/UTF-32 data that comes from an unknown source.
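
As a small illustration of that point, here is a sketch using Python 3's standard codecs (it is not taken from fontTools). The generic "utf-16" codec reads the BOM to choose the byte order, whereas decoding BOM-less bytes with the wrong byte order silently yields different text.

```python
import codecs

text = "A\U0001F542"

# The same text serialized in both byte orders, each prefixed with its BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic codec inspects the BOM, picks the byte order, and strips it.
print(little.decode("utf-16") == text)
print(big.decode("utf-16") == text)

# Without a BOM the decoder has to be told the order; guessing wrong
# produces unrelated characters rather than an error for this input.
wrong = text.encode("utf-16-be").decode("utf-16-le", errors="replace")
print(wrong == text)
```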
