Pyftsubset fails to include codepoints outside BMP
It seems pyftsubset cannot handle characters that fall outside of Unicode’s Basic Multilingual Plane (BMP), i.e. whose code points need more than 16 bits, such as those in the Supplementary Multilingual Plane and other supplementary planes.
For example, I have some text containing 🕂 (U+1F542, Cross Pommee) and 🛒 (U+1F6D2, Shopping Trolley). Feeding the text, and a font that does contain the offending characters/glyphs, to pyftsubset:
pyftsubset BigFont.ttf --text="ABçdé🕂🛒" --layout-features+="*" --output-file=SubsettedFont.ttf
gets me a glyph-subsetted font containing all the glyphs that map to characters in my source text (A, B, ç, d, é, …) except for 🕂 and 🛒, i.e. those that are not in the BMP. pyftsubset is silent about this: it does not complain, the font is generated all right, the glyphs just won’t be there. You’ll notice this only when using the font…
Yet my source text is a valid UTF-8 string, which, as per pyftsubset --help, should parse all right:
--text=<text>
Specify characters to include in the subset, as UTF-8 string.
--text-file=<path>
Like --text but reads from a file. Newline character are not added to the subset.
I suppose pyftsubset is having difficulty with Unicode’s variable-length encoding. When we run the above command with the --verbose flag, we get the following (among other things) printed to the console:
Missing glyphs for requested Unicodes: ['U+DD42', 'U+DED2', 'U+D83D']
My guess: glyphs for those codepoints are indeed missing from the input font, because they are not characters at all. U+D83D, U+DD42 and U+DED2 fall in the UTF-16 surrogate range (U+D800 to U+DFFF), so pyftsubset appears to split each of the two characters (🕂🛒) into a UTF-16 surrogate pair and treat the code units as codepoints: U+D83D U+DD42 and U+D83D U+DED2, i.e. three distinct values.
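The three values in the verbose output can be checked against the UTF-16 encoding of the two characters. Below is a minimal Python 3 sketch; the helper name `utf16_code_units` is made up for illustration and is not part of fontTools. It lists the 16-bit code units that a UTF-16 encoder produces for the text.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+1F542 (Cross Pommee) and U+1F6D2 (Shopping Trolley), written as escapes
text = "\U0001F542\U0001F6D2"

units = utf16_code_units(text)
print([f"U+{u:04X}" for u in units])
# The distinct values, for comparison with the "Missing glyphs" message
print(sorted({f"U+{u:04X}" for u in units}))
```

Each supplementary-plane character becomes a pair of code units, and both pairs here start with the same high surrogate, which would explain why only three values are reported for two characters.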
Am I missing something? Or, is support for characters out of the BMP on the roadmap?
What can I do to circumvent this defect?
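One possible workaround, untested here, is to bypass --text and pass the codepoints explicitly through pyftsubset’s --unicodes option, which takes a comma-separated list of hex codepoints. The sketch below assumes Python 3, where iterating over a string yields whole codepoints rather than surrogates. The helper name `unicodes_argument` is hypothetical, and the file names are the ones from the command above.

```python
import shlex


def unicodes_argument(text):
    """Build a --unicodes=... argument from the distinct codepoints in `text`."""
    codepoints = sorted({ord(ch) for ch in text})
    return "--unicodes=" + ",".join(f"U+{cp:04X}" for cp in codepoints)


text = "ABçdé\U0001F542\U0001F6D2"
command = [
    "pyftsubset",
    "BigFont.ttf",
    unicodes_argument(text),
    "--layout-features+=*",
    "--output-file=SubsettedFont.ttf",
]
# Print a shell-quoted command line that can be pasted into a terminal
print(shlex.join(command))
```

Whether this avoids the problem depends on the surrogate splitting being confined to how --text is decoded, which is an assumption.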
Issue Analytics
- Created 7 years ago
- Comments: 15 (10 by maintainers)
Top GitHub Comments
Please let me know if #752 fixes the issue for you, thanks. If so, I’ll tag a new patch release.
The byte order mark is not a hack; it’s part of the Unicode standard. It is necessary to decode foreign UTF-16/UTF-32.