cmap format 2: single byte of 1-byte character vs first byte of 2-byte characters
I found a difference in subHeaderKeys[] of cmap subtable format 2 while modifying a font originally generated by makeotf. This subtable is for (legacy) MacJapanese, and was built from the 83pv-RKSJ-H CMap by makeotf. In the original font, subHeaderKeys[0xEF] to subHeaderKeys[0xFC] are all 376, while they are all 0 in the modified font. Please see my gist for details:
According to the OpenType spec, subHeaderKeys[] values follow this rule:
- If subHeaderKeys[0xhh] == 0, 0xhh is a single byte of a 1-byte character code
- If subHeaderKeys[0xhh] > 0, 0xhh is a first byte of 2-byte character codes
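The rule above can be sketched in a few lines of Python. This is only an illustration: classify_bytes and the example array are hypothetical names and data, not part of FontTools or makeotf. The array is filled in to match the values reported for the makeotf font (376 for 0xEF to 0xFC, 0 elsewhere).

```python
def classify_bytes(sub_header_keys):
    """Split byte values 0x00-0xFF into single bytes and first bytes."""
    if len(sub_header_keys) != 256:
        raise ValueError("subHeaderKeys must have exactly 256 entries")
    # key == 0: the byte is a complete 1-byte character code
    single_bytes = [b for b, key in enumerate(sub_header_keys) if key == 0]
    # key > 0: the byte is the first byte of 2-byte character codes
    first_bytes = [b for b, key in enumerate(sub_header_keys) if key > 0]
    return single_bytes, first_bytes


# Made-up array for illustration: only 0xEF-0xFC point at a subheader.
# 376 is the value reported for the makeotf font (subheader index 47 * 8).
example_keys = [0] * 256
for byte in range(0xEF, 0xFC + 1):
    example_keys[byte] = 376

single_bytes, first_bytes = classify_bytes(example_keys)
print([hex(b) for b in first_bytes])
```

With all of those entries set to 0 instead, as in the modified font, the same function would report 0xEF to 0xFC as single bytes.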
Since 0xEF to 0xFC are first bytes in MacJapanese, we can say the original font (makeotf) is correct and the modified font (FontTools) is wrong.
Currently, FontTools and TTX files don’t hold this subHeaderKeys[] data. So when there are no glyph mappings for the 1-byte code 0xhh and none for the 2-byte codes 0xhh??, it is impossible to tell whether 0xhh is a single byte or a first byte.
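To make the concern concrete, here is a minimal sketch of what can be recovered when only a code-to-glyph mapping is stored. The function infer_first_bytes and the sample mapping are hypothetical; this is not FontTools' actual compile logic, only an illustration of the information loss described above.

```python
def infer_first_bytes(mapping):
    """Guess first bytes from a {character code: glyph name} mapping.

    Assumption: any code above 0xFF is a 2-byte code whose high byte
    is a first byte. Bytes with no mapped codes leave no trace.
    """
    return {code >> 8 for code in mapping if code > 0xFF}


# Hypothetical mapping: one 1-byte code and one 2-byte code.
mapping = {0x41: "cid00001", 0x8140: "cid00002"}
inferred = infer_first_bytes(mapping)

# 0x81 is recognised because 0x8140 is mapped. 0xEF is a first byte in
# MacJapanese, but nothing in the mapping says so.
print(0x81 in inferred, 0xEF in inferred)
```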
Top GitHub Comments
I think there is not an issue here. I do see that starting with a cmap format 2 table made by makeotf (LogoCutStd-Bold.otf), dumping it with ttx, and then recompiling it with ttx, the subHeaderKeys array comes out differently. However, cmap format 2 allows different choices to be made about how to break up the subarrays in the glyphIndexArray that are still functionally equivalent. Different segment and subarray ordering results in different values in the subHeaderKeys array. Using both ttx and ‘spot’, I see that the charcode to glyph mapping is identical for both fonts. I might have argued for changing fonttools to use the same segment and subarray ordering as makeotf for consistency, but the cmap subtable produced by fonttools is smaller than that produced by makeotf, so I would vote for changing makeotf.

As a separate issue, one of mashabows’ concerns is that it is not possible to tell a one byte value from the start of a two byte value. This would be a problem if the ttx compiler had to decode the glyph encoding from a charstring that encoded multiple glyphs, but the encoding elements of a ttx cmap format 2 table give the entire code point for a single glyph.
'map code="0xed40" name="cid00002"' is an entry for a two byte code, and 'map code="0xed" name="cid000001"' is an entry for a one-byte code. There is no ambiguity.

Happy to help. Sorry about not noticing that quoted XML elements disappear in the posts - I fixed this, and now the last two sentences should make sense.
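The point about TTX map entries can be sketched as follows. The helper code_length is a hypothetical name, and the rule it applies (a code value up to 0xFF is a one-byte code, anything larger is a two-byte code) is an assumption drawn from the comment above, not FontTools' own implementation.

```python
import xml.etree.ElementTree as ET


def code_length(map_element_xml):
    """Return 1 or 2: the byte length implied by a TTX <map> element's code."""
    element = ET.fromstring(map_element_xml)
    code = int(element.get("code"), 16)  # accepts the "0x" prefix
    # Assumption: values that fit in one byte are one-byte codes.
    return 1 if code <= 0xFF else 2


two_byte = code_length('<map code="0xed40" name="cid00002"/>')
one_byte = code_length('<map code="0xed" name="cid000001"/>')
print(two_byte, one_byte)
```

This only covers bytes that have at least one mapped code; it says nothing about a byte value for which no entry exists.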