[ttx] Non-ascii in CFF FullName string doesn't round-trip

A font mentioned in #2897 uses non-ascii in the CFF FullName field. This is somehow dumped to .ttx with an unclear encoding, and reading the dumped .ttx fails on the non-ascii chars:

Compiling "/Users/just/Downloads/LeeSeoyun.ttx" to "/Users/just/Downloads/LeeSeoyun#1.otf"...
Parsing 'GlyphOrder' table...
Parsing 'head' table...
Parsing 'hhea' table...
Parsing 'maxp' table...
Parsing 'OS/2' table...
Parsing 'name' table...
Parsing 'cmap' table...
Parsing 'post' table...
Parsing 'CFF ' table...
ERROR: Unhandled exception has occurred
Traceback (most recent call last):
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 405, in main
    process(jobs, options)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 387, in process
    action(input, output, options)
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/loggingTools.py", line 372, in wrapper
    return func(*args, **kwds)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 298, in ttCompile
    ttf.importXML(input)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttLib/ttFont.py", line 349, in importXML
    reader.read()
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 47, in read
    self._parseFile(self.file)
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 72, in _parseFile
    parser.Parse(chunk, 0)
  File "/Users/sysadmin/build/v3.10.5/Modules/pyexpat.c", line 470, in EndElement
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 155, in _endElementHandler
    self.currentTable.fromXML(name, attrs, content, self.ttFont)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttLib/tables/C_F_F_.py", line 46, in fromXML
    self.cff.fromXML(name, attrs, content, otFont)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 346, in fromXML
    topDict.fromXML(name, attrs, content)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 2618, in fromXML
    value = conv.xmlRead(name, attrs, content, self)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 1358, in xmlRead
    return tobytes(attrs["value"], encoding=("ascii"))
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/textTools.py", line 131, in tobytes
    return s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11: ordinal not in range(128)

OTMaster somehow manages to read this field as a Korean string (which makes sense, given the font).

Yet fonttools reads it as gibberish: 'ì\x9d´ì\x84\x9cì\x9c¤ì²´'.
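
The gibberish above is consistent with UTF-8 bytes having been decoded as Latin-1. The following is a minimal sketch under that assumption (the issue itself does not state which encoding the font uses): it turns the reported string back into raw bytes, reinterprets them as UTF-8, and reproduces the `UnicodeEncodeError` from the traceback. The variable names are illustrative only.

```python
# The string fonttools reported for FullName, copied from the issue text.
garbled = "ì\x9d´ì\x84\x9cì\x9c¤ì²´"

# Assumption: the font stores FullName as UTF-8 bytes and they were decoded
# as Latin-1. Encoding back to Latin-1 recovers the raw bytes unchanged.
raw = garbled.encode("latin-1")

# Reinterpret the same bytes as UTF-8 to get readable text.
recovered = raw.decode("utf-8")
print(recovered)

# Reproduce the failure from the traceback: the XML attribute value
# cannot be encoded as ASCII.
try:
    garbled.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

If the assumption holds, `recovered` is a four-syllable Hangul string, which would match what OTMaster displays.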

Top GitHub Comments

schriftgestalt commented, Nov 20, 2022

The current version should make sure that it contains ASCII only.

justvanrossum commented, Nov 21, 2022

> but encoding a previously utf-8 string to latin1 could fail, or decoding a utf-8 string as latin-1 give you gibberish - how would that round-trip?

We start with a byte string without knowing its encoding. Decoding it as Latin-1 can’t fail, because every byte value from 0 to 255 maps to a code point. Encoding the resulting string as UTF-8 can’t fail either, and the same holds in reverse. Yes, the XML will contain gibberish, but it does round-trip:

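# Build a string containing every code point from U+0000 to U+00FF.
s = "".join(chr(i) for i in range(256))
# Latin-1 maps each of those code points to the byte with the same value.
b = s.encode("latin-1")
assert list(b) == list(range(256))
# The same string survives a UTF-8 encode/decode, as it would in a UTF-8 .ttx file.
utf = s.encode("utf-8")
assert utf.decode("utf-8") == s

# Going the other way: any byte string decodes as Latin-1 without error.
b2 = bytes(range(256))
s2 = b2.decode("latin-1")
assert s2 == s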

Not saying this is a nice solution, but given there is no encoding information at all, and given any non-ascii in FullName can be considered broken, this at least would round-trip without error.

And I agree that if we can plug in that write8bit functionality here, it would be the better and cleaner solution.
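
The Latin-1 approach described in this comment could be sketched as a pair of helpers. This is an illustration only, not fonttools code: `bytes_to_xml_value` and `xml_value_to_bytes` are hypothetical names, and the sample FullName assumes the font stores UTF-8 bytes.

```python
def bytes_to_xml_value(data: bytes) -> str:
    """Hypothetical helper: map raw bytes to a str for an XML attribute.

    Latin-1 decoding never fails because every byte value has a code point.
    """
    return data.decode("latin-1")


def xml_value_to_bytes(value: str) -> bytes:
    """Hypothetical helper: inverse of bytes_to_xml_value.

    Raises UnicodeEncodeError if the value contains code points above U+00FF.
    """
    return value.encode("latin-1")


# Assumed sample: a Korean FullName stored in the font as UTF-8 bytes.
full_name = "\uc774\uc11c\uc724\uccb4".encode("utf-8")

# Dump: bytes -> str for the XML attribute -> UTF-8 bytes in the .ttx file.
xml_value = bytes_to_xml_value(full_name)
on_disk = xml_value.encode("utf-8")

# Compile: UTF-8 bytes from the file -> str -> original bytes.
restored = xml_value_to_bytes(on_disk.decode("utf-8"))
assert restored == full_name
```

One consequence of this design is that the XML holds one character per original byte, so the attribute stays unreadable for multi-byte encodings, and a hand-edited value containing characters above U+00FF would fail on compile.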
