[ttx] Non-ascii in CFF FullName string doesn't round-trip
A font mentioned in #2897 uses non-ASCII characters in the CFF FullName field. These are dumped to .ttx with an unclear encoding, and reading the dumped .ttx back fails on the non-ASCII characters:
Compiling "/Users/just/Downloads/LeeSeoyun.ttx" to "/Users/just/Downloads/LeeSeoyun#1.otf"...
Parsing 'GlyphOrder' table...
Parsing 'head' table...
Parsing 'hhea' table...
Parsing 'maxp' table...
Parsing 'OS/2' table...
Parsing 'name' table...
Parsing 'cmap' table...
Parsing 'post' table...
Parsing 'CFF ' table...
ERROR: Unhandled exception has occurred
Traceback (most recent call last):
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 405, in main
    process(jobs, options)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 387, in process
    action(input, output, options)
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/loggingTools.py", line 372, in wrapper
    return func(*args, **kwds)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttx.py", line 298, in ttCompile
    ttf.importXML(input)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttLib/ttFont.py", line 349, in importXML
    reader.read()
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 47, in read
    self._parseFile(self.file)
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 72, in _parseFile
    parser.Parse(chunk, 0)
  File "/Users/sysadmin/build/v3.10.5/Modules/pyexpat.c", line 470, in EndElement
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/xmlReader.py", line 155, in _endElementHandler
    self.currentTable.fromXML(name, attrs, content, self.ttFont)
  File "/Users/just/code/git/fonttools/Lib/fontTools/ttLib/tables/C_F_F_.py", line 46, in fromXML
    self.cff.fromXML(name, attrs, content, otFont)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 346, in fromXML
    topDict.fromXML(name, attrs, content)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 2618, in fromXML
    value = conv.xmlRead(name, attrs, content, self)
  File "/Users/just/code/git/fonttools/Lib/fontTools/cffLib/__init__.py", line 1358, in xmlRead
    return tobytes(attrs["value"], encoding=("ascii"))
  File "/Users/just/code/git/fonttools/Lib/fontTools/misc/textTools.py", line 131, in tobytes
    return s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11: ordinal not in range(128)
OTMaster somehow manages to read this field as a Korean string (which makes sense, given the font), yet fonttools reads it as gibberish: 'ì\x9d´ì\x84\x9cì\x9c¤ì²´'.
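The gibberish is consistent with UTF-8 bytes that were decoded as Latin-1. A minimal sketch using only the Python standard library (no fontTools calls; the variable names are illustrative) that re-interprets the string from the report and reproduces the ASCII failure seen in the traceback:

```python
# The string fonttools reports, written with escapes so the source stays ASCII.
# It is 12 characters long, matching "position 0-11" in the error message.
garbled = "\xec\x9d\xb4\xec\x84\x9c\xec\x9c\xa4\xec\xb2\xb4"

# Latin-1 maps code points 0-255 one-to-one onto bytes, so this recovers
# the raw bytes stored in the font.
raw = garbled.encode("latin-1")

# Assumption: the font stored the name as UTF-8. Decoding the same bytes as
# UTF-8 yields four Hangul syllables instead of 12 Latin-1 characters.
decoded = raw.decode("utf-8")
print(decoded)

# Encoding the garbled string as ASCII fails in the same way as the
# tobytes(..., encoding="ascii") call at the bottom of the traceback.
try:
    garbled.encode("ascii")
except UnicodeEncodeError as exc:
    failed_span = (exc.start, exc.end)
    print(exc)
```

This only shows what the bytes look like under a UTF-8 interpretation. The CFF table itself carries no encoding information, so UTF-8 remains a guess.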
Top GitHub Comments
The current version should make sure that it contains ASCII only.
We start with a byte string without knowing the encoding. Converting this to Latin-1 can’t fail. Encoding the converted Latin-1 string as utf-8 can’t fail either. This works backwards as well. Yes, the XML will contain gibberish, but it does round-trip:
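A minimal sketch of the round trip described above, in plain Python. The helper names `dump_value` and `read_value` are hypothetical and not part of fontTools, and the sketch models only the string conversions, not XML escaping of control characters:

```python
def dump_value(raw: bytes) -> str:
    """Turn bytes of unknown encoding into text for the XML dump.

    Latin-1 assigns a character to every byte value, so this cannot fail.
    """
    return raw.decode("latin-1")


def read_value(text: str) -> bytes:
    """Invert dump_value: map each character back to its original byte."""
    return text.encode("latin-1")


samples = [
    b"Plain ASCII name",
    "\uc774\uc11c\uc724\uccb4".encode("utf-8"),  # Korean name stored as UTF-8
    bytes(range(256)),                           # every possible byte value
]

for raw in samples:
    text = dump_value(raw)
    # The XML file is written as UTF-8; any str can be encoded that way
    # and decoded back unchanged.
    on_disk = text.encode("utf-8")
    restored = read_value(on_disk.decode("utf-8"))
    assert restored == raw

round_trip_ok = all(read_value(dump_value(raw)) == raw for raw in samples)
print(round_trip_ok)
```

The text in the dump is still gibberish for non-ASCII input, but the bytes come back unchanged.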
Not saying this is a nice solution, but given there is no encoding information at all, and given any non-ascii in FullName can be considered broken, this at least would round-trip without error.
And I agree that if we can plug in that write8bit functionality here, it would be the better and cleaner solution.