pyyaml does not support literals in unicode over codepoint 0xffff

See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=806826

The YAML spec says:

“The allowed character range explicitly excludes the surrogate block #xD800-#xDFFF, DEL #x7F, the C0 control block #x0-#x1F (except for #x9, #xA, and #xD), the C1 control block #x80-#x9F, #xFFFE, and #xFFFF.”

However, PyYAML has chosen to negate that check (listing the allowed ranges rather than the excluded ones) and to apply it only to plane 0. This means that any YAML document containing literal Unicode characters from the higher planes fails to parse, and on output such characters are written in the rather unfriendly \Uxxxxxxxx escape format.
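The parsing half of this can be reproduced without PyYAML, using only the `re` module and the two character classes that appear in the patch below. This is a sketch in Python 3 syntax; the names `PLANE0_ONLY` and `EXTENDED` are made up for illustration and do not exist in PyYAML.

```python
import re

# Character class used by the 3.11 reader: nothing above U+FFFD counts as printable.
PLANE0_ONLY = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

# The same class with the supplementary planes added, as in the patch.
EXTENDED = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')

text = 'smile: \U0001F600'

# The original pattern reports the emoji as a non-printable character...
rejected = PLANE0_ONLY.search(text)
# ...while the extended pattern finds nothing to complain about.
accepted = EXTENDED.search(text) is None
```

A match from the first pattern is what the reader turns into an error, so the document is rejected before parsing starts.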

The attached patch fixes this in a minimally intrusive way, by extending the checks to cover the additional codepoints where appropriate. A better fix would be to use the check as the spec specifies it, but that would be a bigger change.
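For comparison, a check written "as the spec specifies it" might look roughly like the sketch below. It follows the `c-printable` production of YAML 1.2, which allows TAB, LF, CR, U+0020 to U+007E, NEL (U+0085), U+00A0 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. The function name `is_yaml_printable` is hypothetical and is not part of PyYAML.

```python
def is_yaml_printable(ch):
    """Return True if the single character `ch` is in YAML 1.2's printable set."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D, 0x85)   # TAB, LF, CR, NEL
        or 0x20 <= cp <= 0x7E            # printable ASCII
        or 0xA0 <= cp <= 0xD7FF          # BMP below the surrogate block
        or 0xE000 <= cp <= 0xFFFD        # BMP above it, without U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )
```

Working on integer code points rather than string comparisons avoids depending on how the interpreter stores characters above U+FFFF.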

    Index: pyyaml-3.11/lib/yaml/emitter.py
    --- pyyaml-3.11.orig/lib/yaml/emitter.py
    +++ pyyaml-3.11/lib/yaml/emitter.py
    @@ -8,9 +8,13 @@

     __all__ = ['Emitter', 'EmitterError']

    +import sys
    +
     from error import YAMLError
     from events import *

    +has_ucs4 = sys.maxunicode > 0xffff
    +
     class EmitterError(YAMLError):
         pass

    @@ -701,7 +705,8 @@ class Emitter(object):
                     line_breaks = True
                 if not (ch == u'\n' or u'\x20' <= ch <= u'\x7E'):
                     if (ch == u'\x85' or u'\xA0' <= ch <= u'\uD7FF'
    -                        or u'\uE000' <= ch <= u'\uFFFD') and ch != u'\uFEFF':
    +                        or u'\uE000' <= ch <= u'\uFFFD'
    +                        or ((not has_ucs4) or (u'\U00010000' <= ch < u'\U0010ffff'))) and ch != u'\uFEFF':
                         unicode_characters = True
                         if not self.allow_unicode:
                             special_characters = True
    Index: pyyaml-3.11/lib/yaml/reader.py
    --- pyyaml-3.11.orig/lib/yaml/reader.py
    +++ pyyaml-3.11/lib/yaml/reader.py
    @@ -19,7 +19,9 @@ __all__ = ['Reader', 'ReaderError']

     from error import YAMLError, Mark

    -import codecs, re
    +import codecs, re, sys
    +
    +has_ucs4 = sys.maxunicode > 0xffff

     class ReaderError(YAMLError):

    @@ -134,7 +136,10 @@ class Reader(object):
                     self.encoding = 'utf-8'
             self.update(1)

    -    NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
    +    if has_ucs4:
    +        NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')
    +    else:
    +        NON_PRINTABLE = re.compile(u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
         def check_printable(self, data):
             match = self.NON_PRINTABLE.search(data)
             if match:
    Index: pyyaml-3.11/lib3/yaml/emitter.py
    --- pyyaml-3.11.orig/lib3/yaml/emitter.py
    +++ pyyaml-3.11/lib3/yaml/emitter.py
    @@ -698,7 +698,8 @@ class Emitter:
                     line_breaks = True
                 if not (ch == '\n' or '\x20' <= ch <= '\x7E'):
                     if (ch == '\x85' or '\xA0' <= ch <= '\uD7FF'
    -                        or '\uE000' <= ch <= '\uFFFD') and ch != '\uFEFF':
    +                        or '\uE000' <= ch <= '\uFFFD'
    +                        or '\U00010000' <= ch < '\U0010ffff') and ch != '\uFEFF':
                         unicode_characters = True
                         if not self.allow_unicode:
                             special_characters = True
    Index: pyyaml-3.11/lib3/yaml/reader.py
    --- pyyaml-3.11.orig/lib3/yaml/reader.py
    +++ pyyaml-3.11/lib3/yaml/reader.py
    @@ -134,7 +134,7 @@ class Reader(object):
                     self.encoding = 'utf-8'
             self.update(1)

    -    NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')
    +    NON_PRINTABLE = re.compile('[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010ffff]')
         def check_printable(self, data):
             match = self.NON_PRINTABLE.search(data)
             if match:
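
The `has_ucs4` guard appears only in the Python 2 (`lib/`) half of the patch. Python 2 could be built "narrow", with `sys.maxunicode == 0xFFFF`, in which case characters above U+FFFF are stored as surrogate pairs. Since Python 3.3 (PEP 393), `sys.maxunicode` is always `0x10FFFF`, which is presumably why the `lib3/` half adds the range unconditionally. A quick check, assuming a Python 3.3+ interpreter:

```python
import sys

# True on every Python 3.3+ interpreter, and on "wide" Python 2 builds.
has_ucs4 = sys.maxunicode > 0xffff

# On such interpreters a supplementary-plane character is a single code point.
emoji = '\U0001F600'
length = len(emoji)
```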

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Reactions: 3
  • Comments: 20 (8 by maintainers)

Top GitHub Comments

4 reactions
jlevy commented, Nov 24, 2017

Just thought I’d share, in case others are in the same boat: After patching/working around this for a while, we realized https://pypi.python.org/pypi/ruamel.yaml handles higher Unicodes just fine, and it’s worked out well for us.

1 reaction
perlpunk commented, Mar 16, 2019

Fixed by #63
