Parsing of trailing TAB works differently for Python and C

This works properly (note the trailing TAB):

>>> from yaml import load, CLoader as Loader, CDumper as Dumper
>>> data = load('"bar"\t', Loader=Loader)

This fails:

>>> from yaml import load, Loader, Dumper
>>> data = load('"bar"\t', Loader=Loader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 142, in parse_implicit_document_start
    if not self.check_token(DirectiveToken, DocumentStartToken,
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<unicode string>", line 1, column 6:
    "bar"	
         ^
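
To see the two outcomes side by side, a small harness can run the same input through each loader and record either the result or the exception type. This is only a sketch: `compare_parsers` is a hypothetical helper, not part of PyYAML, and the `hasattr` check assumes `CLoader` is exposed on the `yaml` module only when PyYAML was built with libyaml bindings.

```python
from typing import Callable, Dict


def compare_parsers(text: str,
                    parsers: Dict[str, Callable[[str], object]]) -> Dict[str, str]:
    """Run each parser on the same input and record its result or its error."""
    outcomes = {}
    for name, parse in parsers.items():
        try:
            outcomes[name] = "ok: %r" % (parse(text),)
        except Exception as exc:  # record the error type instead of aborting
            outcomes[name] = "error: %s" % type(exc).__name__
    return outcomes


if __name__ == "__main__":
    try:
        import yaml
    except ImportError:
        print("PyYAML is not installed; nothing to compare")
    else:
        parsers = {"pure Python": lambda s: yaml.load(s, Loader=yaml.Loader)}
        # Assumption: CLoader is only present when libyaml bindings are available.
        if hasattr(yaml, "CLoader"):
            parsers["libyaml"] = lambda s: yaml.load(s, Loader=yaml.CLoader)
        # Same input as in the report: a quoted scalar followed by a TAB.
        for name, outcome in compare_parsers('"bar"\t', parsers).items():
            print(name, "->", outcome)
```

What it prints depends on the installed PyYAML version and on whether the C extension is present, so no expected output is given here.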

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
ingydotnet commented, Dec 21, 2021

Ah but they are not the very same parser. They are the 2 distinctly different parsers that PyYAML contains. A pure Python one and libyaml. Note that libyaml was originally a direct port from PyYAML, written by the same person.

There are several known places where PyYAML using pure Python and PyYAML using libyaml differ. These are of course bugs, either in the Python code or in libyaml. For the test case you posted, libyaml parses according to the spec, and PyYAML’s pure Python parser has a bug. That’s why I said libyaml is right and PyYAML (the Python code) is wrong.

Note: It was my understanding that you were trying to find out how to interpret the spec, so that you could implement your SnakeYAML Java YAML parser correctly.

1 reaction
ingydotnet commented, Dec 20, 2021

The short answer to your query is that in this case libyaml is right and pyyaml is wrong.

https://play.yaml.io/main/parser?input=ImJhciIJ shows the results of 14 YAML parsers, and PyYAML, Ruamel (fork of PyYAML) and SnakeYAML get this one wrong. The New Reference Parser there is literally generated from the spec productions and therefore is almost always correct in its interpretation. That might be a useful resource for you.

The productions involved are:

Which is spaces and tabs. Put another way, non-indentation whitespace is usually tabs and spaces.
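
As a rough illustration of that point, the toy check below treats whatever follows the closing quote as ignorable only if it consists entirely of the given whitespace characters. It is a sketch, not code from PyYAML or libyaml: `skip_white` and `trailing_is_ignorable` are hypothetical names, the input is assumed to be a single double-quoted scalar without escapes, and reading "white space" as space or tab follows the YAML spec's `s-white` production, which is presumably the one the comment refers to.

```python
def skip_white(text: str, pos: int, white: str = " \t") -> int:
    """Advance pos past any run of characters found in `white`."""
    while pos < len(text) and text[pos] in white:
        pos += 1
    return pos


def trailing_is_ignorable(text: str, white: str) -> bool:
    """Return True if everything after the closing quote is in `white`.

    Toy model: `text` must be one double-quoted scalar with no escapes.
    """
    end_quote = text.index('"', 1)  # position of the closing quote
    return skip_white(text, end_quote + 1, white) == len(text)


doc = '"bar"\t'
print(trailing_is_ignorable(doc, " \t"))  # spaces and tabs -> True
print(trailing_is_ignorable(doc, " "))    # spaces only     -> False
```

A scanner that accepts only spaces at this position is left holding a TAB where it expects the next token, which is consistent with the "cannot start any token" error in the report.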
