Parsing of trailing TAB works differently for Python and C
This works properly (note the trailing TAB):
>>> from yaml import load, CLoader as Loader, CDumper as Dumper
>>> data = load('"bar"\t', Loader=Loader)
This fails:
>>> from yaml import load, Loader, Dumper
>>> data = load('"bar"\t', Loader=Loader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/usr/lib/python3/dist-packages/yaml/composer.py", line 35, in get_single_node
    if not self.check_event(StreamEndEvent):
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib/python3/dist-packages/yaml/parser.py", line 142, in parse_implicit_document_start
    if not self.check_token(DirectiveToken, DocumentStartToken,
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib/python3/dist-packages/yaml/scanner.py", line 258, in fetch_more_tokens
    raise ScannerError("while scanning for the next token", None,
yaml.scanner.ScannerError: while scanning for the next token
found character '\t' that cannot start any token
  in "<unicode string>", line 1, column 6:
    "bar"
         ^
Top GitHub Comments
Ah, but they are not the very same parser. They are the two distinct parsers that PyYAML contains: a pure Python one and libyaml. Note that libyaml was originally a direct port from PyYAML, written by the same person.
There are several known places where PyYAML using pure Python and PyYAML using libyaml differ. These are of course bugs, either in the Python code or in libyaml. For the test case you posted, libyaml parses according to the spec, and PyYAML’s Python parser has a bug. That’s why I said libyaml is right and PyYAML (the Python code) is wrong.
Note: It was my understanding that you were trying to find out how to interpret the spec, so that you could implement your SnakeYAML Java YAML parser correctly.
The short answer to your query is that in this case libyaml is right and PyYAML is wrong.
https://play.yaml.io/main/parser?input=ImJhciIJ shows the results of 14 YAML parsers, and PyYAML, Ruamel (fork of PyYAML) and SnakeYAML get this one wrong. The New Reference Parser there is literally generated from the spec productions and therefore is almost always correct in its interpretation. That might be a useful resource for you.
The productions involved are the ones for white space, which is spaces and tabs. Put another way, non-indentation whitespace is usually tabs and spaces.
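To make the "spaces and tabs" point concrete, here is a minimal sketch of how a scanner could treat trailing whitespace after a token. It is not the code of PyYAML or libyaml; the helper names `skip_inline_whitespace` and `only_whitespace_follows` are hypothetical and exist only for illustration. The assumption is simply that both space and tab count as non-indentation white space.

```python
# Illustrative sketch only; not taken from PyYAML or libyaml.
# Assumption: non-indentation white space is a space or a tab.
INLINE_WHITESPACE = " \t"


def skip_inline_whitespace(text: str, pos: int) -> int:
    """Return the index of the first character at or after pos
    that is neither a space nor a tab."""
    while pos < len(text) and text[pos] in INLINE_WHITESPACE:
        pos += 1
    return pos


def only_whitespace_follows(text: str, pos: int) -> bool:
    """True if only spaces/tabs remain before the end of the line
    or the end of the input."""
    pos = skip_inline_whitespace(text, pos)
    return pos == len(text) or text[pos] in "\r\n"


doc = '"bar"\t'
end_of_scalar = doc.index('"', 1) + 1  # index just past the closing quote
print(only_whitespace_follows(doc, end_of_scalar))  # True
```

A scanner that skipped only spaces at this point would stop on the tab and then try to start a new token there, which is consistent with the "cannot start any token" error reported above.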