JSON benchmarks revisited
Hi, I’m excited by Lark’s speed claims, but I’m having trouble replicating them. This issue is more informational than an actual bug report, but I’d be happy to hear your thoughts.
I wrote a JSON parser (below) with Parsimonious that was faster than the one you benchmarked with (by @reclosedev); mine also parses floats better and handles empty objects (`{}`) and arrays (`[]`), based on some pattern optimizations I’d found previously (https://github.com/erikrose/parsimonious/issues/119). My NodeVisitor is based on the one by @reclosedev. When I run my parser and Lark’s (with and without a NodeVisitor/Transformer) over a small JSON example 10,000 times, I get these times:
| implementation | time (tree only) | time (transformed) |
|---|---|---|
| parsimonious (original) | 4.33s | 8.08s |
| parsimonious (mine) | 2.31s | 4.31s |
| lark | 3.41s | 3.52s |
The comparison with transformation is roughly in line with the claimed speeds, but the gap is much smaller with my optimized Parsimonious parser, and mine even beats Lark when there is no tree transformation. That has limited utility, but it also shows how slow Parsimonious’s NodeVisitor is compared to Lark’s tree-less parsing; Parsimonious always builds the tree and then transforms it, so the numbers may not be directly comparable.
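For reference, a timing loop of this shape reproduces the setup (a sketch only: `bench` and `sample` are hypothetical stand-ins, and the stdlib `json` module is used here in place of the Parsimonious/Lark parsers under test):

```python
import json
import timeit

# Hypothetical stand-ins: `sample` is a small document, and json.loads
# takes the place of whichever parser is being benchmarked.
sample = '{"bool": [true, false], "int": 1}'

def bench(parse, number=10000):
    # Time `number` repeated parses of the same small document.
    return timeit.timeit(lambda: parse(sample), number=number)

elapsed = bench(json.loads)
print(f"json.loads x 10000: {elapsed:.3f}s")
```

Each row in the table above corresponds to one such call with the relevant parse or transform function.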
For further comparison, I wrote a Cython-based recursive-descent parser with highly tuned pattern recognition but no grammar optimizations, and get these times compared to Python’s C-based `json` module. Note that it does not create a tree, but it can ‘scan’ without transforming.
| implementation | time (scan only) | time (transformed) |
|---|---|---|
| cython | 0.057s | 0.177s |
| json | n/a | 0.052s |
This perhaps lends support to #114. I found that creating Python objects (even just tuples) slows down things a lot. With Cython I could avoid the dynamic type-checking and use simple structs, which look like tuples in Python, and they are much faster to create.
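The allocation cost is easy to observe even in pure Python. The following micro-benchmark is illustrative only (`scan_only` and `build_tuples` are made-up names): it compares a pass over the data that allocates no result objects with one that builds a tuple per element, as a parse tree of Python objects would.

```python
import timeit

data = list(range(1000))

def scan_only(xs):
    # Walk the data without allocating any result objects.
    total = 0
    for x in xs:
        total += x
    return total

def build_tuples(xs):
    # Same walk, but allocate a 2-tuple per element, roughly what
    # building a tree of Python objects costs.
    return [(x, x + 1) for x in xs]

t_scan = timeit.timeit(lambda: scan_only(data), number=1000)
t_build = timeit.timeit(lambda: build_tuples(data), number=1000)
print(f"scan: {t_scan:.3f}s  build: {t_build:.3f}s")
```

Exact numbers vary by machine, but the object-building pass carries the allocation overhead that a Cython struct avoids.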
It could be that my use case of parsing a small JSON object many times instead of one giant one has a different performance profile. Also, it’s worth noting that recursive descent would use a ton of memory to parse large objects, so speed isn’t everything.
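For anyone who wants to experiment without setting up Cython, a recursive-descent JSON parser in the same spirit can be sketched in pure Python. This is my illustrative reconstruction, not the Cython code discussed above; string unescaping is delegated to `json.loads` for brevity, and all names are made up.

```python
import json
import re

_WS = re.compile(r'[ \t\r\n]*')
_NUM = re.compile(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][-+]?\d+)?')
_STR = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

def _skip(s, i):
    # Advance past any whitespace.
    return _WS.match(s, i).end()

def _value(s, i):
    # Dispatch on the first character of the value.
    c = s[i]
    if c == '{':
        return _object(s, i)
    if c == '[':
        return _array(s, i)
    if c == '"':
        m = _STR.match(s, i)
        return json.loads(m.group()), m.end()
    if s.startswith('true', i):
        return True, i + 4
    if s.startswith('false', i):
        return False, i + 5
    if s.startswith('null', i):
        return None, i + 4
    m = _NUM.match(s, i)
    text = m.group()
    num = float(text) if any(ch in text for ch in '.eE') else int(text)
    return num, m.end()

def _object(s, i):
    out = {}
    i = _skip(s, i + 1)          # past '{'
    if s[i] == '}':
        return out, i + 1
    while True:
        m = _STR.match(s, i)     # key
        key = json.loads(m.group())
        i = _skip(s, m.end())
        i = _skip(s, i + 1)      # past ':'
        out[key], i = _value(s, i)
        i = _skip(s, i)
        if s[i] == ',':
            i = _skip(s, i + 1)
        else:
            return out, i + 1    # past '}'

def _array(s, i):
    out = []
    i = _skip(s, i + 1)          # past '['
    if s[i] == ']':
        return out, i + 1
    while True:
        v, i = _value(s, i)
        out.append(v)
        i = _skip(s, i)
        if s[i] == ',':
            i = _skip(s, i + 1)
        else:
            return out, i + 1    # past ']'

def parse(text):
    value, _ = _value(text, _skip(text, 0))
    return value
```

It handles the test case below, including empty containers, but does no error reporting; a production parser would need that and bounds checking.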
Here is my Parsimonious-based parser:
```python
import ast

from parsimonious import Grammar, NodeVisitor

Json = Grammar(r'''
Start = Object / Array
Object = ~"{\s*" Members? ~"\s*}"
Members = MappingComma* Mapping
MappingComma = Mapping ~"\s*,\s*"
Mapping = DQString ~"\s*:\s*" Value
Array = ~"\[\s*" Items? ~"\s*\]"
Items = ValueComma* Value
ValueComma = Value ~"\s*,\s*"
Value = Object / Array / DQString
      / TrueVal / FalseVal / NullVal / Float / Integer
TrueVal = "true"
FalseVal = "false"
NullVal = "null"
DQString = ~"\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""
Float = ~"[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?"
Integer = ~"[-+]?\d+"
''')


class JsonVisitor(NodeVisitor):
    """Produces Python objects from a parsed JSON grammar tree."""

    def generic_visit(self, node, visited_children):
        return visited_children or node

    # helper functions for generic patterns

    def combine_many_or_one(self, node, children):
        """Usable for the following pattern::

            values = value_and_comma* value
        """
        members, member = children
        if isinstance(members, list):
            return members + [member]
        return [member]

    def lift_first_child(self, node, visited_children):
        """Returns the first child from `visited_children`, e.g. for::

            rule = item optional another_optional?

        returns `item`.
        """
        return visited_children[0]

    # visitors

    visit_Start = visit_Value = visit_MappingComma = visit_ValueComma = lift_first_child
    visit_Members = combine_many_or_one

    def visit_Object(self, node, children):
        _, members, _ = children
        if isinstance(members, list):
            members = members[0]
        else:
            members = []
        return dict(members)

    def visit_Array(self, node, children):
        _, values, _ = children
        if isinstance(values, list):
            values = values[0]
        else:
            values = []
        return values

    def visit_Mapping(self, node, children):
        key, _, value = children
        return key, value

    def visit_DQString(self, node, visited_children):
        # produce unicode for strings
        return ast.literal_eval("u" + node.text)

    def visit_Float(self, node, visited_children):
        return float(node.text)

    def visit_Integer(self, node, visited_children):
        return int(node.text)

    def visit_TrueVal(self, node, visited_children):
        return True

    def visit_FalseVal(self, node, visited_children):
        return False

    def visit_NullVal(self, node, visited_children):
        return None
```
The Lark-based one is taken from the tutorial, with these invocations:

```python
Json1 = Lark(json_grammar, parser='lalr', lexer='standard')
Json2 = Lark(json_grammar, parser='lalr', lexer='standard', transformer=TreeToJson())
```
Finally, here is my test case:

```json
{
    "bool": [
        true,
        false
    ],
    "number": {
        "float": -0.14e3,
        "int": 1
    },
    "other": {
        "string": "string",
        "unicode": "あ",
        "null": null
    }
}
```
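For reference, the stdlib `json` module gives the expected transformed result for this document, which any of the parsers above should reproduce (`sample` below is the same test case, condensed):

```python
import json

sample = '''{
    "bool": [true, false],
    "number": {"float": -0.14e3, "int": 1},
    "other": {"string": "string", "unicode": "あ", "null": null}
}'''

expected = json.loads(sample)
print(expected["number"]["float"])   # -140.0
print(expected["other"]["unicode"])  # あ
```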
Issue Analytics: created 5 years ago; 10 comments (5 by maintainers).
Top GitHub Comments
Yes, it is. At least in some cases. This is one of the advantages of having terminals defined as anonymous in the grammar itself, instead of being filtered out by the user: The parser can merge them without worrying about the structure.
@erezsh Parse tree transformations are not yet part of the textparser package, but I implemented it as a separate step after the parsing. Parse tree transformations will become part of the package once there is a need for it. Today only one package uses textparser, and there is no immediate need for transformations in that package.
Here is the JSON parser: https://github.com/eerimoq/textparser/blob/master/examples/json.py