JSON benchmarks revisited

Hi, I’m excited by Lark’s speed claims but I’m having trouble replicating them. This GitHub issue is more informational than an actual bug report, but I’d be happy to hear your thoughts.

I wrote a JSON parser (below) with Parsimonious that is faster than the one you benchmarked with (by @reclosedev). The speedup comes from some pattern optimizations I’d found previously (https://github.com/erikrose/parsimonious/issues/119); mine also handles floats better and parses empty objects ({}) and arrays ([]). My NodeVisitor is based on the one by @reclosedev. When I run both Parsimonious parsers and Lark’s (each with and without a NodeVisitor/Transformer) over a small JSON example 10000 times, I get these times:

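| implementation          | time (tree only) | time (transformed) |
|-------------------------|------------------|--------------------|
| parsimonious (original) | 4.33s            | 8.08s              |
| parsimonious (mine)     | 2.31s            | 4.31s              |
| lark                    | 3.41s            | 3.52s              |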

With transformation, the comparison is roughly in line with the claimed speeds, but the gap is much smaller with my optimized Parsimonious parser, and mine even beats Lark when there is no tree transformation. That result has limited utility, but it also shows how slow Parsimonious’s NodeVisitor is compared to Lark’s tree-less parsing. This may be because Parsimonious builds the tree and then runs the transformation as a separate pass, so the numbers may not be directly comparable.
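
The timing loop described above (one small document parsed 10000 times) can be sketched with the standard `timeit` module. This is a minimal sketch, not the script used for the numbers in this issue: `bench` and `SAMPLE` are hypothetical names, and the stdlib `json.loads` stands in for the parser under test. Any callable that accepts the text, such as a grammar's `parse` method, could be passed instead.

```python
import json
import timeit

# Compact copy of the test case from this issue.
SAMPLE = (
    '{"bool": [true, false], '
    '"number": {"float": -0.14e3, "int": 1}, '
    '"other": {"string": "string", "unicode": "あ", "null": null}}'
)


def bench(parse, text, repeat=10000):
    """Return the total seconds spent calling parse(text) `repeat` times."""
    return timeit.timeit(lambda: parse(text), number=repeat)


# json.loads is only a stand-in; pass the parser under test instead.
elapsed = bench(json.loads, SAMPLE)
print(f"json.loads: {elapsed:.3f}s for 10000 parses")
```

Measuring the tree-only and transformed variants with the same helper keeps the loop overhead identical across implementations, so differences come from the parsers themselves.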

For further comparison, I wrote a Cython-based recursive descent parser with highly tuned pattern recognition but no grammar optimizations, and got the times below, shown next to Python’s C-based json module. Note that it does not create a tree, but it can ‘scan’ without transforming.

| implementation | time (scan only) | time (transformed) |
|----------------|------------------|--------------------|
| cython         | 0.057s           | 0.177s             |
| json           | n/a              | 0.052s             |

This perhaps lends support to #114. I found that creating Python objects (even just tuples) slows things down a lot. With Cython I could avoid the dynamic type-checking and use simple structs, which look like tuples in Python and are much faster to create.

It could be that my use case of parsing a small JSON object many times instead of one giant one has a different performance profile. Also, it’s worth noting that recursive descent would use a ton of memory to parse large objects, so speed isn’t everything.
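
One concrete form of that cost is call-stack depth: a recursive descent parser uses one stack frame per nesting level. The toy parser below is a hypothetical illustration and not the Cython parser from this issue; `parse_array` only understands nested arrays and assumes well-formed, whitespace-free input. It shows that pure-Python recursion runs into the interpreter's recursion limit on deeply nested documents.

```python
def parse_array(text, pos=0):
    """Parse a nested array such as "[[],[[]]]"; return (value, next_pos).

    Each nesting level costs one Python stack frame.
    Truncated input raises IndexError; other bad input raises ValueError.
    """
    if text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos += 1
    items = []
    if text[pos] == "]":          # empty array
        return items, pos + 1
    while True:
        item, pos = parse_array(text, pos)   # recurse for the nested array
        items.append(item)
        if text[pos] == ",":
            pos += 1
        elif text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"unexpected character at position {pos}")


# Shallow input parses normally.
value, end = parse_array("[[],[[]]]")

# 100000 nested arrays exceed CPython's default recursion limit (1000).
depth = 100_000
deep = "[" * depth + "]" * depth
try:
    parse_array(deep)
    too_deep = False
except RecursionError:
    too_deep = True
print("hit the recursion limit:", too_deep)
```

Wide but shallow documents, like the test case in this issue, do not trigger this; only nesting depth does.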

Here is my Parsimonious-based parser:

    import ast

    from parsimonious.grammar import Grammar
    from parsimonious.nodes import NodeVisitor

    Json = Grammar(r'''
        Start    = Object / Array
        Object   = ~"{\s*" Members? ~"\s*}"
        Members  = MappingComma* Mapping
        MappingComma = Mapping ~"\s*,\s*"
        Mapping  = DQString ~"\s*:\s*" Value
        Array    = ~"\[\s*" Items? ~"\s*\]"
        Items    = ValueComma* Value
        ValueComma = Value ~"\s*,\s*" 
        Value    = Object / Array / DQString
                 / TrueVal / FalseVal / NullVal / Float / Integer
        TrueVal  = "true"
        FalseVal = "false"
        NullVal  = "null"
        DQString = ~"\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""
        Float    = ~"[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?"
        Integer  = ~"[-+]?\d+"
    ''')

    class JsonVisitor(NodeVisitor):
        """ Produces Python objects from parsed JSON grammar tree
        """
        def generic_visit(self, node, visited_children):
            return visited_children or node

        # helper functions for generic patterns
        def combine_many_or_one(self, node, children):
            """ Usable for following pattern:
    
                values = value_and_comma* value
            """
            members, member = children
            if isinstance(members, list):
                return members + [member]
            return [member]

        def lift_first_child(self, node, visited_children):
            """ Returns first child from `visited_children`, e.g. for::
    
                rule = item optional another_optional?
    
            returns `item`
            """
            return visited_children[0]

        # visitors
        visit_Start = visit_Value = visit_MappingComma = visit_ValueComma = lift_first_child
        visit_Members = combine_many_or_one

        def visit_Object(self, node, children):
            _, members, _ = children
            if isinstance(members, list):
                members = members[0]
            else:
                members = []
            return dict(members)

        def visit_Array(self, node, children):
            _, values, _ = children
            if isinstance(values, list):
                values = values[0]
            else:
                values = []
            return values

        def visit_Mapping(self, node, children):
            key, _, value = children
            return key, value

        def visit_DQString(self, node, visited_children):
            # produce unicode for strings
            return ast.literal_eval("u" + node.text)

        def visit_Float(self, node, visited_children):
            return float(node.text)

        def visit_Integer(self, node, visited_children):
            return int(node.text)

        def visit_TrueVal(self, node, visited_children):
            return True

        def visit_FalseVal(self, node, visited_children):
            return False

        def visit_NullVal(self, node, visited_children):
            return None
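
The terminal patterns in the grammar above can be checked in isolation with the standard `re` module. This is a sketch under two assumptions: the `DQSTRING` text below is what the grammar's string literal is assumed to become once Parsimonious unescapes it, and `first_matching` is a hypothetical helper that imitates PEG ordered choice for `Float / Integer` by trying each pattern in order at the start of the text.

```python
import re

# Assumed form of the DQString regex after the grammar literal is unescaped.
DQSTRING = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')
# Float and Integer are copied verbatim from the grammar.
FLOAT = re.compile(r"[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?")
INTEGER = re.compile(r"[-+]?\d+")

# Same order as in the Value rule: Float is tried before Integer.
ORDERED_CHOICE = [("Float", FLOAT), ("Integer", INTEGER)]


def first_matching(text):
    """Return the name of the first pattern that matches at position 0."""
    for name, pattern in ORDERED_CHOICE:
        if pattern.match(text):
            return name
    return None


# Exponent notation from the test case is accepted as a float.
print(first_matching("-0.14e3"))
# A plain integer is also matched by Float, which is tried first.
print(first_matching("1"))
# Escaped quotes inside a string are consumed as part of the string.
print(bool(DQSTRING.fullmatch(r'"a \"quoted\" word"')))
```

Because `Float` also matches plain integers and comes first in `Value`, this suggests that the `"int": 1` entry of the test case is handled by `visit_Float` rather than `visit_Integer`, so it would come back as `1.0`.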

The Lark-based one is taken from the tutorial with these invocations:

    from lark import Lark
    # json_grammar and TreeToJson are the ones defined in the Lark JSON tutorial
    Json1 = Lark(json_grammar, parser='lalr', lexer='standard')
    Json2 = Lark(json_grammar, parser='lalr', lexer='standard', transformer=TreeToJson())

Finally, here is my test case:

    {
        "bool": [
            true,
            false
        ],
        "number": {
            "float": -0.14e3,
            "int": 1
        },
        "other": {
            "string": "string",
            "unicode": "あ",
            "null": null
        }
    }

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

erezsh commented, Sep 2, 2018 (1 reaction)

> I wonder if it’s possible to automatically combine patterns that will be thrown away

Yes, it is. At least in some cases. This is one of the advantages of having terminals defined as anonymous in the grammar itself, instead of being filtered out by the user: The parser can merge them without worrying about the structure.

eerimoq commented, Sep 1, 2018 (1 reaction)

@erezsh Parse tree transformations are not yet part of the textparser package, but I implemented it as a separate step after the parsing. Parse tree transformations will become part of the package once there is a need for it. Today only one package uses textparser, and there is no immediate need for transformations in that package.

Here is the JSON parser: https://github.com/eerimoq/textparser/blob/master/examples/json.py
