Empties in Earley - predictable Transformer arguments
Earley, as of the upcoming 0.7b, will somewhat correctly handle empty tokens. The issue, though, is how rules that match empty should be represented in the output tree, especially with regard to visitors and [in-line] transformers; some thoughts for discussion follow. I’m testing some potential patches here that I think make matching empty a lot simpler with regard to Transformers.
I use in-line Transformers heavily, as they make life easier. Take a simple CSV parser (okay, no one writes a CSV parser like this, but hopefully it demonstrates the use case):
transformer:
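from lark import Transformer, v_args

class User(object):
    def __init__(self, first_name, last_name, address):
        self.first_name = first_name
        self.last_name = last_name
        self.address = address

# v_args is a decorator factory; inline=True passes children as positional arguments
@v_args(inline=True)
class MyTransformer(Transformer):
    # plain functions on the class are called as methods, so self comes first
    to_list = lambda self, *i: list(i)
    # a class used as a callback is called with the children only
    user = User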
grammar:
_CS: _COMMA _SPACE?
?first_name: /[a-zA-Z]+/
?last_name: /[a-zA-Z]+/
?address: /[a-zA-Z0-9]+/
users: user* -> to_list
user: first_name _CS last_name _CS address _EOL
input:
Bob,Smith,123MadeUpstreet
The ? ensures the individual items are inlined, and the result of this parse is a list of users with very little code. However, the problem comes if I might have empty fields:
user: first_name? _CS last_name? _CS address?
input: ,Smith,123MadeUpstreet
When the grammar is modified this way, with the argument inliner and SingleExpander as they stand, empty arguments disappear. This breaks the constructor, which for the sample above is now called with:
User('Smith', '123MadeUpstreet')  # the empty first_name has disappeared, so Python raises a TypeError
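The failure can be reproduced without any parser at all. The sketch below is only an illustration: the `children` list is a hand-written stand-in for what the inlined transformer would pass once the empty field has been dropped, not actual Lark output.

```python
class User:
    def __init__(self, first_name, last_name, address):
        self.first_name = first_name
        self.last_name = last_name
        self.address = address

# Hand-written stand-in for the children of `user` after the empty
# first_name has vanished: two values remain where three are expected.
children = ["Smith", "123MadeUpstreet"]

try:
    User(*children)
except TypeError as exc:
    # One positional argument is missing, so the call never succeeds.
    error = exc
else:
    error = None

print(type(error).__name__, error)
```

Even if the call did not fail, 'Smith' would land in first_name, which is why simply giving the constructor default values does not fix the problem.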
This could be modified a few ways, but today there doesn’t seem to be a decent way to keep the args consistent so you end up with much more complicated parsing in the constructors. Viz, allowing the individual items to match empty:
?first_name: [ /[a-zA-Z]+/ ]
?last_name: [ /[a-zA-Z]+/ ]
?address: [ /[a-zA-Z0-9]+/ ]
users: user* -> to_list
user: first_name _CS last_name _CS address _EOL
Now in this case, each field can match empty. However, because the SingleExpander checks explicitly for len(children) == 1, I get: User(Tree('first_name', []), 'Smith', '123MadeUpstreet')
Again, this sucks. What I’d like to have is this: User(None, 'Smith', '123MadeUpstreet')
I guess this is achievable with the horrible (untested, but should work):
...
class MyTransformer(Transformer):
    ...
    # called as a method, so it must accept self
    to_none = lambda self: None
grammar:
?first_name: /[a-zA-Z]+/
           | -> to_none
?last_name: /[a-zA-Z]+/
          | -> to_none
?address: /[a-zA-Z0-9]+/
        | -> to_none
This makes my arguments to the Transformer consistent and predictable, and makes it MUCH easier to write grammars/transformers.
There are a couple of ways to implement this:
- Change the SingleExpander to return None when in-lining empty trees (would affect LALR and Earley)
- Change the Earley parser to use a specific Empty token when matching empty, which is how it was originally designed. Earley Empty () eventually became Python ‘None’ during implementation, which was probably a bad choice, as None is also used elsewhere in the Forest to mean No Value. This would only impact Earley.
- Do either/both of the above and make it optional behind a ‘empty_matches_none’ flag or similar. The problem is I hate having lots of options; and I kind of wonder if empty_matches_none should be the default? Is ‘None’ a better choice for expanding an empty tree?
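To make the first option concrete, here is a minimal sketch of the inlining rule it describes. It is not Lark's SingleExpander: `Node` is a stand-in for a tree node, `expand_single` is a hypothetical helper, and the `empty_matches_none` parameter simply mirrors the flag name floated above.

```python
from collections import namedtuple

# Stand-in for a parse-tree node; this is NOT Lark's Tree class.
Node = namedtuple("Node", ["data", "children"])

def expand_single(node, empty_matches_none=True):
    """Hypothetical inlining rule for a `?rule` node."""
    if len(node.children) == 1:
        # Behaviour described in the issue: a single child replaces the node.
        return node.children[0]
    if empty_matches_none and not node.children:
        # Proposed behaviour: an empty match becomes None.
        return None
    # Anything else is left as a node.
    return node

fields = [
    Node("first_name", []),
    Node("last_name", ["Smith"]),
    Node("address", ["123MadeUpstreet"]),
]
args = [expand_single(f) for f in fields]
print(args)
```

With the flag off, the empty first_name node is passed through unchanged, which corresponds to the Tree('first_name', []) argument complained about earlier.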
I would appreciate thoughts and feedback. To me, allowing more predictable arguments with the InlineTransformer makes for simpler parsers in Python, so this would be a big win. I can’t visualise any bad impacts right now, but I don’t use LALR, so I can’t comment there.
Issue Analytics
- Created 5 years ago
- Comments: 12 (7 by maintainers)
Top GitHub Comments
@qorrect See maybe_placeholders in https://lark-parser.readthedocs.io/en/latest/classes.html
So can I use the * rule today in 2022 to guarantee a None and keep the arguments consistent?
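As a rough illustration of the maybe_placeholders option mentioned above: it applies to the [] operator, which then yields None when nothing is matched. The sketch assumes a Lark version that supports the option; the grammar, the `User` class and the `parse_user` helper are made up for this example and cover a single row only.

```python
GRAMMAR = r"""
user: [NAME] "," [NAME] "," [ADDRESS]
NAME: /[a-zA-Z]+/
ADDRESS: /[a-zA-Z0-9]+/
"""

class User:
    def __init__(self, first_name, last_name, address):
        self.first_name = first_name
        self.last_name = last_name
        self.address = address

def parse_user(line):
    # Imported lazily so the sketch can be loaded without Lark installed.
    from lark import Lark
    parser = Lark(GRAMMAR, start="user", maybe_placeholders=True)
    # The "," tokens are anonymous and filtered out, so exactly three
    # children remain; an unmatched [..] shows up as None.
    first_name, last_name, address = parser.parse(line).children
    return User(first_name, last_name, address)
```

Calling parse_user(",Smith,123MadeUpstreet") should give a User whose first_name is None while the other two fields keep their positions. Whether the same guarantee holds for other operators is not something this sketch demonstrates.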