Empties in Earley - predictable Transformer arguments

Earley, as of the upcoming 0.7b, will handle empty tokens somewhat correctly. This issue, though, is about how rules that match empty should be represented in the output tree, especially with regard to visitors and [in-line] transformers, plus some thoughts for discussion. I’m testing some potential patches here that I think make matching empty a lot simpler with regard to Transformers.

I use in-line Transformers heavily, as they make life easier. Take a simple CSV parser (okay, no one writes a CSV parser like this, but hopefully it demonstrates the use case):

transformer:

from lark import Transformer, v_args

class User(object):
  def __init__(self, first_name, last_name, address):
    self.first_name = first_name
    self.last_name = last_name
    self.address = address

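@v_args(inline=True)  # pass each child as a separate positional argument
class MyTransformer(Transformer):
  to_list = lambda self, *i: list(i)  # a function attribute is bound, so it also receives self
  user = User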

grammar:

_CS: _COMMA _SPACE?
?first_name: /[a-zA-Z]+/
?last_name: /[a-zA-Z]+/
?address: /[a-zA-Z0-9]+/
users: user* -> to_list
user: first_name _CS last_name _CS address _EOL

input:

Bob,Smith,123MadeUpstreet

The ? ensures the individual items are inlined, and the result of this parse is a list of users with very little code. However, a problem arises when fields can be empty:

user: first_name? _CS last_name? _CS address?

input:

,Smith,123MadeUpstreet

With the grammar modified this way, and with the argument inliner and SingleExpander as they stand, empty arguments disappear. This breaks the constructor, which for the sample above is now called with:

User('Smith', '123MadeUpstreet')  # the empty first_name has disappeared, so Python raises a TypeError
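The failure can be reproduced without a parser. The sketch below assumes only that an in-line transformer calls its callback as callback(*children); call_inline is a hypothetical stand-in for that dispatch, not part of Lark.

```python
class User(object):
    def __init__(self, first_name, last_name, address):
        self.first_name = first_name
        self.last_name = last_name
        self.address = address


def call_inline(callback, children):
    """Hypothetical stand-in for in-line dispatch: children become positional args."""
    return callback(*children)


# All three fields matched: the constructor receives three arguments.
full = call_inline(User, ["Bob", "Smith", "123MadeUpstreet"])

# The empty first_name produced no child, so only two arguments arrive.
try:
    call_inline(User, ["Smith", "123MadeUpstreet"])
    error = None
except TypeError as exc:
    error = exc  # a required positional argument is missing
```

Giving the constructor default values would not help either: the remaining children shift left, so 'Smith' would silently be stored as first_name.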

The grammar could be modified in a few ways, but today there doesn’t seem to be a decent way to keep the arguments consistent, so you end up with much more complicated parsing in the constructors. For example, allowing the individual items to match empty:

?first_name: [ /[a-zA-Z]+/ ]
?last_name: [ /[a-zA-Z]+/ ]
?address: [ /[a-zA-Z0-9]+/ ]
users: user* -> to_list
user: first_name _CS last_name _CS address _EOL

Now each field can match empty. However, because the SingleExpander checks explicitly for len(children) == 1, I get:

User(Tree('first_name', []), 'Smith', '123MadeUpstreet')

Again, this sucks. What I’d like to have is this:

User(None, 'Smith', '123MadeUpstreet')

I guess this is achievable with the horrible (untested, but should work):

transformer:

...
class MyTransformer(Transformer):
...
   to_none = lambda self: None  # takes self, since the attribute is bound like a method

grammar:

?first_name: /[a-zA-Z]+/
           | -> to_none
?last_name: /[a-zA-Z]+/
          | -> to_none
?address: /[a-zA-Z0-9]+/
        | -> to_none

Getting None for an empty match makes my arguments to the Transformer consistent and predictable, and makes it MUCH easier to write grammars/transformers.

There are a couple of ways to implement this:

Change the SingleExpander to return None when in-lining empty trees (this would affect LALR and Earley).
Change the Earley parser to use a specific Empty token when matching empty, which is how it was originally designed. Earley Empty () eventually became Python’s None during implementation, which was probably a bad choice, as None is also used elsewhere in the Forest to mean “no value”. This would only affect Earley.
Do either or both of the above and make it optional behind an empty_matches_none flag or similar. The problem is that I hate having lots of options, and I wonder whether empty_matches_none should be the default. Is None a better choice for expanding an empty tree?
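As a rough illustration of the first option combined with the flag from the third, the sketch below uses a hypothetical Node class and expand_single function. These are stand-ins for a parse-tree node and the single-child expansion step, not Lark’s actual SingleExpander; only the flag name empty_matches_none comes from the proposal above.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node: rule name plus children."""
    data: str
    children: List[Any] = field(default_factory=list)


def expand_single(node, empty_matches_none=False):
    """Inline a node that was marked with `?` in the grammar.

    Exactly one child: return the child itself.
    No children: return None when the flag is set, otherwise keep the node.
    """
    if len(node.children) == 1:
        return node.children[0]
    if empty_matches_none and not node.children:
        return None
    return node


empty = Node("first_name")
kept = expand_single(empty)                               # the empty node survives
as_none = expand_single(empty, empty_matches_none=True)   # replaced by None
inlined = expand_single(Node("last_name", ["Smith"]))     # the single child
```

With the flag off, the empty node reaches the callback unchanged, which is the Tree('first_name', []) case described earlier; with it on, the callback always sees either a value or None in that position.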

I would appreciate thoughts and feedback. To me, allowing more predictable arguments with the InlineTransformer makes for simpler parsers in Python, so this would be a big win. I can’t see any bad impacts right now, but I don’t use LALR, so I can’t comment there.

Issue Analytics

Created 5 years ago
Comments: 12 (7 by maintainers)

Top GitHub Comments

2 reactions

erezsh commented, Mar 20, 2022

@qorrect See maybe_placeholders in https://lark-parser.readthedocs.io/en/latest/classes.html

0 reactions

charlie-sanders commented, Mar 20, 2022

So can I use the * rule today, in 2022, to guarantee a None and keep the arguments consistent?
