Padding Strategy Code missing an else case (maybe?)

Environment info

  • transformers version: 3.0.2
  • Platform: macOS 10.15.5
  • Python version: 3.7
  • PyTorch version (GPU?): 1.5 GPU-Yes
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

tokenizers: @mfuntowicz Summarization: @sshleifer T5: @patrickvonplaten

Information

Model I am using: T5, loaded via AutoTokenizer.

The problem arises when using: tokenizer([line], max_length=max_length, padding='max_length' if pad_to_max_length else False, truncation=True, return_tensors=return_tensors, **extra_kw)

In batch encoding, the latest code picks a padding strategy in _get_padding_truncation_strategies(self, padding=False, truncation=False, max_length=None, pad_to_multiple_of=None, verbose=True, **kwargs):

    elif padding is not False:
        if padding is True:
            padding_strategy = PaddingStrategy.LONGEST  # Default to pad to the longest sequence in the batch
        elif not isinstance(padding, PaddingStrategy):
            padding_strategy = PaddingStrategy(padding)

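When calling the tokenizer, I first passed the PaddingStrategy.MAX_LENGTH enum member instead of the string 'max_length'. Because the value is already a PaddingStrategy, neither inner branch above runs, padding_strategy is never assigned, and the call fails with an error saying that padding_strategy is not defined.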
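The failure mode can be illustrated without transformers installed. This is a minimal sketch, not the library's code: PaddingStrategy here is a stand-in enum with assumed member values, the final else branch is an assumption about the surrounding code, and resolve_padding_buggy and try_resolve are hypothetical helpers that mirror only the branch quoted above.

```python
from enum import Enum


class PaddingStrategy(Enum):
    """Stand-in for the transformers enum (member values are assumed)."""
    LONGEST = "longest"
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"


def resolve_padding_buggy(padding=False):
    """Hypothetical helper mirroring the quoted branch, with no else case."""
    if padding is not False:
        if padding is True:
            padding_strategy = PaddingStrategy.LONGEST
        elif not isinstance(padding, PaddingStrategy):
            # Strings such as "max_length" are looked up by value.
            padding_strategy = PaddingStrategy(padding)
        # An enum member matches neither branch, so nothing is assigned.
    else:
        # Assumed fallback when padding is False.
        padding_strategy = PaddingStrategy.DO_NOT_PAD
    return padding_strategy


def try_resolve(padding):
    """Return the resolved strategy, or the name of the exception raised."""
    try:
        return resolve_padding_buggy(padding)
    except UnboundLocalError as exc:
        return type(exc).__name__


print(try_resolve("max_length"))
print(try_resolve(PaddingStrategy.MAX_LENGTH))
```

In this sketch the string argument resolves normally, while the enum member reaches the return statement with padding_strategy unassigned.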

To reproduce

Call the tokenizer as: tokenizer([line], max_length=max_length, padding=PaddingStrategy.MAX_LENGTH if pad_to_max_length else False, truncation=True, return_tensors=return_tensors, **extra_kw)

Expected behavior

Passing a PaddingStrategy enum member as the padding argument should work without error.

Suggested solution

    elif padding is not False:
        if padding is True:
            padding_strategy = PaddingStrategy.LONGEST  # Default to pad to the longest sequence in the batch
        elif not isinstance(padding, PaddingStrategy):
            padding_strategy = PaddingStrategy(padding)
        else:
            padding_strategy = padding  # Already a PaddingStrategy member, use it as-is

It’s basically a one-line fix. I can raise a PR for it, unless PaddingStrategy wasn’t designed to be used directly?
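To check that the extra else case covers every kind of input, here is a small self-contained sketch. It assumes a stand-in PaddingStrategy enum (member values assumed, not imported from transformers), an assumed DO_NOT_PAD fallback for padding=False, and a hypothetical helper named resolve_padding_fixed that contains only the branch shown above.

```python
from enum import Enum


class PaddingStrategy(Enum):
    """Stand-in for the transformers enum (member values are assumed)."""
    LONGEST = "longest"
    MAX_LENGTH = "max_length"
    DO_NOT_PAD = "do_not_pad"


def resolve_padding_fixed(padding=False):
    """Hypothetical helper: the suggested branch plus an assumed fallback."""
    if padding is not False:
        if padding is True:
            # True means pad to the longest sequence in the batch.
            padding_strategy = PaddingStrategy.LONGEST
        elif not isinstance(padding, PaddingStrategy):
            # Strings such as "max_length" are looked up by value.
            padding_strategy = PaddingStrategy(padding)
        else:
            # Already an enum member: pass it through unchanged.
            padding_strategy = padding
    else:
        # Assumed fallback when padding is False.
        padding_strategy = PaddingStrategy.DO_NOT_PAD
    return padding_strategy


for value in (True, False, "max_length", PaddingStrategy.MAX_LENGTH):
    print(repr(value), "->", resolve_padding_fixed(value))
```

With the else case in place, the string and the enum member resolve to the same strategy, so callers can use either form.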

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

aphedges commented, Sep 4, 2020 (1 reaction)

This issue also applies to the truncation parameter.

I assumed the enums are supposed to be used directly because the release notes (https://github.com/huggingface/transformers/releases/tag/v3.0.0) explicitly mention the TensorType enum, which is defined right below the PaddingStrategy and TruncationStrategy enums.

I agree that this is a problem that should be fixed, if the enums are meant to be used.

sshleifer commented, Nov 5, 2020 (0 reactions)

Nice, thanks!
