
Add missing tokenizer test files [🏗️ in progress]

See original GitHub issue

🚀 Add missing tokenizer test files

Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.

Tokenizers concerned

  • not yet claimed: none
  • claimed
    • with an ongoing PR: none
    • with an accepted PR

How to contribute?

  1. Claim a tokenizer

    a. Choose a tokenizer from the list of “not yet claimed” tokenizers

    b. Check the comments on this issue to make sure no one has already indicated that they are working on this tokenizer

    c. Post a comment on this issue saying that you are handling this tokenizer

  2. Create a local development setup (if you have not already done so)

    I refer you to the “start-contributing-pull-requests” section of the Contributing guidelines, where everything is explained. Don’t be put off by step 5: for this contribution you will only need to run locally the tests you add.

  3. Follow the instructions in the README inside the templates/adding_a_missing_tokenization_test folder to generate the new test file template with cookiecutter. At the end of the template generation, don’t forget to move the new test file into the sub-folder of the tests folder named after the model you are adding the test for (see the template-generation sketch after this list). Some details about the questionnaire, assuming the lowercased model name is brand_new_bert:

    • “has_slow_class”: set to true if there is a tokenization_brand_new_bert.py file in the folder src/transformers/models/brand_new_bert.
    • “has_fast_class”: set to true if there is a tokenization_brand_new_bert_fast.py file in the folder src/transformers/models/brand_new_bert.
    • “slow_tokenizer_use_sentencepiece”: set to true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece. If this tokenizer does not have a tokenization_brand_new_bert.py file, set it to false.
  4. Complete the setUp method in the generated test file; you can take inspiration from how it is done for the other tokenizers (see the setUp sketch after this list).

  5. Try to run all the added tests. Some tests may not pass, and it is important to understand why: sometimes a common test is not suited to a particular tokenizer, and sometimes the tokenizer itself has a bug. You can also look at what is done in similar tokenizer tests; if you hit big problems or don’t know what to do, we can discuss it in the PR (step 7).

  6. (Bonus) Try to get a good understanding of the tokenizer so that you can add custom tests for it

  7. Open a PR with the new test file added; remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
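
For step 3, here is a rough sketch of the template-generation workflow, driven from Python instead of running the cookiecutter CLI by hand as described in the folder’s README. The generated file name and its location are assumptions that depend on how you answer the questionnaire, so check what cookiecutter actually produces before relying on the paths below.

```python
# Rough sketch for step 3 (an alternative to running the cookiecutter CLI manually).
# Assumptions: cookiecutter is installed (pip install cookiecutter), this runs from the
# repository root, and the generated file is named test_tokenization_<model>.py; check
# the template's README and cookiecutter.json for the real prompts and output layout.
import shutil
from pathlib import Path

from cookiecutter.main import cookiecutter

model = "brand_new_bert"  # placeholder model name used throughout this issue

# Answer the questionnaire interactively in the terminal
# (has_slow_class, has_fast_class, slow_tokenizer_use_sentencepiece, ...).
cookiecutter("templates/adding_a_missing_tokenization_test/")

# Move the generated test file into the model's sub-folder under tests/.
target_dir = Path("tests") / model
target_dir.mkdir(parents=True, exist_ok=True)
for generated in Path(".").glob(f"**/test_tokenization_{model}.py"):
    if "tests" not in generated.parts:  # skip a file that is already in place
        shutil.move(str(generated), str(target_dir / generated.name))
```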
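
And for step 4, a minimal sketch of what a completed setUp can look like, modeled on the WordPiece-style setup used in the existing BERT tokenizer test. BrandNewBertTokenizationTest is a placeholder name, BertTokenizer is only a stand-in so the snippet is self-contained, and the relative import of TokenizerTesterMixin assumes the file sits in a model sub-folder of the tests directory; adapt all of these to the tokenizer you actually claim.

```python
# Minimal sketch for step 4, modeled on the existing BERT tokenizer test.
# BertTokenizer is a stand-in so the example is concrete; replace it with the tokenizer
# class you are testing, and adjust the mixin import path to where your test file lives.
import os
import unittest

from transformers import BertTokenizer

from ..test_tokenization_common import TokenizerTesterMixin  # path depends on the test file location


class BrandNewBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = BertTokenizer  # the slow tokenizer under test
    test_rust_tokenizer = False      # set to True only if a fast tokenizer exists

    def setUp(self):
        super().setUp()  # the mixin creates a temporary directory in self.tmpdirname

        # Write a tiny vocabulary so the tokenizer can be instantiated locally,
        # without downloading any pretrained files.
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]",
                        "want", "##want", "##ed", "runn", "##ing", "low", "lowest"]
        self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer:
            vocab_writer.write("".join(token + "\n" for token in vocab_tokens))
```

Once the file is in place, step 5 amounts to something like `python -m pytest tests/brand_new_bert/test_tokenization_brand_new_bert.py -v` from the repository root, with the placeholder paths replaced by the real ones.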

Tips

Do not hesitate to read the questions / answers in this issue 📰

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 3
  • Comments: 42 (24 by maintainers)

Top GitHub Comments

2 reactions
nnlnr commented, Apr 28, 2022

Hi @SaulLu, I’d be happy to work on LED - Thanks!!

1 reaction
IMvision12 commented, Oct 30, 2022

Yeah sure @danhphan Thanks.


Top Results From Across the Web

  • Building a tokenizer, block by block - Hugging Face Course
    Running the input through the model (using the pre-tokenized words to produce a sequence of tokens); Post-processing (adding the special tokens of the...
  • Lexical Analysis (Analyzer) in Compiler Design with Example
    The role of Lexical Analyzer in compiler design is to read character streams from the source code, check for legal tokens, and pass...
  • spaCy Tutorial – Complete Writeup - Machine Learning Plus
    Tokenization is the process of converting a text into smaller sub-texts, based on certain predefined rules. For example, sentences are tokenized to words...
  • Extracting, transforming and selecting features - Apache Spark
    Tokenization is the process of taking text (such as a sentence) and breaking ... Building on the StringIndexer example, let's assume we have...
  • Chapter 4. Text Vectorization and Transformation Pipelines
    This process is called feature extraction or more simply, vectorization, ... To set this up, let's create a list of our documents and...
