
Add missing tokenizer test files [🏗️ in progress]

See original GitHub issue

🚀 Add missing tokenizer test files

Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.

Tokenizers concerned

  • not yet claimed: none
  • claimed
    • with an ongoing PR: none
    • with an accepted PR

How to contribute?

  1. Claim a tokenizer

    a. Choose a tokenizer from the list of “not yet claimed” tokenizers

    b. Check the comments on this issue to make sure no one has already indicated that they are working on this tokenizer

    c. Post a comment on this issue saying that you are handling this tokenizer

  2. Create a local development setup (if you have not already done so)

    I refer you to the “start-contributing-pull-requests” section of the Contributing guidelines, where everything is explained. Don’t be put off by step 5: for this contribution you will only need to run locally the tests you add.

  3. Follow the instructions in the README inside the templates/adding_a_missing_tokenization_test folder to generate the new test file template with cookiecutter. At the end of the template generation, don’t forget to move the new test file into the sub-folder of the tests folder named after the model you are adding the test for (see the template-generation sketch after this list). Some details about the questionnaire, assuming the lowercased model name is brand_new_bert:

    • “has_slow_class”: set to true if there is a tokenization_brand_new_bert.py file in the folder src/transformers/models/brand_new_bert.
    • “has_fast_class”: set to true if there is a tokenization_brand_new_bert_fast.py file in the folder src/transformers/models/brand_new_bert.
    • “slow_tokenizer_use_sentencepiece”: set to true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece. If this tokenizer does not have a tokenization_brand_new_bert.py file, set it to false.
  4. Complete the setUp method in the generated test file; you can take inspiration from how it is done for the other tokenizers (see the setUp sketch after this list).

  5. Try to run all the added tests. Some tests may not pass, and it is important to understand why: sometimes a common test is not suited to a particular tokenizer, and sometimes the tokenizer itself has a bug. You can also look at what is done in similar tokenizer tests; if you hit big problems or don’t know what to do, we can discuss it in the PR (step 7).

  6. (Bonus) Try to get a good understanding of the tokenizer so that you can add custom tests for it

  7. Open a PR with the new test file added; remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
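
For step 3, here is a rough sketch of the template-generation workflow, driven from Python instead of running the cookiecutter CLI by hand as described in the folder’s README. The generated file name and its location are assumptions that depend on how you answer the questionnaire, so check what cookiecutter actually produces before relying on the paths below.

```python
# Rough sketch for step 3 (an alternative to running the cookiecutter CLI manually).
# Assumptions: cookiecutter is installed (pip install cookiecutter), this runs from the
# repository root, and the generated file is named test_tokenization_<model>.py; check
# the template's README and cookiecutter.json for the real prompts and output layout.
import shutil
from pathlib import Path

from cookiecutter.main import cookiecutter

model = "brand_new_bert"  # placeholder model name used throughout this issue

# Answer the questionnaire interactively in the terminal
# (has_slow_class, has_fast_class, slow_tokenizer_use_sentencepiece, ...).
cookiecutter("templates/adding_a_missing_tokenization_test/")

# Move the generated test file into the model's sub-folder under tests/.
target_dir = Path("tests") / model
target_dir.mkdir(parents=True, exist_ok=True)
for generated in Path(".").glob(f"**/test_tokenization_{model}.py"):
    if "tests" not in generated.parts:  # skip a file that is already in place
        shutil.move(str(generated), str(target_dir / generated.name))
```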
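
And for step 4, a minimal sketch of what a completed setUp can look like, modeled on the WordPiece-style setup used in the existing BERT tokenizer test. BrandNewBertTokenizationTest is a placeholder name, BertTokenizer is only a stand-in so the snippet is self-contained, and the relative import of TokenizerTesterMixin assumes the file sits in a model sub-folder of the tests directory; adapt all of these to the tokenizer you actually claim.

```python
# Minimal sketch for step 4, modeled on the existing BERT tokenizer test.
# BertTokenizer is a stand-in so the example is concrete; replace it with the tokenizer
# class you are testing, and adjust the mixin import path to where your test file lives.
import os
import unittest

from transformers import BertTokenizer

from ..test_tokenization_common import TokenizerTesterMixin  # path depends on the test file location


class BrandNewBertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = BertTokenizer  # the slow tokenizer under test
    test_rust_tokenizer = False      # set to True only if a fast tokenizer exists

    def setUp(self):
        super().setUp()  # the mixin creates a temporary directory in self.tmpdirname

        # Write a tiny vocabulary so the tokenizer can be instantiated locally,
        # without downloading any pretrained files.
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]",
                        "want", "##want", "##ed", "runn", "##ing", "low", "lowest"]
        self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer:
            vocab_writer.write("".join(token + "\n" for token in vocab_tokens))
```

Once the file is in place, step 5 amounts to something like `python -m pytest tests/brand_new_bert/test_tokenization_brand_new_bert.py -v` from the repository root, with the placeholder paths replaced by the real ones.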

Tips

Do not hesitate to read the questions / answers in this issue 📰

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 3
  • Comments: 42 (24 by maintainers)

Top GitHub Comments

2 reactions
nnlnr commented, Apr 28, 2022

Hi @SaulLu, I’d be happy to work on LED - Thanks!!

1 reaction
IMvision12 commented, Oct 30, 2022

Yeah sure @danhphan Thanks.


Top Results From Across the Web

  • Building a tokenizer, block by block - Hugging Face Course
    Running the input through the model (using the pre-tokenized words to produce a sequence of tokens); Post-processing (adding the special tokens of the...
  • Lexical Analysis (Analyzer) in Compiler Design with Example
    The role of Lexical Analyzer in compiler design is to read character streams from the source code, check for legal tokens, and pass...
  • spaCy Tutorial – Complete Writeup - Machine Learning Plus
    Tokenization is the process of converting a text into smaller sub-texts, based on certain predefined rules. For example, sentences are tokenized to words...
  • Extracting, transforming and selecting features - Apache Spark
    Tokenization is the process of taking text (such as a sentence) and breaking ... Building on the StringIndexer example, let's assume we have...
  • Chapter 4. Text Vectorization and Transformation Pipelines
    This process is called feature extraction or more simply, vectorization, ... To set this up, let's create a list of our documents and...
