Add missing tokenizer test files [:building_construction: in progress]
Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.
Tokenizers concerned
not yet claimed
none
claimed
- LED @nnlnr
- Flaubert @anmolsjoshi
- Electra @Rajathbharadwaj
- ConvBert @elusenji
- RemBert @IMvision12
- Splinter @ashwinjohn3
with an ongoing PR
none
with an accepted PR
How to contribute?
1. Claim a tokenizer

   a. Choose a tokenizer from the list of "not yet claimed" tokenizers
   b. Check that no one in the comments on this issue has already indicated that they are working on this tokenizer
   c. Post a message in this issue saying that you are handling this tokenizer
2. Create a local development setup (if you have not already done so)

   I refer you to the "start-contributing-pull-requests" section of the Contributing guidelines, where everything is explained. Don't be put off by step 5: for this contribution you will only need to run locally the tests you add.
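The setup described in the Contributing guidelines boils down to a few commands. This is a sketch, not the authoritative guide: the fork URL placeholder and the virtual-environment name are illustrative, and the exact steps may vary with your platform and the current version of the guidelines.

```shell
# Clone your fork of transformers (replace <your-username> with your GitHub handle)
git clone git@github.com:<your-username>/transformers.git
cd transformers
# Keep a reference to the main repository to sync your fork later
git remote add upstream https://github.com/huggingface/transformers.git
# Work inside a virtual environment
python -m venv .env && source .env/bin/activate
# Install transformers in editable mode with the development dependencies
pip install -e ".[dev]"
```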
3. Follow the instructions in the README inside the `templates/adding_a_missing_tokenization_test` folder to generate the template with cookiecutter for the new test file you will be adding. Don't forget to move the new test file, at the end of the template generation, to the sub-folder named after the model for which you are adding the test file in the `tests` folder. Some details about the questionnaire - assuming that the lowercase model name is `brand_new_bert`:
   - "has_slow_class": Set True if there is a `tokenization_brand_new_bert.py` file in the folder `src/transformers/models/brand_new_bert`.
   - "has_fast_class": Set True if there is a `tokenization_brand_new_bert_fast.py` file in the folder `src/transformers/models/brand_new_bert`.
   - "slow_tokenizer_use_sentencepiece": Set True if the tokenizer defined in the `tokenization_brand_new_bert.py` file uses sentencepiece. If the tokenizer doesn't have a `tokenization_brand_new_bert.py` file, set False.
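Run from the root of your `transformers` clone, this step looks roughly like the following. The `brand_new_bert` file and folder names are the placeholder from above; the exact layout of the `tests` folder may differ between versions of the repository, so check the template's README for the authoritative instructions.

```shell
# cookiecutter drives the questionnaire described above
pip install cookiecutter
cookiecutter templates/adding_a_missing_tokenization_test/
# Answer the prompts (has_slow_class, has_fast_class,
# slow_tokenizer_use_sentencepiece), then move the generated file
# into the model's sub-folder of the tests folder
mv test_tokenization_brand_new_bert.py tests/brand_new_bert/
```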
4. Complete the `setUp` method in the generated test file; you can take inspiration from how it is done for the other tokenizers.
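To illustrate the pattern, here is a deliberately simplified, self-contained sketch of what `setUp` typically does in existing tokenizer tests: it writes a tiny fixture vocabulary into a temporary directory that the tests can then load. The class name and vocabulary contents are hypothetical; the real generated file will subclass the project's tokenizer test mixin and instantiate the actual tokenizer, so use neighbouring test files as the reference.

```python
import os
import tempfile
import unittest


class BrandNewBertTokenizationTestSketch(unittest.TestCase):
    """Hypothetical sketch, not the real generated transformers test."""

    def setUp(self):
        # Create a throwaway directory holding a tiny vocabulary file,
        # mirroring the fixture-building pattern of existing tokenizer tests.
        self.tmpdirname = tempfile.mkdtemp()
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "hello", "world"]
        self.vocab_file = os.path.join(self.tmpdirname, "vocab.txt")
        with open(self.vocab_file, "w", encoding="utf-8") as f:
            f.write("\n".join(vocab_tokens) + "\n")

    def test_vocab_file_written(self):
        # Sanity check that setUp produced the fixture file.
        self.assertTrue(os.path.isfile(self.vocab_file))
```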
5. Try to run all the added tests. It is possible that some tests will not pass, and it is important to understand why: sometimes a common test is not suited to a tokenizer, and sometimes the tokenizer has a bug. You can also look at what is done in similar tokenizer tests; if there are big problems or you don't know what to do, we can discuss it in the PR (step 7).
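Running the new tests locally looks roughly like this (again using the `brand_new_bert` placeholder; the tests sub-folder and test names depend on your version of the repository):

```shell
# Run every test in the new file, verbosely
python -m pytest tests/brand_new_bert/test_tokenization_brand_new_bert.py -v
# Re-run a single failing test in isolation with pytest's -k filter
python -m pytest tests/brand_new_bert/test_tokenization_brand_new_bert.py -k "test_full_tokenizer" -v
```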
6. (Bonus) Try to get a good understanding of the tokenizer so that you can add custom tests for it.
7. Open a PR with the new test file added. Remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
Tips
Do not hesitate to read the questions / answers in this issue.
> Hi @SaulLu, I'd be happy to work on LED - Thanks!!

Yeah sure @danhphan, thanks.