clean_up_tokenization_spaces=True won't clean up spaces
Environment info
- `transformers` version: 4.17.0
- Platform: Linux-5.10.0-051000-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.2+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes (the same on CPU)
- Using distributed or parallel set-up in script?: No
Who can help
@LysandreJik, @Narsil, @SaulLu
Information
Model I am using (Bert, XLNet …): BERT (`bert-large-cased`)
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
encoded = tokenizer("This thing costs £4.56")
decoded = tokenizer.decode(encoded["input_ids"], clean_up_tokenization_spaces=True)
print(decoded)
```
Actual output: `[CLS] This thing costs £4. 56 [SEP]`
I tried it also with NER pipelines and other text inputs.
Additional example: got `[CLS] ( including once - a - week tapping ) [SEP]` instead of `[CLS] (including once-a-week tapping) [SEP]`.
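To see why only some of the spaces disappear, here is a hedged sketch of the cleanup step that `clean_up_tokenization_spaces=True` triggers. The replacement list mirrors the rules transformers documents for its cleanup helper, not the exact library source:

```python
# Simplified re-implementation of the token-decode cleanup pass.
# It only collapses spaces that appear *before* certain punctuation
# and contractions; it never touches a space *after* punctuation.
def clean_up(text: str) -> str:
    for src, tgt in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
                     (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
                     (" 've", "'ve"), (" 're", "'re")]:
        text = text.replace(src, tgt)
    return text

# WordPiece turned "£4.56" into pieces joined as "£4 . 56". The cleanup
# removes the space before the dot but keeps the one after it:
print(clean_up("[CLS] This thing costs £4 . 56 [SEP]"))
# -> [CLS] This thing costs £4. 56 [SEP]
```

This explains both reported outputs: `4. 56` keeps its post-dot space, and the spaces around `-` and `(` in `once - a - week` are not in the rule list at all.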
Expected behavior
Expected output: `[CLS] This thing costs £4.56 [SEP]`

I expected the tokenizer to clean up all the spaces it introduced. Is there a different way to do so? Am I missing some trivial parameter?
Issue Analytics
- Created: a year ago
- Comments: 6 (5 by maintainers)
I cannot answer in general, as I don't know whether you have access to the original string, for instance. But if you do have access to it, then `offset_mapping` will be closer to what you expect most of the time. `decode` is a best-effort way to represent what the model has seen, but it cannot in general output exactly what you sent in. The reason is that `tokenizer.encode` is destructive and loses information. A simple example is that some tokenizers start by calling `.lower()`, so we cannot in general recover the capitalization. The same goes for spaces: `decode` will try to add them where they belong, but it cannot work 100% of the time, around punctuation for instance. You would like to get `"£4.56"`, but you would also want `"Hi. How are you doing ?"` (with the extra space after the dot). Since the two look the same in the input ids, `decode` has to make a choice and just uses one form.

Thank you for the explanation. I'm trying to build a custom NER-like system using token classification. I leverage `offset_mapping` during training to identify each token's class. While trying to integrate it in a NER-like pipeline (`TokenClassification`), I found out it was generating entities from the tokenized version. However, from your answer, I think it would be better to use character indexes to identify the correct span in the input sentence, right?
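The span recovery suggested above can be sketched without a model download. The offsets below are hand-written to mimic what a fast tokenizer returns with `return_offsets_mapping=True` (one `(start, end)` character pair per token); the token boundaries are illustrative, not the exact WordPiece output:

```python
text = "This thing costs £4.56"

# Hypothetical offsets for the tokens a model tagged as one entity.
entity_token_offsets = [(17, 18), (18, 20), (20, 22)]

# Take the start of the entity's first token and the end of its last
# token, then slice the *original* string - no decode(), so no lossy
# space reinsertion:
start = entity_token_offsets[0][0]
end = entity_token_offsets[-1][1]
print(text[start:end])  # -> £4.56
```

Because the slice comes from the untouched input string, capitalization, currency symbols, and intra-word punctuation all survive exactly as typed.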