clean_up_tokenization_spaces=True won't clean up spaces
Environment info
- `transformers` version: 4.17.0
- Platform: Linux-5.10.0-051000-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.2+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes (the same on CPU)
- Using distributed or parallel set-up in script?: No
Who can help
@LysandreJik, @Narsil, @SaulLu
Information
Model I am using (Bert, XLNet …): BERT (`bert-large-cased`)
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
encoded = tokenizer("This thing costs £4.56")
decoded = tokenizer.decode(encoded["input_ids"], clean_up_tokenization_spaces=True)
print(decoded)
```
Actual output: `[CLS] This thing costs £4. 56 [SEP]`
I tried it also with NER pipelines and other text inputs.
Additional example: got `[CLS] ( including once - a - week tapping ) [SEP]` instead of `[CLS] (including once-a-week tapping) [SEP]`.
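To see why only some of the spaces disappear, here is a hedged sketch of the cleanup step that `clean_up_tokenization_spaces=True` triggers. The replacement list mirrors the rules transformers documents for its cleanup helper, not the exact library source:

```python
# Simplified re-implementation of the token-decode cleanup pass.
# It only collapses spaces that appear *before* certain punctuation
# and contractions; it never touches a space *after* punctuation.
def clean_up(text: str) -> str:
    for src, tgt in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
                     (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
                     (" 've", "'ve"), (" 're", "'re")]:
        text = text.replace(src, tgt)
    return text

# WordPiece turned "£4.56" into pieces joined as "£4 . 56". The cleanup
# removes the space before the dot but keeps the one after it:
print(clean_up("[CLS] This thing costs £4 . 56 [SEP]"))
# -> [CLS] This thing costs £4. 56 [SEP]
```

This explains both reported outputs: `4. 56` keeps its post-dot space, and the spaces around `-` and `(` in `once - a - week` are not in the rule list at all.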
Expected behavior
Expected output: `[CLS] This thing costs £4.56 [SEP]`

I expected the tokenizer to clean up all the spaces it introduced. Is there a different way to do so? Am I missing some trivial parameter?
Issue Analytics
- Created: a year ago
- Comments: 6 (5 by maintainers)
I cannot answer in general, as I don't know whether you have access to the original string, for instance. But if you do have access to it, then `offset_mapping` will be closer to what you expect most of the time. `decode` is a best-effort way to represent what the model has seen, but it cannot in general output exactly what you sent in. The reason is that `tokenizer.encode` is destructive and loses information. A simple example is that some tokenizers start by calling `.lower()`, so we cannot in general recover the capitalization. The same goes for spaces: `decode` will try to add them where they belong, but it cannot work 100% of the time, around punctuation for instance. You would like to get `"£4.56"`, but you would also want `"Hi. How are you doing ?"` (with the extra space after the dot). Since the two look the same in the input ids, `decode` has to make a choice and just uses one form.

Thank you for the explanation. I'm trying to build a custom NER-like system using token classification. I leverage `offset_mapping` during training to identify each token's class. While trying to integrate it in a NER-like pipeline (`TokenClassification`), I found out it was generating entities from the tokenized version. However, from your answer, I think it would be better to use character indexes to identify the correct span in the input sentence, right?
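The span recovery suggested above can be sketched without a model download. The offsets below are hand-written to mimic what a fast tokenizer returns with `return_offsets_mapping=True` (one `(start, end)` character pair per token); the token boundaries are illustrative, not the exact WordPiece output:

```python
text = "This thing costs £4.56"

# Hypothetical offsets for the tokens a model tagged as one entity.
entity_token_offsets = [(17, 18), (18, 20), (20, 22)]

# Take the start of the entity's first token and the end of its last
# token, then slice the *original* string - no decode(), so no lossy
# space reinsertion:
start = entity_token_offsets[0][0]
end = entity_token_offsets[-1][1]
print(text[start:end])  # -> £4.56
```

Because the slice comes from the untouched input string, capitalization, currency symbols, and intra-word punctuation all survive exactly as typed.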