
Encoding.word_to_tokens() returns None within valid sequence

See original GitHub issue

System Info

  • transformers version: 4.23.1
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.10.6
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes(?)
  • Using distributed or parallel set-up in script?: no

Who can help?

@SaulLu @sgugger @stevhliu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Tokenize a sentence -> BatchEncoding
  2. Iterate over word_ids
  3. Call word_to_chars(word_index)
  4. A TypeError is raised at an arbitrary word index (see output below)
from transformers import AutoTokenizer

MODEL_NAME = "DTAI-KULeuven/robbertje-1-gb-non-shuffled"
MODEL_MAX_LENGTH = 512

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME, model_max_length=MODEL_MAX_LENGTH, truncation=True
)
text = "Dit is een goede tekst."

encoding = tokenizer(text)

# Iterate over all positions of word_ids() and look up the character span
# for each index; this raises a TypeError partway through (see output below).
for word_index in range(len(encoding.word_ids())):
    if word_index is not None:
        print(word_index)
        char_span = encoding.word_to_chars(word_index)

0
1
2
3
4
5
6
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tokenization_test.ipynb Cell 3 in <cell line: 1>()
      [2](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=1) if word_index is not None:
      [3](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=2)     print(word_index)
----> [4](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=3)     char_span = encoding.word_to_chars(word_index)

File ~/opt/anaconda3/envs/SoS/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:615, in BatchEncoding.word_to_chars(self, batch_or_word_index, word_index, sequence_index)
    613     batch_index = 0
    614     word_index = batch_or_word_index
--> 615 return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index, sequence_index)))

TypeError: transformers.tokenization_utils_base.CharSpan() argument after * must be an iterable, not NoneType

The word index is valid (this output is from my original, longer text, not from the short example above):

encoding.word_ids()[word_index:word_index+10]
[164, 165, 166, 166, 166, 166, 167, 168, 168, 168]

On further investigation, I noticed there is a workaround: validate that a word-to-token mapping exists for the word index before calling word_to_chars():

if word_index is not None and encoding.word_to_tokens(word_index) is not None:
    [...]

So the underlying issue seems to be that word_to_tokens() sometimes returns None, although it seems counter-intuitive that there are words in a text that have no corresponding tokens. A fuller version of this guard is sketched below.
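
For illustration, here is a minimal, runnable sketch of that guard, assuming the tokenizer and encoding from the reproduction above; the word_to_tokens() check is the only change to the original loop:

for word_index in range(len(encoding.word_ids())):
    # Skip indices that no word maps to: len(word_ids()) also counts special
    # tokens, so it can exceed the number of actual words in the text.
    if encoding.word_to_tokens(word_index) is None:
        continue
    char_span = encoding.word_to_chars(word_index)
    print(word_index, char_span)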

Expected behavior

BatchEncoding.word_to_tokens() should not return None for a valid word index; or, if this can happen, the documentation should explain when and why.

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments:6 (3 by maintainers)

Top GitHub Comments

1 reaction
carschno commented, Nov 25, 2022

I suppose you are right. I had some doubts because in my aforementioned original text (long and erroneous), this occurred somewhere in the middle of the text. I will try to reproduce it, but I guess there might have been special tokens there as well, due to longer sequences of whitespace and/or punctuation.

0 reactions
SaulLu commented, Nov 25, 2022

Quoting the issue author: "But in the example I used in my investigations (and pasted here), this is not the case either."

I could be wrong, but it seems to me that your example uses a template. We can see it by running the following code:

print(encoding.word_ids())
print(tokenizer.convert_ids_to_tokens(encoding.input_ids))

which gives:

[None, 0, 1, 2, 3, 4, 5, None]
['<s>', 'Dit', 'Ġis', 'Ġeen', 'Ġgoede', 'Ġtekst', '.', '</s>']

Here we can see that the None entries correspond to the “template” tokens '<s>' and '</s>'.
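
Building on that explanation, one way to avoid the error is to derive the word indices from word_ids() itself rather than from range(len(encoding.word_ids())), so that the slots belonging to template/special tokens are never treated as word indices. A minimal sketch, assuming the encoding from the reproduction above:

# Keep only the word indices that word_ids() actually reports,
# dropping the None entries produced by the special tokens.
word_indices = sorted({i for i in encoding.word_ids() if i is not None})

for word_index in word_indices:
    char_span = encoding.word_to_chars(word_index)
    token_span = encoding.word_to_tokens(word_index)
    print(word_index, char_span, token_span)

For the example sentence this iterates over word indices 0 through 5 only, and word_to_chars() returns a span for each of them.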
