
Encoding.word_to_tokens() returns None within valid sequence

See original GitHub issue

System Info

  • transformers version: 4.23.1
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.10.6
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.12.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes(?)
  • Using distributed or parallel set-up in script?: no

Who can help?

@SaulLu @sgugger @stevhliu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Tokenize a sentence -> BatchEncoding
  2. Iterate over word_ids
  3. Call word_to_chars(word_index)
  4. A TypeError is raised at an arbitrary word index (see output below)
from transformers import AutoTokenizer

MODEL_NAME = "DTAI-KULeuven/robbertje-1-gb-non-shuffled"
MODEL_MAX_LENGTH = 512

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME, model_max_length=MODEL_MAX_LENGTH, truncation=True
)
text = "Dit is een goede tekst."

encoding = tokenizer(text)

# Iterate over all positions of word_ids() and look up the character span
# for each index; this raises a TypeError partway through (see output below).
for word_index in range(len(encoding.word_ids())):
    if word_index is not None:
        print(word_index)
        char_span = encoding.word_to_chars(word_index)

0
1
2
3
4
5
6
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tokenization_test.ipynb Cell 3 in <cell line: 1>()
      [2](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=1) if word_index is not None:
      [3](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=2)     print(word_index)
----> [4](vscode-notebook-cell:/tokenization_test.ipynb#W2sZmlsZQ%3D%3D?line=3)     char_span = encoding.word_to_chars(word_index)

File ~/opt/anaconda3/envs/SoS/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:615, in BatchEncoding.word_to_chars(self, batch_or_word_index, word_index, sequence_index)
    613     batch_index = 0
    614     word_index = batch_or_word_index
--> 615 return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index, sequence_index)))

TypeError: transformers.tokenization_utils_base.CharSpan() argument after * must be an iterable, not NoneType

The word index is valid (this output is from my original, longer text, not from the short example above):

encoding.word_ids()[word_index:word_index+10]
[164, 165, 166, 166, 166, 166, 167, 168, 168, 168]

On further investigation, I noticed there is a workaround: validate that a word-to-token mapping exists for the word index before calling word_to_chars():

if word_index is not None and encoding.word_to_tokens(word_index) is not None:
    [...]

So the underlying issue seems to be that word_to_tokens() sometimes returns None, although it seems counter-intuitive that there are words in a text that have no corresponding tokens. A fuller version of this guard is sketched below.
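
For illustration, here is a minimal, runnable sketch of that guard, assuming the tokenizer and encoding from the reproduction above; the word_to_tokens() check is the only change to the original loop:

for word_index in range(len(encoding.word_ids())):
    # Skip indices that no word maps to: len(word_ids()) also counts special
    # tokens, so it can exceed the number of actual words in the text.
    if encoding.word_to_tokens(word_index) is None:
        continue
    char_span = encoding.word_to_chars(word_index)
    print(word_index, char_span)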

Expected behavior

BatchEncoding.word_to_tokens() should not return None for a valid word index; or, if this can happen, the documentation should explain when and why.

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments:6 (3 by maintainers)

Top GitHub Comments

1 reaction
carschno commented, Nov 25, 2022

I suppose you are right. I had some doubts because in my aforementioned original text (long and erroneous), this occurred somewhere in the middle of the text. I will try to reproduce it, but I guess there might have been special tokens there as well, due to longer sequences of whitespace and/or punctuation.

0 reactions
SaulLu commented, Nov 25, 2022

Quoting the issue author: "But in the example I used in my investigations (and pasted here), this is not the case either."

I could be wrong, but it seems to me that your example uses a template. We can see it by running the following code:

print(encoding.word_ids())
print(tokenizer.convert_ids_to_tokens(encoding.input_ids))

which gives:

[None, 0, 1, 2, 3, 4, 5, None]
['<s>', 'Dit', 'Ġis', 'Ġeen', 'Ġgoede', 'Ġtekst', '.', '</s>']

Here we can see that the None entries correspond to the “template” tokens '<s>' and '</s>'.
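
Building on that explanation, one way to avoid the error is to derive the word indices from word_ids() itself rather than from range(len(encoding.word_ids())), so that the slots belonging to template/special tokens are never treated as word indices. A minimal sketch, assuming the encoding from the reproduction above:

# Keep only the word indices that word_ids() actually reports,
# dropping the None entries produced by the special tokens.
word_indices = sorted({i for i in encoding.word_ids() if i is not None})

for word_index in word_indices:
    char_span = encoding.word_to_chars(word_index)
    token_span = encoding.word_to_tokens(word_index)
    print(word_index, char_span, token_span)

For the example sentence this iterates over word indices 0 through 5 only, and word_to_chars() returns a span for each of them.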
