Encoding.word_to_tokens() returns None within valid sequence

System Info
- transformers version: 4.23.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.10.6
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.12.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes(?)
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Tokenize a sentence -> BatchEncoding
- Iterate over word_ids
- Call word_to_chars(word_index)
- A TypeError is raised at an arbitrary word index (see output below)
```python
from transformers import AutoTokenizer

MODEL_NAME = "DTAI-KULeuven/robbertje-1-gb-non-shuffled"
MODEL_MAX_LENGTH = 512

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME, model_max_length=MODEL_MAX_LENGTH, truncation=True
)
text = "Dit is een goede tekst."
encoding = tokenizer(text)
for word_index in range(len(encoding.word_ids())):
    if word_index is not None:
        print(word_index)
        char_span = encoding.word_to_chars(word_index)
```
```
0
1
2
3
4
5
6
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tokenization_test.ipynb Cell 3 in <cell line: 1>()
      2 if word_index is not None:
      3     print(word_index)
----> 4     char_span = encoding.word_to_chars(word_index)

File ~/opt/anaconda3/envs/SoS/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:615, in BatchEncoding.word_to_chars(self, batch_or_word_index, word_index, sequence_index)
    613     batch_index = 0
    614     word_index = batch_or_word_index
--> 615 return CharSpan(*(self._encodings[batch_index].word_to_chars(word_index, sequence_index)))

TypeError: transformers.tokenization_utils_base.CharSpan() argument after * must be an iterable, not NoneType
```
The word index is valid:

```python
encoding.word_ids()[word_index:word_index+10]
```

```
[164, 165, 166, 166, 166, 166, 167, 168, 168, 168]
```
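As context for reading that slice (a standalone sketch that treats the ten IDs above as made-up data, with token positions relative to the slice rather than the full encoding): word_ids() maps each *token* position to the index of the *word* it belongs to, so repeated entries mean a single word was split into several subword tokens:

```python
from collections import defaultdict

# Illustrative data only, taken from the slice above; positions are
# relative to this list, not to the original 512-token encoding.
word_ids = [164, 165, 166, 166, 166, 166, 167, 168, 168, 168]

# Group token positions by word index to see the word -> tokens mapping.
word_to_token_positions = defaultdict(list)
for token_pos, word_id in enumerate(word_ids):
    word_to_token_positions[word_id].append(token_pos)

print(word_to_token_positions[166])  # word 166 spans four subword tokens
```

Nothing in this sketch calls the tokenizer; it only shows why a word index can legitimately map to several token positions.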
On further investigation, I noticed there is a workaround: validate that a word-to-token mapping exists for the word index before calling word_to_chars():

```python
if word_index is not None and encoding.word_to_tokens(word_index) is not None:
    [...]
```

So the underlying issue seems to be that word_to_tokens() sometimes returns None, although it seems counter-intuitive that a text could contain words with no corresponding tokens.
Expected behavior
BatchEncoding.word_to_tokens() should not output None; or it should be documented why/if this can happen.
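Until then, a defensive pattern is to guard every lookup, as in the workaround described earlier. A minimal sketch against a hypothetical stub object (StubEncoding and safe_word_to_chars are illustrative names, not part of the transformers API; only the guard itself mirrors the workaround):

```python
class StubEncoding:
    """Hypothetical stand-in for BatchEncoding, for illustration only."""

    def __init__(self, word_to_tokens_map, word_to_chars_map):
        self._tokens = word_to_tokens_map
        self._chars = word_to_chars_map

    def word_to_tokens(self, word_index):
        # Like the real method, returns None when no mapping exists.
        return self._tokens.get(word_index)

    def word_to_chars(self, word_index):
        return self._chars[word_index]


def safe_word_to_chars(encoding, word_index):
    # Guard: only call word_to_chars() when a token mapping exists.
    if word_index is None or encoding.word_to_tokens(word_index) is None:
        return None
    return encoding.word_to_chars(word_index)


encoding = StubEncoding(
    word_to_tokens_map={0: (1, 2)},  # word 0 maps to tokens 1..2
    word_to_chars_map={0: (0, 3)},   # word 0 spans characters 0..3
)
print(safe_word_to_chars(encoding, 0))  # (0, 3)
print(safe_word_to_chars(encoding, 1))  # None -- no token mapping
```

The same guard works unchanged against a real BatchEncoding, since it only relies on word_to_tokens() returning None for unmapped indices.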
Issue Analytics
- State:
- Created 10 months ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
I suppose you are right. I had some doubts because in my aforementioned original text (long and full of errors), this occurred somewhere in the middle of the text. I will try to reproduce it, but I guess there might have been special tokens as well, due to longer sequences of whitespace and/or punctuation.
I could be wrong, but it seems to me that your example uses a template. We can see it by printing the encoding's tokens alongside its word IDs: the None entries correspond to the "template" tokens '<s>' and '</s>'.
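To illustrate that alignment (made-up tokens and word IDs in the shape a RoBERTa-style tokenizer could produce, not actual output of the model above):

```python
# Hypothetical alignment for "Dit is een goede tekst.": the template
# (special) tokens <s> and </s> carry no word ID, so word_ids() holds
# None at those positions.
tokens   = ["<s>", "Dit", "is", "een", "goede", "tekst", ".", "</s>"]
word_ids = [None,   0,     1,    2,     3,       4,       5,   None]

for token, word_id in zip(tokens, word_ids):
    print(f"{token!r:10} -> {word_id}")

# Only the None entries line up with the template tokens.
none_tokens = [t for t, w in zip(tokens, word_ids) if w is None]
print(none_tokens)  # ['<s>', '</s>']
```

This is why word_to_tokens() can return None for some indices: special tokens occupy token positions without belonging to any word.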