
tokenizer "is_split_into_words" seems not work

See original GitHub issue

I input a tokenized list of tokens, but it returns a different result (not counting pad tokens). It seems to tokenize the pre-tokenized tokens again, ignoring is_split_into_words. Please refer to the code below:

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,                           # Raw sentence to encode.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                return_token_type_ids=False,
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

###### tokenizing pretokenized tokens as list
encoded_dict = tokenizer.encode_plus(
                tokenized,                      # Already-tokenized list of tokens.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                return_token_type_ids=False,
                truncation=False,
                is_split_into_words=True)

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 114 # it should be 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '#', '#', 'c', 'and', 'its', 'founder', ',', 'jim', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', ',', 'guilty', 'of', 'selling', 'dc', '#', '#', 'c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'after', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', '-', '-', 'also', 'a', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'news', 'in', 'the', 'pipeline', '.']

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

3 reactions
zhangzhenyu13 commented, Apr 13, 2022

I think the tokenizer should support a new kwarg such as: is_already_tokens=False/True

3 reactions
LysandreJik commented, Nov 30, 2020

Hello! I think all of the confusion here may be because you’re expecting is_split_into_words to understand that the text was already pre-tokenized. This is not the case; it means that the string was split into words (not tokens), i.e., split on spaces.

@HenryPaik1, in your example, your list of words is the following:

['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

Some of these strings are tokens, but not words. Running the encoding method on it once again means that you’re re-tokenizing some of these tokens.

You can see that this is the case, as the following token:

 [..., '##c', ...]

became:

[..., '#', '#', 'c', ...]
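
As a rough illustration of the difference (assuming the same bert-base-cased checkpoint as the snippet further down; the exact subword splits can vary by checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

words = ["guilty", "of", "selling"]   # whitespace-split words: what is_split_into_words expects
tokens = ["fl", "##avi", "##n"]       # subword tokens: the "##" markers are just '#' characters here

# Each word is tokenized as usual.
print(tokenizer.convert_ids_to_tokens(
    tokenizer(words, is_split_into_words=True, add_special_tokens=False)["input_ids"]))

# The '#' characters are split off as punctuation and the remainder is tokenized again.
print(tokenizer.convert_ids_to_tokens(
    tokenizer(tokens, is_split_into_words=True, add_special_tokens=False)["input_ids"]))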

I think in your case you’re looking for the method convert_tokens_to_ids: your sequence is already tokenized, you only need the IDs. If you’re looking to use encode_plus because you need padding/truncation/conversion to tensors, etc., then you can simply use it without specifying that the sequence is split into words. Please be aware that the following code only works on Python tokenizers, i.e., slow tokenizers.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,                           # Raw sentence to encode.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
# 80

###### tokenizing pretokenized tokens as list
encoded_dict = tokenizer.encode_plus(
                tokenized,                      # Already-tokenized list; slow tokenizers convert these tokens to ids directly.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                truncation=False,
               )

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
# 80
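
For the first option mentioned in the comment above, a rough sketch of the convert_tokens_to_ids route (again assuming a slow bert-base-cased tokenizer; prepare_for_model is used here as one way to get padding and tensors once you already have ids):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Already-tokenized input (produced by the tokenizer itself, so every token is in the vocab).
tokens = tokenizer.tokenize("the latest investigation was authorized")

# Direct vocabulary lookup; nothing gets re-tokenized here.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokenizer.convert_ids_to_tokens(ids) == tokens)  # True: the tokens round-trip unchanged

# If padding / attention masks / tensors are still needed, prepare_for_model works on the ids.
encoded = tokenizer.prepare_for_model(
    ids,
    add_special_tokens=False,
    max_length=314,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt',
)
print(encoded['input_ids'].shape)  # padded out to max_length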