
tokenizer "is_split_into_words" seems not work

See original GitHub issue

I input a tokenized list of tokens, but it returns a different result (not counting pad tokens). It seems to tokenize the pre-tokenized tokens again, ignoring is_split_into_words. Please refer to the code below:

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,                           # Raw sentence to encode.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                return_token_type_ids=False,
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

###### tokenizing pretokenized tokens as list
encoded_dict = tokenizer.encode_plus(
                tokenized,                      # Already-tokenized list of tokens.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                return_token_type_ids=False,
                truncation=False,
                is_split_into_words=True)

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 114 # it should be 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '#', '#', 'c', 'and', 'its', 'founder', ',', 'jim', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', ',', 'guilty', 'of', 'selling', 'dc', '#', '#', 'c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'after', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', '-', '-', 'also', 'a', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'news', 'in', 'the', 'pipeline', '.']

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

3 reactions
zhangzhenyu13 commented, Apr 13, 2022

I think the tokenizer should support a new kwarg such as: is_already_tokens=False/True

3 reactions
LysandreJik commented, Nov 30, 2020

Hello! I think all of the confusion here may be because you’re expecting is_split_into_words to understand that the text was already pre-tokenized. This is not the case; it means that the string was split into words (not tokens), i.e., split on spaces.

@HenryPaik1, in your example, your list of words is the following:

['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

Some of these strings are tokens, but not words. Running the encoding method on it once again means that you’re re-tokenizing some of these tokens.

You can see that this is the case, as the following token:

 [..., '##c', ...]

became:

[..., '#', '#', 'c', ...]
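
As a rough illustration of the difference (assuming the same bert-base-cased checkpoint as the snippet further down; the exact subword splits can vary by checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

words = ["guilty", "of", "selling"]   # whitespace-split words: what is_split_into_words expects
tokens = ["fl", "##avi", "##n"]       # subword tokens: the "##" markers are just '#' characters here

# Each word is tokenized as usual.
print(tokenizer.convert_ids_to_tokens(
    tokenizer(words, is_split_into_words=True, add_special_tokens=False)["input_ids"]))

# The '#' characters are split off as punctuation and the remainder is tokenized again.
print(tokenizer.convert_ids_to_tokens(
    tokenizer(tokens, is_split_into_words=True, add_special_tokens=False)["input_ids"]))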

I think in your case you’re looking for the method convert_tokens_to_ids: your sequence is already tokenized, you only need the IDs. If you’re looking to use encode_plus because you need padding/truncation/conversion to tensors, etc., then you can simply use it without specifying that the sequence is split into words. Please be aware that the following code only works on Python tokenizers, i.e., slow tokenizers.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,                           # Raw sentence to encode.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
# 80

###### tokenizing pretokenized tokens as list
encoded_dict = tokenizer.encode_plus(
                tokenized,                      # Already-tokenized list; slow tokenizers convert these tokens to ids directly.
                add_special_tokens=False,       # Do not add '[CLS]' and '[SEP]'.
                max_length=314,                 # Pad all sentences to this length.
                padding='max_length',
                return_attention_mask=True,     # Construct attention masks.
                return_tensors='pt',            # Return PyTorch tensors.
                truncation=False,
               )

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
# 80
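
For the first option mentioned in the comment above, a rough sketch of the convert_tokens_to_ids route (again assuming a slow bert-base-cased tokenizer; prepare_for_model is used here as one way to get padding and tensors once you already have ids):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Already-tokenized input (produced by the tokenizer itself, so every token is in the vocab).
tokens = tokenizer.tokenize("the latest investigation was authorized")

# Direct vocabulary lookup; nothing gets re-tokenized here.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokenizer.convert_ids_to_tokens(ids) == tokens)  # True: the tokens round-trip unchanged

# If padding / attention masks / tensors are still needed, prepare_for_model works on the ids.
encoded = tokenizer.prepare_for_model(
    ids,
    add_special_tokens=False,
    max_length=314,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt',
)
print(encoded['input_ids'].shape)  # padded out to max_length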