tokenizer "is_split_into_words" seems not to work
I pass in an already tokenized list of tokens, but the tokenizer returns a different result (not counting pad tokens). It seems to re-tokenize the pre-tokenized tokens, ignoring is_split_into_words. Please refer to the code below:
sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."
encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=False,    # Do not add '[CLS]' and '[SEP]'.
    max_length=314,              # Pad all sentences to this length.
    padding='max_length',
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    return_token_type_ids=False,
    truncation=False,
    is_split_into_words=False)
input_ids = encoded_dict['input_ids']
# Convert back to tokens, dropping the padding ids.
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 79
print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']
###### Tokenizing the pre-tokenized tokens as a list
encoded_dict = tokenizer.encode_plus(
    tokenized,                   # The token list produced above.
    add_special_tokens=False,    # Do not add '[CLS]' and '[SEP]'.
    max_length=314,              # Pad all sentences to this length.
    padding='max_length',
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    return_token_type_ids=False,
    truncation=False,
    is_split_into_words=True)
input_ids = encoded_dict['input_ids']
# Convert back to tokens, dropping the padding ids.
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 114 # it should be 79
print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '#', '#', 'c', 'and', 'its', 'founder', ',', 'jim', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', ',', 'guilty', 'of', 'selling', 'dc', '#', '#', 'c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'after', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', '-', '-', 'also', 'a', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'news', 'in', 'the', 'pipeline', '.']
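For reference, the length difference appears to come from the '##'-prefixed pieces being split again on the '#' characters; a quick check with the same tokenizer:
print(tokenizer.tokenize('##c'))
>> ['#', '#', 'c']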
I think the tokenizer should support a new kwarg such as is_already_tokens=False/True.
Hello! I think all of the confusion here may be because you’re expecting is_split_into_words to understand that the text was already pre-tokenized. This is not the case: it means that the string was split into words (not tokens), i.e., split on spaces.
@HenryPaik1, in your example, your list of "words" is the tokenized list printed above. Some of these strings are tokens, but not words. Running the encoding method on it once again means that you’re re-tokenizing some of these tokens.
You can see this is the case, as a token such as '##c' in the first output became '#', '#', 'c' in the second one.
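As a minimal sketch of the intended usage (assuming a slow BERT tokenizer loaded from the bert-base-uncased checkpoint, which is only a guess based on the output above), passing whitespace-split words with is_split_into_words=True reproduces the same WordPiece tokens as tokenizing the raw string:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # assumed checkpoint

# Words split on spaces -- not WordPiece tokens.
words = "jim flavin guilty of selling dcc".split()
enc = tokenizer.encode_plus(words, add_special_tokens=False, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
>> ['jim', 'fl', '##avi', '##n', 'guilty', 'of', 'selling', 'dc', '##c']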
I think in your case you’re looking for the method convert_tokens_to_ids: your sequence is already tokenized, you only need the IDs. If you’re looking to use encode_plus because you need padding/truncation/conversion to tensors, etc., then you can simply use it without specifying that the sequence is split into words. Please be aware that the following code only works on Python tokenizers, i.e., slow tokenizers.