Can't use padding in Wav2Vec2Tokenizer. TypeError: '<' not supported between instances of 'NoneType' and 'int'.
See original GitHub issue

Questions & Help
Details
I’m trying to get a tensor of labels from text in order to train a Wav2Vec2ForCTC model from scratch, but apparently pad_token_id is set to None, even though I’ve set a pad_token in my tokenizer.
This is my code:
# Generating the processor
from transformers import Wav2Vec2CTCTokenizer
from transformers import Wav2Vec2FeatureExtractor
from transformers import Wav2Vec2Processor

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
# sampling_rate is defined earlier in the notebook
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=sampling_rate, padding_value=0.0, do_normalize=True, return_attention_mask=False)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Tokenizing the transcripts into label ids
with processor.as_target_processor():
    batch["labels"] = processor(batch["text"], padding=True, max_length=1000, return_tensors="pt").input_ids
This is the error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-45831c0137f6> in <module>
9
10 # Processing
---> 11 data = prepare(data)
12 data["input"] = data["input"][0]
13 data["input"] = np.array([inp.T.reshape(12*4096) for inp in data["input"]])
<ipython-input-4-aaba15f24a61> in prepare(batch)
29 # Texts
30 with processor.as_target_processor():
---> 31 batch["labels"] = processor(batch["text"], padding = True, max_length = 1000, return_tensors="pt").input_ids
32
33 return batch
~/anaconda3/lib/python3.8/site-packages/transformers/models/wav2vec2/processing_wav2vec2.py in __call__(self, *args, **kwargs)
115 the above two methods for more information.
116 """
--> 117 return self.current_processor(*args, **kwargs)
118
119 def pad(self, *args, **kwargs):
~/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2252 if is_batched:
2253 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 2254 return self.batch_encode_plus(
2255 batch_text_or_text_pairs=batch_text_or_text_pairs,
2256 add_special_tokens=add_special_tokens,
~/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2428
2429 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 2430 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
2431 padding=padding,
2432 truncation=truncation,
~/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py in _get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
2149
2150 # Test if we have a padding token
-> 2151 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
2152 raise ValueError(
2153 "Asking to pad but the tokenizer does not have a padding token. "
TypeError: '<' not supported between instances of 'NoneType' and 'int'
I’ve also tried setting the pad token with tokenizer.pad_token = "[PAD]". It didn’t work. Does anyone know what I’m doing wrong? Thanks.
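The comparison that fails in the traceback is self.pad_token_id < 0, which means pad_token_id is None rather than an integer: the tokenizer accepted the pad token string but could not resolve it to an id from vocab.json. A quick diagnostic and one possible workaround (a sketch under the assumption that the same vocab.json is used; add_special_tokens registers the token and assigns it an id if it is missing):

from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

# If this prints None, "[PAD]" did not resolve to an id from vocab.json.
print(tokenizer.pad_token_id)

if tokenizer.pad_token_id is None:
    # Registering the special token also adds it to the vocabulary,
    # so it receives a real id and padding can work again.
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    print(tokenizer.pad_token_id)  # now an integer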
Issue Analytics
- Created: 2 years ago
- Reactions: 2
- Comments: 6 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@patrickvonplaten I have the same error here, any help?
I am getting the same error when trying to use the GPT-2 tokenizer. I am trying to fine-tune a bert2gpt2 encoder-decoder model with your training scripts here: https://huggingface.co/patrickvonplaten/bert2gpt2-cnn_dailymail-fp16
I tried transformers 4.15.0 and 4.6.0; neither of them worked.
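For the GPT-2 case specifically, the pretrained tokenizer ships with no pad token at all, so the same None comparison fails as soon as padding is requested. A common workaround (a sketch, not necessarily what the linked training script does) is to reuse the EOS token for padding:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.pad_token)  # None out of the box

# Reuse the existing EOS token as the pad token; since it is already in
# the vocabulary, no new embedding row is needed.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["hello world", "hi"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # padded to the longest sequence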