XLNetTokenizer and encodings.char_to_token bug: ValueError: char_to_token() is not available when using Python based tokenizers
See original GitHub issue
Environment info
- transformers version: 4.6.1
- Platform: Windows
- Python version: 3.8.8
- PyTorch version (GPU?): 1.8.1, GPU enabled
- Tensorflow version (GPU?): NA
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
@patrickvonplaten @LysandreJik
Information
Model I am using (Bert, XLNet ...): XLNet, "xlnet-base-cased"
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: my own modified script, but the issue can be reproduced with the snippet given below. The failing call is encodings.char_to_token(i, answers[i]['answer_start']), and the error I get is: ValueError: char_to_token() is not available when using Python based tokenizers (a quick check for this is sketched after this list)
- This issue is very similar to #9326
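For context, char_to_token() is only implemented on the BatchEncoding objects produced by the Rust-backed "fast" tokenizers; the pure-Python ("slow") tokenizers leave the underlying _encodings unset, so the call raises. A minimal sketch of how to check which kind of tokenizer you have (the check itself is illustrative, not part of the original report):

from transformers import XLNetTokenizer

# Slow (Python-based) tokenizers report is_fast == False; their output
# has no _encodings, which is why char_to_token() is unavailable on it.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tokenizer.is_fast)  # False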
The task I am working on is:
- [SQuAD] an official GLUE/SQuAD task
- A self-curated QA dataset in SQuAD format
Steps to reproduce the behavior: run the code snippet given below:
import json
from pathlib import Path
from transformers import XLNetTokenizer, XLNetForQuestionAnsweringSimple
import torch


def read_squad(path):
    # Flatten a SQuAD-format JSON file into parallel lists of
    # contexts, questions, and answer dicts.
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)
    return contexts, questions, answers


train_contexts, train_questions, train_answers = read_squad('train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('dev-v2.0.json')


def add_end_idx(answers, contexts):
    # Add an 'answer_end' character index to each answer dict.
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # Sometimes SQuAD answer spans are off by a character or two -- fix this.
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx - 1:end_idx - 1] == gold_text:
            # The gold label is off by one character.
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1
        elif context[start_idx - 2:end_idx - 2] == gold_text:
            # The gold label is off by two characters.
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2


add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_name = "xlnet-base-cased"
tokenizer = XLNetTokenizer.from_pretrained(model_name)
model = XLNetForQuestionAnsweringSimple.from_pretrained(model_name)
model.to(device)

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True, max_length=512)


def add_token_positions(encodings, answers):
    # Convert character-level answer spans to token-level start/end positions.
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # This is the call that raises the ValueError with the slow tokenizer.
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # If None, the answer passage has been truncated.
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})


add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
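The failure does not depend on the SQuAD data; a minimal sketch of just the failing call, assuming the same model name as above:

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
enc = tokenizer(["some context"], ["some question"], truncation=True, padding=True)
# Raises: ValueError: char_to_token() is not available when using Python based tokenizers
enc.char_to_token(0, 0)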
Expected behavior
- encodings.char_to_token(i, answers[i]['answer_start']) should return a token index
- char_to_token should not be None in this case, just as it is not with other tokenizers
Instead I get ValueError: char_to_token() is not available when using Python based tokenizers, and encodings._encodings seems to be None.
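A known workaround (not confirmed by the maintainers in this thread, but consistent with how the error is raised) is to switch to the Rust-backed XLNetTokenizerFast, whose output does populate _encodings and therefore supports char_to_token(); a minimal sketch with made-up example strings:

from transformers import XLNetTokenizerFast

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
encodings = tokenizer(["Paris is the capital of France."],
                      ["What is its capital?"],
                      truncation=True, padding=True, max_length=512)
# Maps character 0 of the first context to its token index instead of raising.
print(encodings.char_to_token(0, 0))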
Issue Analytics
- Created 2 years ago
- Comments: 10 (4 by maintainers)
Top GitHub Comments
Sorry, I found the reaction of @akar5h very unfriendly and decided to ignore this issue. I'll look into it later.
much help