ValueError: char_to_token() is not available when using Python based tokenizers; XLNetTokenizer and encodings.char_to_token bug

Environment info

transformers version: 4.6.1
Platform: Windows
Python version: 3.8.8
PyTorch version (GPU?): 1.8.1, GPU enabled
Tensorflow version (GPU?): NA
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using (Bert, XLNet …): XLNet, "xlnet-base-cased"

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: the issue can be reproduced with the snippet given below.

The failing call is encodings.char_to_token(i, answers[i]['answer_start']). The error I get is:

ValueError: char_to_token() is not available when using Python based tokenizers

This issue is very similar to #9326
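
For context, the message refers to the tokenizer class: `XLNetTokenizer` is the pure-Python ("slow") implementation, while `char_to_token()` is only available on encodings produced by the Rust-backed "fast" tokenizers such as `XLNetTokenizerFast`. The following is a minimal sketch of a guard that fails early with a clearer message. The helper names `require_fast_tokenizer` and `load_fast_xlnet_tokenizer` are hypothetical and not part of transformers; the sketch only assumes that tokenizers expose an `is_fast` attribute.

```python
def require_fast_tokenizer(tokenizer):
    """Return the tokenizer unchanged if it is a fast tokenizer, else raise."""
    # `is_fast` is True only for the Rust-backed tokenizers, whose encodings
    # support char_to_token(). Objects without the attribute are treated as slow.
    if not getattr(tokenizer, "is_fast", False):
        raise TypeError(
            f"{type(tokenizer).__name__} is a Python-based tokenizer; "
            "char_to_token() needs a fast tokenizer such as XLNetTokenizerFast."
        )
    return tokenizer


def load_fast_xlnet_tokenizer(model_name="xlnet-base-cased"):
    """Hypothetical convenience loader for the fast XLNet tokenizer."""
    # Imported lazily so the guard above can be used without transformers installed.
    from transformers import XLNetTokenizerFast

    return require_fast_tokenizer(XLNetTokenizerFast.from_pretrained(model_name))
```

Under these assumptions, replacing `XLNetTokenizer.from_pretrained(model_name)` in the reproduction script with `load_fast_xlnet_tokenizer(model_name)` should let the `char_to_token()` calls run; I have not verified this against every transformers version.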

The tasks I am working on are:

an official GLUE/SQuAD task: SQuAD
A self-curated QA dataset in SQuAD format

Steps to reproduce the behavior: run the code snippet given below:

import json
from pathlib import Path
from transformers import XLNetTokenizer, XLNetForQuestionAnsweringSimple
import torch

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)


    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers


train_contexts, train_questions, train_answers = read_squad('train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('dev-v2.0.json')


def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx - 1:end_idx - 1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1  # When the gold label is off by one character
        elif context[start_idx - 2:end_idx - 2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2  # When the gold label is off by two characters


add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model_name = "xlnet-base-cased"
tokenizer = XLNetTokenizer.from_pretrained(model_name)
model = XLNetForQuestionAnsweringSimple.from_pretrained(model_name)
model.to(device)

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True, max_length=512)


def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})


add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
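
To illustrate what `add_token_positions` relies on: `char_to_token` maps a character index in the original text to the index of the token that covers it, and yields `None` when no token does (for example after truncation). Below is a rough plain-Python sketch of that lookup, assuming per-token `(start, end)` character offsets like those a fast tokenizer can return with `return_offsets_mapping=True`. The function `char_to_token_from_offsets` is hypothetical, not the library's implementation, and it ignores the sequence index that matters for context/question pairs.

```python
from typing import List, Optional, Tuple


def char_to_token_from_offsets(
    offsets: List[Tuple[int, int]], char_index: int
) -> Optional[int]:
    """Return the index of the token whose span covers char_index, or None."""
    for token_index, (start, end) in enumerate(offsets):
        # Special and padding tokens typically carry an empty span; skip them.
        if start == end:
            continue
        # Spans are half-open: start is included, end is excluded.
        if start <= char_index < end:
            return token_index
    return None


# Toy offsets for the text "Hello world" split into "Hello" and "world",
# with one special token (empty span) at each end.
offsets = [(0, 0), (0, 5), (6, 11), (0, 0)]

first = char_to_token_from_offsets(offsets, 0)     # 1, inside "Hello"
second = char_to_token_from_offsets(offsets, 6)    # 2, inside "world"
missing = char_to_token_from_offsets(offsets, 11)  # None, past the last token
```

The `None` result is the case the script above replaces with `tokenizer.model_max_length`.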