ValueError: char_to_token() is not available when using Python based tokenizers; XLNetTokenizer and encodings.char_to_token bug

Environment info

  • transformers version: 4.6.1
  • Platform: Windows
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.8.1 , GPU enabled
  • Tensorflow version (GPU?): NA
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using (Bert, XLNet …): XLNet , “xlnet-base-cased”

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: my own modified script, but the issue can be reproduced with the snippet given below. The call encodings.char_to_token(i, answers[i]['answer_start']) raises: ValueError: char_to_token() is not available when using Python based tokenizers
  • This issue is very similar to #9326
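
The error message indicates that char_to_token() needs a Rust-backed ("fast") tokenizer rather than a Python-based ("slow") one. One quick way to check which kind is in use is the is_fast attribute that transformers tokenizers expose. The sketch below wraps that check in a hypothetical helper, backend_kind, and uses stand-in objects instead of real tokenizers so that it runs without downloading a model.

```python
from types import SimpleNamespace


def backend_kind(tokenizer):
    """Return 'fast' for Rust-backed tokenizers and 'slow' for Python-based ones.

    Hypothetical helper: it only reads the `is_fast` attribute and treats a
    missing attribute as 'slow'.
    """
    return "fast" if getattr(tokenizer, "is_fast", False) else "slow"


# Stand-ins for real tokenizer objects (assumption: illustration only).
slow_like = SimpleNamespace(is_fast=False)
fast_like = SimpleNamespace(is_fast=True)

print(backend_kind(slow_like))  # slow
print(backend_kind(fast_like))  # fast
```

In the reproduction script, calling backend_kind(tokenizer) on the XLNetTokenizer instance would be expected to report "slow", which matches the error.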

The task I am working on is:

  • an official GLUE/SQuAD task: SQuAD
  • A self-curated QA dataset in SQuAD format

Steps to reproduce the behavior: run the code snippet given below:

import json
from pathlib import Path
from transformers import XLNetTokenizer, XLNetForQuestionAnsweringSimple
import torch

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)


    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers


train_contexts, train_questions, train_answers = read_squad('train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('dev-v2.0.json')


def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx - 1:end_idx - 1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1  # When the gold label is off by one character
        elif context[start_idx - 2:end_idx - 2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2  # When the gold label is off by two characters


add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model_name = "xlnet-base-cased"
tokenizer = XLNetTokenizer.from_pretrained(model_name)
model = XLNetForQuestionAnsweringSimple.from_pretrained(model_name)
model.to(device)

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True, max_length=512)


def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})


add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
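
A possible workaround, not confirmed in this thread: load the Rust-backed XLNetTokenizerFast instead of XLNetTokenizer, so that the returned encodings support char_to_token(). The sketch assumes the tokenizers and sentencepiece packages are installed. The names load_fast_tokenizer and answer_token_span are hypothetical helpers, not part of transformers.

```python
def load_fast_tokenizer(model_name="xlnet-base-cased"):
    """Load the Rust-backed XLNet tokenizer (downloads files on first use)."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import XLNetTokenizerFast

    return XLNetTokenizerFast.from_pretrained(model_name)


def answer_token_span(encodings, batch_index, answer, missing):
    """Map an answer's character span to (start_token, end_token).

    `missing` is used whenever char_to_token() returns None, for example
    when the answer passage was truncated away.
    """
    start = encodings.char_to_token(batch_index, answer["answer_start"])
    end = encodings.char_to_token(batch_index, answer["answer_end"] - 1)
    return (
        missing if start is None else start,
        missing if end is None else end,
    )


# Intended usage with the variables from the script above (not executed here):
# tokenizer = load_fast_tokenizer()
# train_encodings = tokenizer(train_contexts, train_questions,
#                             truncation=True, padding=True, max_length=512)
# span = answer_token_span(train_encodings, 0, train_answers[0],
#                          tokenizer.model_max_length)
```

Passing the fallback value explicitly keeps the helper independent of any particular tokenizer attribute. The original script uses tokenizer.model_max_length for that purpose.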

Expected behavior

  • encodings.char_to_token(i, answers[i]['answer_start']) should return a token index
  • char_to_token should return a non-None value in this case, as it does with other tokenizers
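
To make the expected return value concrete, the sketch below emulates the character-to-token mapping with a hand-written list of (start, end) character offsets, one pair per token. It illustrates the idea only and is not the library's implementation. The helper name char_to_token_from_offsets and the offsets are made up for this example.

```python
def char_to_token_from_offsets(offsets, char_index):
    """Return the index of the token whose [start, end) span covers char_index.

    Returns None when no token covers that character (for example a space,
    or text that was truncated away).
    """
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


# Hand-written offsets for the text "New York City": one token per word.
offsets = [(0, 3), (4, 8), (9, 13)]

print(char_to_token_from_offsets(offsets, 5))  # 1, inside "York"
print(char_to_token_from_offsets(offsets, 3))  # None, the space after "New"
```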

Instead, the call raises ValueError: char_to_token() is not available when using Python based tokenizers. encodings._encodings seems to be None.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
cronoik commented, May 23, 2022

Sorry, I found the reaction of @akar5h very unfriendly and decided to ignore this issue. I’ll look into it later.

1 reaction
akar5h commented, Jun 26, 2021

much help

Top Results From Across the Web

char_to_token() is not available when using Python based ...
[1 fix] Steps to fix this transformers exception: ... Full details: ValueError: char_to_token() is not available when using Python based tokenizers.

Word_ids not working with deberta_v2 - 🤗Tokenizers
Currently, I am working on a token classification. When I have tried to use word_ids function during tokenization, it gave me an error....

Key Error while fine tunning T5 for summarization with ...
This is because this tokenizer returns an object with the following structure Tokenizer outpu. You have to amend the __getitem__ method of ...

tokenizer.py - from typing import List, Optional, Tuple
_encodings: raise ValueError("char_to_token() is not available when using Python based tokenizers") if char_index is not None: batch_index ...

python - How to perform tokenization for tweets in xlnet?
X_train has only one column that contains all tweets. xlnet_model = 'xlnet-large-cased' xlnet_tokenizer = XLNetTokenizer.from_pretrained( ...
