
The PyTorch example question-answering/run_qa_beam_search.py does not work

See original GitHub issue

Environment info

  • transformers version: git+https://github.com/huggingface/transformers
  • Platform:
  • Python version: 3.8
  • PyTorch version (GPU?): 1.10.0
  • Using GPU in script?: yes

Who can help

@pvl @vanpelt @NielsRogge @sgugger

Models:

  • T5: gsarti/it5-base
  • encoder-decoder models (For example, BlenderBot, BART, Marian, Pegasus, T5, ByT5): gsarti/it5-base
  • Pytorch: 1.10.0

If the model isn’t in the list, ping @LysandreJik who will redirect you to the correct contributor.


Information

Model I am using (Bert, XLNet …): T5

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. clone code: git clone https://gitlab.com/nicolalandro/qandatrain.git (this repo contains a copy of the official training script, with the library versions pinned in requirements.txt)
  2. go into the code folder: cd qandatrain
  3. install requirements: pip install -r requirements.txt
  4. clone dataset: git clone https://huggingface.co/datasets/z-uo/squad-it
  5. run the code:
python src/run_qa_beam_search.py \
  --model_name_or_path gsarti/it5-base \
  --tokenizer_name gsarti/it5-base \
  --dataset_name squad \
  --train_file "squad-it/SQuAD_it-train_processed.json" \
  --validation_file "squad-it/SQuAD_it-test_processed.json" \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 3 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir it5-squad
  6. you obtain the following error:
...
Traceback (most recent call last):
  File "src/run_qa_beam_search.py", line 696, in <module>
    main()
  File "src/run_qa_beam_search.py", line 454, in main
    train_dataset = train_dataset.map(
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2036, in map
    return self._map_single(
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 503, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 470, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/fingerprint.py", line 406, in wrapper
    out = func(self, *args, **kwargs)
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2404, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2291, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/media/mint/Barracuda/Project/qandatrain/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1991, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "src/run_qa_beam_search.py", line 386, in prepare_train_features
    cls_index = input_ids.index(tokenizer.cls_token_id)
ValueError: 32005 is not in list

It seems to be a tokenizer error: the CLS token id is not found in the encoded input sequence.
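The traceback shows where this happens: prepare_train_features calls input_ids.index(tokenizer.cls_token_id), and list.index raises ValueError when the id is absent. T5-style encoder-decoder tokenizers do not insert a CLS token into the encoding, so the lookup fails even when the tokenizer reports a cls_token_id (here 32005). A minimal sketch of the failure, with illustrative placeholder ids rather than the real it5-base vocabulary:

```python
# Minimal sketch of the crash in prepare_train_features.
# Token ids below are illustrative placeholders, not real it5-base ids.
input_ids = [100, 200, 300, 1]   # a T5-style encoding: no CLS token inserted
cls_token_id = 32005             # id the tokenizer reports for its CLS token

try:
    cls_index = input_ids.index(cls_token_id)  # what the script does
except ValueError as err:
    print(err)  # 32005 is not in list
```

A defensive check like `if cls_token_id in input_ids:` would avoid the crash, but the root cause is that the script's feature preparation assumes the model inserts a CLS token into every sequence, which encoder-decoder models like T5 do not.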

Expected behavior

Train the T5 model for question answering on squad-it and create the trained model files at output_dir

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
nicolalandro commented, Oct 29, 2021

Perfect, with that param the train ended correctly. Thank you!

0 reactions
karthikrangasai commented, Oct 28, 2021

@nicolalandro I had the same error when writing tests for the script. You should use the --predict_with_generate flag.
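The suggested fix can be sketched as an adjusted launch command. This is a sketch under assumptions: the comment does not show the exact invocation, and in the transformers examples --predict_with_generate is a Seq2SeqTrainingArguments option used by the seq2seq question-answering script (run_seq2seq_qa.py), so switching to that script may also be needed for an encoder-decoder model like T5.

```shell
# Sketch of the suggested fix: same invocation as above with the flag appended
# (other hyperparameters unchanged).
python src/run_qa_beam_search.py \
  --model_name_or_path gsarti/it5-base \
  --tokenizer_name gsarti/it5-base \
  --train_file "squad-it/SQuAD_it-train_processed.json" \
  --validation_file "squad-it/SQuAD_it-test_processed.json" \
  --do_train --do_eval \
  --output_dir it5-squad \
  --predict_with_generate
```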

Read more comments on GitHub >
