pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.

Environment info

  • transformers version: 4.3.3
  • Platform: Linux-5.4.0-65-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.7.1 (False)
  • Tensorflow version (GPU?): 2.5.0-dev20210225 (False)
  • Using GPU in script?: 2 V100 32GB
  • Using distributed or parallel set-up in script?: parallel

Who can help

@LysandreJik @sgugger @n1t0

Information

Model I am using (Bert, XLNet …): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I used run_ner.py from the official examples.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

A PoS tagging task with these datasets: https://github.com/yigit353/turkish-bert-itu/tree/main/imst

To reproduce

Steps to reproduce the behavior:

  1. Converted a BERT TensorFlow 1 checkpoint, pre-trained from scratch on a custom corpus and vocabulary with Google’s original BERT run_pretraining.py, to PyTorch via transformers-cli convert (a rough Python equivalent is sketched after this list)
  2. Used the datasets in this repo (I just uploaded them there): https://github.com/yigit353/turkish-bert-itu/tree/main/imst
  3. Used run_ner.py on the dataset with the following code:
python3 "$USER_ROOT/$LIB_DIR/run_ner.py" \
  --task_name=pos \
  --model_name_or_path "$USER_ROOT/$BERT_DIR/$TORCH_DIR" \
  --train_file "$USER_ROOT/$DATA_DIR/tr_imst-ud-train.conllu.json" \
  --validation_file "$USER_ROOT/$DATA_DIR/tr_imst-ud-dev.conllu.json" \
  --output_dir "$USER_ROOT/$DATA_DIR/$OUTPUT_DIR-$SEED" \
  --per_device_train_batch_size=$BATCH_SIZE \
  --num_train_epochs=$NUM_EPOCHS \
  --overwrite_cache=True \
  --do_train \
  --do_eval \
  --seed=$SEED \
  --fp16
  4. It worked well with the NER datasets (which are structured the same way as the PoS dataset) here: https://github.com/yigit353/turkish-bert-itu/tree/main/datasets/ner
  5. It also worked with this PyTorch model (both for PoS and NER, without errors or warnings): https://huggingface.co/dbmdz/bert-base-turkish-cased
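
For reference, a rough Python equivalent of the conversion in step 1 (only a sketch: the paths are placeholders, and it assumes the load_tf_weights_in_bert helper that transformers exports):

from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths; point these at the TF1 checkpoint produced by run_pretraining.py.
config = BertConfig.from_json_file("path/to/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "path/to/model.ckpt")

# save_pretrained writes pytorch_model.bin plus a fresh config.json into the target
# directory, rather than reusing a config.json copied from another checkpoint.
model.save_pretrained("path/to/torch-checkpoint")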

I also received the following warning for both the NER and PoS datasets: thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66

However, the NER task nonetheless worked with this script:

python3 "$USER_ROOT/$LIB_DIR/run_ner.py" \
  --model_name_or_path "$USER_ROOT/$BERT_DIR/$OUT_DIR/$TORCH_OUT_DIR" \
  --train_file "$USER_ROOT/$DATA_DIR/tr-data3/train.json" \
  --validation_file "$USER_ROOT/$DATA_DIR/tr-data3/dev.json" \
  --output_dir "$USER_ROOT/$DATA_DIR/$OUTPUT_DIR-$SEED" \
  --per_device_train_batch_size=$BATCH_SIZE \
  --num_train_epochs=$NUM_EPOCHS \
  --do_train \
  --do_eval \
  --fp16

Expected behavior

[INFO|trainer.py:837] 2021-02-28 16:04:10,685 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-28 16:04:10,685 >>   Num examples = 3664
[INFO|trainer.py:839] 2021-02-28 16:04:10,685 >>   Num Epochs = 10
[INFO|trainer.py:840] 2021-02-28 16:04:10,685 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-02-28 16:04:10,685 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:842] 2021-02-28 16:04:10,685 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-02-28 16:04:10,685 >>   Total optimization steps = 1150

  0%|          | 0/1150 [00:00<?, ?it/s]/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
  File "/okyanus/users/ctantug/transformers/examples/token-classification/run_ner.py", line 466, in <module>
    main()
  File "/okyanus/users/ctantug/transformers/examples/token-classification/run_ner.py", line 400, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1302, in training_step
    loss = self.compute_loss(model, inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1334, in compute_loss
    outputs = model(**inputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "<string>", line 7, in __init__
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/file_utils.py", line 1413, in __post_init__
    for element in iterator:
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.

The stack trace always gives a different error location.
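
Note that device-side asserts are reported asynchronously: the Python traceback points at whichever later CUDA call happened to synchronize, which is why the reported location moves around. A minimal sketch for surfacing the real call site (this has to run before anything initializes CUDA, e.g. at the very top of run_ner.py):

import os

# Force synchronous kernel launches so the failing op is reported where it happens.
# Must be set before the first CUDA initialization.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"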

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
yigit353 commented, Mar 1, 2021

Solved it! It turns out that in my config.json (which was copied from another PyTorch checkpoint) I also had to change label2id and id2label. That was totally unexpected. To match the 14 labels I changed the config file as follows:

{
  ...
  "architectures": [
    "BertForTokenClassification"
  ],
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13"
  },
  
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9,
    "LABEL_10": 10, 
    "LABEL_11": 11, 
    "LABEL_12": 12, 
    "LABEL_13": 13
  },
  ...
}

Thank you anyway.
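
For anyone hitting the same thing, a minimal sketch of setting these mappings programmatically instead of hand-editing config.json (the checkpoint path is a placeholder; the label list is the 14 PoS tags reported by run_ner.py in the comment below):

from transformers import AutoConfig

label_list = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
              "NOUN", "NUM", "PRON", "PROPN", "PUNCT", "VERB", "X"]

# Rewrite num_labels, id2label and label2id so they match the dataset's tag set.
config = AutoConfig.from_pretrained(
    "path/to/converted-bert",
    num_labels=len(label_list),
    id2label={i: label for i, label in enumerate(label_list)},
    label2id={label: i for i, label in enumerate(label_list)},
)
config.save_pretrained("path/to/converted-bert")  # overwrites config.json in place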

0 reactions
yigit353 commented, Mar 1, 2021

I had already checked the number of labels as the very first thing; that’s why it surprised me that this wasn’t the problem. I also ran the script without eval.

03/01/2021 17:45:18 - INFO - __main__ -   Label list ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PRON', 'PROPN', 'PUNCT', 'VERB', 'X']
03/01/2021 17:45:18 - INFO - __main__ -   Label to id {'ADJ': 0, 'ADP': 1, 'ADV': 2, 'AUX': 3, 'CCONJ': 4, 'DET': 5, 'INTJ': 6, 'NOUN': 7, 'NUM': 8, 'PRON': 9, 'PROPN': 10, 'PUNCT': 11, 'VERB': 12, 'X': 13}
03/01/2021 17:45:18 - INFO - __main__ -   Num labels 14

What might be another cause of this?
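
Since the assertion fires whenever a target id falls outside [0, num_labels), a quick sanity check is to compare the tag set in the training file against the label mapping in the converted model's config.json. This is only a sketch: the paths are placeholders, and it assumes the JSON-lines format used by run_ner.py with the tags in a pos_tags column.

import json
from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/converted-bert")  # placeholder path

# Collect the distinct tags from the training file (one JSON object per line,
# with the tags in a "pos_tags" column -- an assumption about the data format).
tags = set()
with open("tr_imst-ud-train.conllu.json") as f:
    for line in f:
        tags.update(json.loads(line)["pos_tags"])

print("distinct tags in the data:", len(tags))
print("num_labels in config.json:", config.num_labels)
print("entries in id2label:", len(config.id2label))
# The device-side assert is expected whenever the data needs more labels than
# the config declares, i.e. len(tags) > config.num_labels.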
