pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
Environment info
- transformers version: 4.3.3
- Platform: Linux-5.4.0-65-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.9
- PyTorch version (GPU?): 1.7.1 (False)
- Tensorflow version (GPU?): 2.5.0-dev20210225 (False)
- Using GPU in script?: 2 V100 32GB
- Using distributed or parallel set-up in script?: parallel
Who can help
Information
Model I am using (Bert, XLNet …): Bert
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
I used run_ner.py from the examples.
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
PoS Tagging task with the datasets: https://github.com/yigit353/turkish-bert-itu/tree/main/imst
To reproduce
Steps to reproduce the behavior:
- Converted a BERT TensorFlow 1 checkpoint, pre-trained from scratch on a custom corpus and vocabulary with Google's original BERT run_pretraining.py, to PyTorch via transformers-cli convert (a rough Python sketch of an equivalent conversion is given after this list)
- Used the datasets in this repo (I just uploaded them there): https://github.com/yigit353/turkish-bert-itu/tree/main/imst
- Used run_ner.py on the dataset with the following code:
python3 "$USER_ROOT/$LIB_DIR/run_ner.py" \
--task_name=pos \
--model_name_or_path "$USER_ROOT/$BERT_DIR/$TORCH_DIR" \
--train_file "$USER_ROOT/$DATA_DIR/tr_imst-ud-train.conllu.json" \
--validation_file "$USER_ROOT/$DATA_DIR/tr_imst-ud-dev.conllu.json" \
--output_dir "$USER_ROOT/$DATA_DIR/$OUTPUT_DIR-$SEED" \
--per_device_train_batch_size=$BATCH_SIZE \
--num_train_epochs=$NUM_EPOCHS \
--overwrite_cache=True \
--do_train \
--do_eval \
--seed=$SEED \
--fp16
- It worked well with the NER dataset (which is parallel to the PoS dataset) here: https://github.com/yigit353/turkish-bert-itu/tree/main/datasets/ner
- It also worked with this PyTorch model (for both PoS and NER, without errors or warnings): https://huggingface.co/dbmdz/bert-base-turkish-cased
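For reference, a rough Python sketch of what that conversion step does (the paths below are placeholders, not the ones from the actual run; transformers-cli convert with --model_type bert wraps essentially this logic):

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholders for the real TF1 checkpoint, config, and output paths
tf_checkpoint = "tf_checkpoint/bert_model.ckpt"
bert_config_file = "tf_checkpoint/bert_config.json"
pytorch_dump = "torch_model/pytorch_model.bin"

config = BertConfig.from_json_file(bert_config_file)   # architecture hyperparameters from the TF1 pre-training run
model = BertForPreTraining(config)                      # fresh PyTorch module with matching shapes
load_tf_weights_in_bert(model, config, tf_checkpoint)   # copy the TF1 variables into the PyTorch module
torch.save(model.state_dict(), pytorch_dump)            # write pytorch_model.bin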
I also receive the following warning for both the NER and PoS datasets:
thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
However, the NER task nonetheless worked with this script:
python3 "$USER_ROOT/$LIB_DIR/run_ner.py" \
--model_name_or_path "$USER_ROOT/$BERT_DIR/$OUT_DIR/$TORCH_OUT_DIR" \
--train_file "$USER_ROOT/$DATA_DIR/tr-data3/train.json" \
--validation_file "$USER_ROOT/$DATA_DIR/tr-data3/dev.json" \
--output_dir "$USER_ROOT/$DATA_DIR/$OUTPUT_DIR-$SEED" \
--per_device_train_batch_size=$BATCH_SIZE \
--num_train_epochs=$NUM_EPOCHS \
--do_train \
--do_eval \
--fp16
Expected behavior
Training should run without errors. Instead, it fails with the following output:
[INFO|trainer.py:837] 2021-02-28 16:04:10,685 >> ***** Running training *****
[INFO|trainer.py:838] 2021-02-28 16:04:10,685 >> Num examples = 3664
[INFO|trainer.py:839] 2021-02-28 16:04:10,685 >> Num Epochs = 10
[INFO|trainer.py:840] 2021-02-28 16:04:10,685 >> Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-02-28 16:04:10,685 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:842] 2021-02-28 16:04:10,685 >> Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-02-28 16:04:10,685 >> Total optimization steps = 1150
0%| | 0/1150 [00:00<?, ?it/s]/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "/okyanus/users/ctantug/transformers/examples/token-classification/run_ner.py", line 466, in <module>
main()
File "/okyanus/users/ctantug/transformers/examples/token-classification/run_ner.py", line 400, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 940, in train
tr_loss += self.training_step(model, inputs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1302, in training_step
loss = self.compute_loss(model, inputs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/trainer.py", line 1334, in compute_loss
outputs = model(**inputs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
return self.gather(outputs, self.output_device)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
return gather(outputs, output_device, dim=self.dim)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
for k in out))
File "<string>", line 7, in __init__
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/transformers/file_utils.py", line 1413, in __post_init__
for element in iterator:
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
for k in out))
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/okyanus/users/ctantug/.conda/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 230, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [16,0,0] Assertion `t >= 0 && t < n_classes` failed.
The stack trace points to a different error location each time.
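The assertion itself means that some target label id handed to the NLL/cross-entropy loss falls outside [0, num_labels). A minimal check of the processed data, assuming a hypothetical tokenized_dataset with a labels column (as produced by run_ner.py) and the usual -100 ignore index:

# Hedged sketch: confirm every label id the loss will see lies in [0, num_labels).
# `tokenized_dataset` and `num_labels` are placeholders for whatever the script built.
IGNORE_INDEX = -100  # padding/special tokens masked out of the loss

out_of_range = set()
for example in tokenized_dataset:
    for label_id in example["labels"]:
        if label_id != IGNORE_INDEX and not (0 <= label_id < num_labels):
            out_of_range.add(label_id)

print("out-of-range label ids:", sorted(out_of_range))  # any id listed here triggers this CUDA assert

Running with CUDA_LAUNCH_BLOCKING=1 (or on CPU) also makes the failing call deterministic, since device-side asserts are otherwise reported asynchronously.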
Top GitHub Comments
Solved it! It turns out that the label2id and id2label mappings in my config.json (which was copied from another PyTorch checkpoint) also had to be changed. That was totally unexpected. To match the 14 labels, I changed the config file as follows. Thank you anyway.
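A minimal sketch of that kind of change, with a placeholder list standing in for the actual 14 PoS tags (the real tag names come from the dataset):

# Hedged sketch: rebuild id2label/label2id so config.json matches the task's label set.
from transformers import AutoConfig

PLACEHOLDER_LABELS = [f"TAG_{i}" for i in range(14)]  # stand-in for the real PoS tag list

checkpoint_dir = "path/to/converted_checkpoint"  # placeholder path
config = AutoConfig.from_pretrained(checkpoint_dir)
config.id2label = {i: label for i, label in enumerate(PLACEHOLDER_LABELS)}
config.label2id = {label: i for i, label in enumerate(PLACEHOLDER_LABELS)}
config.save_pretrained(checkpoint_dir)  # rewrites config.json with the corrected mappings

The point is simply that the mappings copied over from an unrelated checkpoint have to agree with the 14-label PoS task; editing config.json by hand, as reported above, achieves the same thing.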
I had already checked the number of labels as the very first thing; that's why it surprised me that it was not the problem. I also ran the script without eval.
What might be another cause of this?