--fp16 causes an issue when running example scripts in distributed mode
See original GitHub issue
🐛 Bug
Information
Model I am using (Bert, XLNet …): roberta-large
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts
The tasks I am working on are:
- finetuning a LM with run_language_modeling.py and the SST-2 task with run_glue.py
- my own dataset
To reproduce
If I run either of the following commands, I get the error included below. However, if I remove --fp16, everything works normally. Likewise, if I keep --fp16 but run non-distributed, everything works normally. So it appears there is an issue with running --fp16 in a distributed fashion. I haven’t had an issue with this before, so I’m not sure what the problem is. Any ideas? Thanks in advance.
I installed apex in two different ways, but still get the same results.
# Install package required for fp16 computations (first approach)
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && python3 setup.py install --cuda_ext --cpp_ext

# Install package required for fp16 computations (second approach)
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
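A quick sanity check that the extensions actually compiled is to import them directly. The module names below (`apex_C`, `amp_C`) are assumptions based on how recent apex builds name their compiled extensions, so treat this as a rough sketch rather than an official check:

```python
# Rough check that apex was built with its C++/CUDA extensions.
# The extension module names (apex_C, amp_C) are assumptions based on
# how recent apex builds name their compiled extensions.
try:
    import apex_C   # built by --cpp_ext
    import amp_C    # built by --cuda_ext; amp's fused scaler kernels live here
    print("apex C++/CUDA extensions are importable")
except ImportError as exc:
    print(f"apex extensions missing; amp will fall back to Python ops: {exc}")
```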
python3 -m torch.distributed.launch --nproc_per_node 2 run_language_modeling.py --output_dir=/ptcc/shared/lm_roberta_20200528_164228 --model_type=roberta --do_train --train_data_file=/ptcc/data/train.txt --do_eval --eval_data_file=/ptcc/data/test.txt --evaluate_during_training --per_gpu_train_batch_size=2 --per_gpu_eval_batch_size=2 --learning_rate=5e-06 --model_name_or_path=roberta-large --mlm --max_steps=120000 --warmup_steps=10000 --save_steps=12000 --seed=42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_164228_tf_logs
python3 -m torch.distributed.launch --nproc_per_node 2 run_glue.py --model_type roberta --task_name SST-2 --do_train --do_eval --evaluate_during_training --data_dir /ptcc/data/ --per_gpu_train_batch_size 2 --per_gpu_eval_batch_size 2 --learning_rate 1e-06 --output_dir clf_roberta_20200528_162937 --model_name_or_path /ptcc/shared/lm_roberta_20200528_113420 --num_train_epochs 2.0 --save_steps 1000 --seed 42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_162937_tf_logs
ptcc_1 | 05/28/2020 20:30:38 - INFO - transformers.trainer - Starting fine-tuning.
Epoch: 0%| | 0/2 [00:00<?, ?it/s] Traceback (most recent call last):
ptcc_1 | File "/ptcc/run_glue.py", line 228, in <module>
ptcc_1 | main()
ptcc_1 | File "/ptcc/run_glue.py", line 160, in main
ptcc_1 | model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 470, in train
ptcc_1 | tr_loss += self._training_step(model, inputs, optimizer)
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in _training_step
ptcc_1 | scaled_loss.backward()
ptcc_1 | File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
ptcc_1 | next(self.gen)
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 127, in scale_loss
ptcc_1 | should_skip = False if delay_overflow_check else loss_scaler.update_scale()
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 200, in update_scale
ptcc_1 | self._has_overflow = self._overflow_buf.item()
ptcc_1 | RuntimeError: CUDA error: an illegal memory access was encountered
ptcc_1 | /usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
ptcc_1 | "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
ptcc_1 | terminate called after throwing an instance of 'c10::Error'
ptcc_1 | what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
ptcc_1 | frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f69777f6536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1 | frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f6977a39fbe in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
ptcc_1 | frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f69777e6abd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1 | frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f69c3926ef9 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f69c391c84a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f69c38fb7c2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f69c32be466 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #7: <unknown function> + 0x87146b (0x7f69c38fc46b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #8: <unknown function> + 0x240500 (0x7f69c32cb500 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #9: <unknown function> + 0x24174e (0x7f69c32cc74e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #10: /usr/bin/python3() [0x572a27]
ptcc_1 | frame #11: /usr/bin/python3() [0x54eef2]
ptcc_1 | frame #12: /usr/bin/python3() [0x588948]
ptcc_1 | frame #13: /usr/bin/python3() [0x5ad438]
ptcc_1 | frame #14: /usr/bin/python3() [0x5ad44e]
ptcc_1 | frame #15: /usr/bin/python3() [0x5ad44e]
ptcc_1 | frame #16: /usr/bin/python3() [0x56b276]
ptcc_1 | frame #17: PyDict_SetItemString + 0x153 (0x5709f3 in /usr/bin/python3)
ptcc_1 | frame #18: PyImport_Cleanup + 0x76 (0x4f2fc6 in /usr/bin/python3)
ptcc_1 | frame #19: Py_FinalizeEx + 0x5e (0x637e2e in /usr/bin/python3)
ptcc_1 | frame #20: Py_Main + 0x395 (0x638e95 in /usr/bin/python3)
ptcc_1 | frame #21: main + 0xe0 (0x4b0d00 in /usr/bin/python3)
ptcc_1 | frame #22: __libc_start_main + 0xe7 (0x7f69e4727b97 in /lib/x86_64-linux-gnu/libc.so.6)
ptcc_1 | frame #23: _start + 0x2a (0x5b250a in /usr/bin/python3)
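As an aside, the UserWarning about `optimizer.step()` in the log above refers to the call ordering PyTorch has expected since 1.1.0. A minimal sketch of that ordering inside an apex amp training loop, using a toy model and placeholder hyperparameters rather than anything from the scripts above, would look roughly like this:

```python
# Toy loop showing the optimizer.step() / scheduler.step() ordering the
# warning refers to, combined with apex amp loss scaling. Assumes a CUDA
# device and a working apex install; model, data and hyperparameters are
# placeholders, not taken from run_glue.py or run_language_modeling.py.
import torch
from apex import amp

model = torch.nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8, device="cuda")).sum()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()   # step the optimizer first (PyTorch >= 1.1.0 ordering)...
    scheduler.step()   # ...then the LR scheduler, which avoids the warning
```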
Environment info
- transformers version: 2.10.0
- Platform: Linux-5.3.0-26-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Y, 2 Tesla V100-SXM2
- Using distributed or parallel set-up in script?: Y, 2 Tesla V100-SXM2
Top GitHub Comments
@BramVanroy , I can confirm that the changes made in https://github.com/huggingface/transformers/pull/4728 successfully fix the apex issues with both a single and multiple GPUs. I’ve tested on 3 different machines. All ubuntu 18.04, but with different GPUs sets. 2 Tesla V100-SXM2, 2 P100-SXM2, and 2 Tesla M40. Thanks for your help.
Thanks @BramVanroy, your suggestion worked. I really appreciate it.
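For background, the apex documentation recommends calling `amp.initialize()` before wrapping the model in DistributedDataParallel; whether that ordering is exactly what https://github.com/huggingface/transformers/pull/4728 changes is not confirmed here. A rough sketch of the documented ordering, with a toy model and the assumption that LOCAL_RANK is set in the environment:

```python
# Sketch of the amp + DDP initialization order described in the apex docs:
# amp.initialize() first, DistributedDataParallel wrapping second.
# Assumes a launcher that sets LOCAL_RANK and the other distributed env vars
# (e.g. torch.distributed.launch --use_env, or torchrun); model and optimizer
# are toy placeholders.
import os
import torch
import torch.distributed as dist
from apex import amp

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 1) let amp patch the model and optimizer for mixed precision
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# 2) only then wrap the amp-patched model for distributed training
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank
)
```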