
--fp16 causes an issue when running example scripts in distributed mode

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): roberta-large
Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • the official example scripts

The tasks I am working on are:

  • Finetuning a LM with run_language_modeling.py and the SST-2 task with run_glue.py
  • my own dataset

To reproduce

If I run either of the following commands, I get the error included below. However, if I remove --fp16, everything works normally. Also, if I keep --fp16 but run non-distributed, everything works normally. So, it appears there is an issue with running --fp16 in a distributed fashion. I haven’t had an issue with this before, so I’m not sure what the problem is. Any ideas? Thanks in advance.

I installed apex in two different ways, but still get the same results.

#Install package required for fp16 computations
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && python3 setup.py install --cuda_ext --cpp_ext
#Install package required for fp16 computations
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
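
As a quick sanity check that the compiled extensions actually built (this snippet is my own addition for illustration, not part of the original report), the following should import cleanly inside the training image; amp_C and apex_C are the modules apex builds with --cuda_ext and --cpp_ext respectively:

# Sanity check: confirm apex and its compiled extensions are importable
from apex import amp   # pure-Python amp API
import amp_C           # built by --cuda_ext; ImportError means the CUDA extension is missing
import apex_C          # built by --cpp_ext; ImportError means the C++ extension is missing
print("apex amp and its compiled extensions import cleanly")

The two commands below then reproduce the error.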
python3 -m torch.distributed.launch --nproc_per_node 2 run_language_modeling.py --output_dir=/ptcc/shared/lm_roberta_20200528_164228 --model_type=roberta --do_train --train_data_file=/ptcc/data/train.txt --do_eval --eval_data_file=/ptcc/data/test.txt --evaluate_during_training --per_gpu_train_batch_size=2 --per_gpu_eval_batch_size=2 --learning_rate=5e-06 --model_name_or_path=roberta-large --mlm --max_steps=120000 --warmup_steps=10000 --save_steps=12000 --seed=42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_164228_tf_logs
python3 -m torch.distributed.launch --nproc_per_node 2 run_glue.py --model_type roberta --task_name SST-2 --do_train --do_eval --evaluate_during_training --data_dir /ptcc/data/ --per_gpu_train_batch_size 2 --per_gpu_eval_batch_size 2 --learning_rate 1e-06 --output_dir clf_roberta_20200528_162937 --model_name_or_path /ptcc/shared/lm_roberta_20200528_113420 --num_train_epochs 2.0 --save_steps 1000 --seed 42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_162937_tf_logs
ptcc_1  | 05/28/2020 20:30:38 - INFO - transformers.trainer -     Starting fine-tuning.
Epoch:   0%|          | 0/2 [00:00<?, ?it/s]       Traceback (most recent call last):
ptcc_1  |   File "/ptcc/run_glue.py", line 228, in <module>
ptcc_1  |     main()
ptcc_1  |   File "/ptcc/run_glue.py", line 160, in main
ptcc_1  |     model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
ptcc_1  |   File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 470, in train
ptcc_1  |     tr_loss += self._training_step(model, inputs, optimizer)
ptcc_1  |   File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in _training_step
ptcc_1  |     scaled_loss.backward()
ptcc_1  |   File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
ptcc_1  |     next(self.gen)
ptcc_1  |   File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 127, in scale_loss
ptcc_1  |     should_skip = False if delay_overflow_check else loss_scaler.update_scale()
ptcc_1  |   File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 200, in update_scale
ptcc_1  |     self._has_overflow = self._overflow_buf.item()
ptcc_1  | RuntimeError: CUDA error: an illegal memory access was encountered
ptcc_1  | /usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
ptcc_1  |   "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
ptcc_1  |                                                  terminate called after throwing an instance of 'c10::Error'
ptcc_1  |   what():  CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
ptcc_1  | frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f69777f6536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1  | frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f6977a39fbe in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
ptcc_1  | frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f69777e6abd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1  | frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f69c3926ef9 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f69c391c84a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f69c38fb7c2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f69c32be466 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #7: <unknown function> + 0x87146b (0x7f69c38fc46b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #8: <unknown function> + 0x240500 (0x7f69c32cb500 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #9: <unknown function> + 0x24174e (0x7f69c32cc74e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1  | frame #10: /usr/bin/python3() [0x572a27]
ptcc_1  | frame #11: /usr/bin/python3() [0x54eef2]
ptcc_1  | frame #12: /usr/bin/python3() [0x588948]
ptcc_1  | frame #13: /usr/bin/python3() [0x5ad438]
ptcc_1  | frame #14: /usr/bin/python3() [0x5ad44e]
ptcc_1  | frame #15: /usr/bin/python3() [0x5ad44e]
ptcc_1  | frame #16: /usr/bin/python3() [0x56b276]
ptcc_1  | frame #17: PyDict_SetItemString + 0x153 (0x5709f3 in /usr/bin/python3)
ptcc_1  | frame #18: PyImport_Cleanup + 0x76 (0x4f2fc6 in /usr/bin/python3)
ptcc_1  | frame #19: Py_FinalizeEx + 0x5e (0x637e2e in /usr/bin/python3)
ptcc_1  | frame #20: Py_Main + 0x395 (0x638e95 in /usr/bin/python3)
ptcc_1  | frame #21: main + 0xe0 (0x4b0d00 in /usr/bin/python3)
ptcc_1  | frame #22: __libc_start_main + 0xe7 (0x7f69e4727b97 in /lib/x86_64-linux-gnu/libc.so.6)
ptcc_1  | frame #23: _start + 0x2a (0x5b250a in /usr/bin/python3)
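
For reference, the code path that fails here is the standard apex amp + DistributedDataParallel pattern that --fp16 enables inside Trainer. The sketch below is an illustration of that pattern only (not the actual transformers code; model, optimizer, and data loading are elided) and shows where the scale_loss context manager from the traceback sits:

import torch
import torch.distributed as dist
from apex import amp

def train_distributed_fp16(local_rank, model, optimizer, train_loader):
    # Each process must bind to its own GPU before doing any CUDA work;
    # a device mismatch here is a classic source of "illegal memory access".
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    model = model.cuda(local_rank)

    # amp.initialize has to run before the model is wrapped in DDP.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )

    for batch in train_loader:
        inputs = {k: v.cuda(local_rank) for k, v in batch.items()}
        loss = model(**inputs)[0]
        # The scale_loss context manager that appears in the traceback above.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()

The per-process torch.cuda.set_device call and the initialize-before-DDP order are worth double-checking, since getting either wrong commonly produces exactly this kind of CUDA error in distributed fp16 runs.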

Environment info

  • transformers version: 2.10.0
  • Platform: Linux-5.3.0-26-generic-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Y, 2 Tesla V100-SXM2
  • Using distributed or parallel set-up in script?: Y, 2 Tesla V100-SXM2

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
CMobley7 commented, Jun 12, 2020

@BramVanroy , I can confirm that the changes made in https://github.com/huggingface/transformers/pull/4728 successfully fix the apex issues with both single and multiple GPUs. I’ve tested on 3 different machines, all Ubuntu 18.04 but with different GPU sets: 2 Tesla V100-SXM2, 2 P100-SXM2, and 2 Tesla M40. Thanks for your help.

1 reaction
CMobley7 commented, Jun 2, 2020

Thanks @BramVanroy , your suggestion worked. I really appreciate it.

Read more comments on GitHub >

Top Results From Across the Web

Running TLC in Distributed Mode - TLA+
You start a TLC run in distributed mode as usual by clicking on the button, by selecting Run model on the TLC Model...
Read more >
Distributed Training Randomly Stops During the ... - GitHub
It works for the worker but not for the master. It seems master is stopped in the C code somewhere. All processes are...
Read more >
Distributed training questions - Gluon - Apache MXNet Forum
A general question between modes dist_sync and dist_async : it is my understanding that dist_sync is used as in a single machine training...
Read more >
Running Kafka Connect - Standalone vs Distributed Mode ...
In this post, we'll go through examples of running Kafka Connect in both Standalone and Distributed mode. Distributed mode is recommended when running...
Read more >
How distributed training works in Pytorch - AI Summer
Learn how distributed training works in pytorch: data parallel, distributed data parallel and automatic mixed precision.
Read more >
