--fp16 causes an issue when running example scripts in distributed mode
See original GitHub issue
🐛 Bug
Information
Model I am using (Bert, XLNet …): roberta-large
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- the official example scripts
The tasks I am working on are:
- finetuning a LM with run_language_modeling.py and the SST-2 task with run_glue.py
- my own dataset
To reproduce
If I run either of the following commands, I get the error included below. However, if I remove --fp16, everything works normally. Likewise, if I keep --fp16 but run non-distributed, everything works normally. So it appears there is an issue with running --fp16 in a distributed fashion. I haven’t had an issue with this before, so I’m not sure what the problem is. Any ideas? Thanks in advance.
I installed apex in two different ways, but still get the same results.
# Install package required for fp16 computations (first approach)
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && python3 setup.py install --cuda_ext --cpp_ext

# Install package required for fp16 computations (second approach)
RUN git clone https://github.com/NVIDIA/apex.git \
    && cd apex \
    && pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
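A quick sanity check that the extensions actually compiled is to import them directly. The module names below (`apex_C`, `amp_C`) are assumptions based on how recent apex builds name their compiled extensions, so treat this as a rough sketch rather than an official check:

```python
# Rough check that apex was built with its C++/CUDA extensions.
# The extension module names (apex_C, amp_C) are assumptions based on
# how recent apex builds name their compiled extensions.
try:
    import apex_C   # built by --cpp_ext
    import amp_C    # built by --cuda_ext; amp's fused scaler kernels live here
    print("apex C++/CUDA extensions are importable")
except ImportError as exc:
    print(f"apex extensions missing; amp will fall back to Python ops: {exc}")
```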
python3 -m torch.distributed.launch --nproc_per_node 2 run_language_modeling.py --output_dir=/ptcc/shared/lm_roberta_20200528_164228 --model_type=roberta --do_train --train_data_file=/ptcc/data/train.txt --do_eval --eval_data_file=/ptcc/data/test.txt --evaluate_during_training --per_gpu_train_batch_size=2 --per_gpu_eval_batch_size=2 --learning_rate=5e-06 --model_name_or_path=roberta-large --mlm --max_steps=120000 --warmup_steps=10000 --save_steps=12000 --seed=42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_164228_tf_logs
python3 -m torch.distributed.launch --nproc_per_node 2 run_glue.py --model_type roberta --task_name SST-2 --do_train --do_eval --evaluate_during_training --data_dir /ptcc/data/ --per_gpu_train_batch_size 2 --per_gpu_eval_batch_size 2 --learning_rate 1e-06 --output_dir clf_roberta_20200528_162937 --model_name_or_path /ptcc/shared/lm_roberta_20200528_113420 --num_train_epochs 2.0 --save_steps 1000 --seed 42 --fp16 --logging_dir=/ptcc/shared/roberta_20200528_162937_tf_logs
ptcc_1 | 05/28/2020 20:30:38 - INFO - transformers.trainer - Starting fine-tuning.
Epoch: 0%| | 0/2 [00:00<?, ?it/s] Traceback (most recent call last):
ptcc_1 | File "/ptcc/run_glue.py", line 228, in <module>
ptcc_1 | main()
ptcc_1 | File "/ptcc/run_glue.py", line 160, in main
ptcc_1 | model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 470, in train
ptcc_1 | tr_loss += self._training_step(model, inputs, optimizer)
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 577, in _training_step
ptcc_1 | scaled_loss.backward()
ptcc_1 | File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
ptcc_1 | next(self.gen)
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 127, in scale_loss
ptcc_1 | should_skip = False if delay_overflow_check else loss_scaler.update_scale()
ptcc_1 | File "/usr/local/lib/python3.6/dist-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 200, in update_scale
ptcc_1 | self._has_overflow = self._overflow_buf.item()
ptcc_1 | RuntimeError: CUDA error: an illegal memory access was encountered
ptcc_1 | /usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
ptcc_1 | "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
ptcc_1 | terminate called after throwing an instance of 'c10::Error'
ptcc_1 | what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
ptcc_1 | frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f69777f6536 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1 | frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f6977a39fbe in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10_cuda.so)
ptcc_1 | frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f69777e6abd in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
ptcc_1 | frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f69c3926ef9 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f69c391c84a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f69c38fb7c2 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f69c32be466 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #7: <unknown function> + 0x87146b (0x7f69c38fc46b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #8: <unknown function> + 0x240500 (0x7f69c32cb500 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #9: <unknown function> + 0x24174e (0x7f69c32cc74e in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
ptcc_1 | frame #10: /usr/bin/python3() [0x572a27]
ptcc_1 | frame #11: /usr/bin/python3() [0x54eef2]
ptcc_1 | frame #12: /usr/bin/python3() [0x588948]
ptcc_1 | frame #13: /usr/bin/python3() [0x5ad438]
ptcc_1 | frame #14: /usr/bin/python3() [0x5ad44e]
ptcc_1 | frame #15: /usr/bin/python3() [0x5ad44e]
ptcc_1 | frame #16: /usr/bin/python3() [0x56b276]
ptcc_1 | frame #17: PyDict_SetItemString + 0x153 (0x5709f3 in /usr/bin/python3)
ptcc_1 | frame #18: PyImport_Cleanup + 0x76 (0x4f2fc6 in /usr/bin/python3)
ptcc_1 | frame #19: Py_FinalizeEx + 0x5e (0x637e2e in /usr/bin/python3)
ptcc_1 | frame #20: Py_Main + 0x395 (0x638e95 in /usr/bin/python3)
ptcc_1 | frame #21: main + 0xe0 (0x4b0d00 in /usr/bin/python3)
ptcc_1 | frame #22: __libc_start_main + 0xe7 (0x7f69e4727b97 in /lib/x86_64-linux-gnu/libc.so.6)
ptcc_1 | frame #23: _start + 0x2a (0x5b250a in /usr/bin/python3)
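As an aside, the UserWarning about `optimizer.step()` in the log above refers to the call ordering PyTorch has expected since 1.1.0. A minimal sketch of that ordering inside an apex amp training loop, using a toy model and placeholder hyperparameters rather than anything from the scripts above, would look roughly like this:

```python
# Toy loop showing the optimizer.step() / scheduler.step() ordering the
# warning refers to, combined with apex amp loss scaling. Assumes a CUDA
# device and a working apex install; model, data and hyperparameters are
# placeholders, not taken from run_glue.py or run_language_modeling.py.
import torch
from apex import amp

model = torch.nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for _ in range(3):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 8, device="cuda")).sum()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()   # step the optimizer first (PyTorch >= 1.1.0 ordering)...
    scheduler.step()   # ...then the LR scheduler, which avoids the warning
```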
Environment info
- transformers version: 2.10.0
- Platform: Linux-5.3.0-26-generic-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Y, 2 Tesla V100-SXM2
- Using distributed or parallel set-up in script?: Y, 2 Tesla V100-SXM2
Top GitHub Comments
@BramVanroy , I can confirm that the changes made in https://github.com/huggingface/transformers/pull/4728 successfully fix the apex issues with both a single and multiple GPUs. I’ve tested on 3 different machines. All ubuntu 18.04, but with different GPUs sets. 2 Tesla V100-SXM2, 2 P100-SXM2, and 2 Tesla M40. Thanks for your help.
Thanks @BramVanroy, your suggestion worked. I really appreciate it.
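For background, the apex documentation recommends calling `amp.initialize()` before wrapping the model in DistributedDataParallel; whether that ordering is exactly what https://github.com/huggingface/transformers/pull/4728 changes is not confirmed here. A rough sketch of the documented ordering, with a toy model and the assumption that LOCAL_RANK is set in the environment:

```python
# Sketch of the amp + DDP initialization order described in the apex docs:
# amp.initialize() first, DistributedDataParallel wrapping second.
# Assumes a launcher that sets LOCAL_RANK and the other distributed env vars
# (e.g. torch.distributed.launch --use_env, or torchrun); model and optimizer
# are toy placeholders.
import os
import torch
import torch.distributed as dist
from apex import amp

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(8, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 1) let amp patch the model and optimizer for mixed precision
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# 2) only then wrap the amp-patched model for distributed training
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank
)
```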