[BUG] Unable to train on multiple GPUs in EC2 Terminal #1821
- I have checked that this bug exists on the latest stable version of AutoGluon
- and/or I have checked that this bug exists on the latest mainline of AutoGluon via source installation
Describe the bug
This error may have the same root cause as in Issue #1650.
I am training an AutoMMPredictor with AutoGluon 0.4.3 on a P3dn.24xlarge instance from the EC2 terminal. With env.num_gpus set to -1/2/3/…, I get an error when the multiprocessing workers are spawned, but training works fine with env.num_gpus: 1. It also works fine if I put the code under if __name__ == "__main__".
Expected behavior
Training with multiple GPUs should work in a script without an if __name__ == "__main__" guard.
To Reproduce
- EC2 P3dn.24xlarge
- python 3.8
- conda create --name <env_name> python=3.8
- bash full_install.sh
- python automm_distillation.py
Code that raises the RuntimeError:
from autogluon.text.automm import AutoMMPredictor
from datasets import load_dataset
dataset = load_dataset('glue', 'mnli')
train_dataset = dataset['train']
train_df = train_dataset.to_pandas()
train_df = train_df.drop('idx', axis=1)
valid_df = dataset['validation_matched'].to_pandas()
test_df = dataset['test_matched'].to_pandas()
predictor = AutoMMPredictor(label='label')
predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)
predictions = predictor.predict(valid_df)
prediction_prob = predictor.predict_proba(valid_df)
from autogluon.text.automm.constants import (
    MODEL,
    DATA,
    OPTIMIZATION,
    ENVIRONMENT,
    DISTILLER
)
student_predictor = AutoMMPredictor(label="label")
config = {
    MODEL: "fusion_mlp_image_text_tabular",
    DATA: "default",
    DISTILLER: 'default',
    OPTIMIZATION: "adamw",
    ENVIRONMENT: "default",
}
student_predictor.fit(
    train_df,
    config=config,
    hyperparameters={'env.num_gpus': -1, 'optimization.max_epochs': 5,
                     'model.hf_text.checkpoint_name': 'huawei-noah/TinyBERT_General_4L_312D'},
    teacher_predictor=predictor,
    time_limit=10
)
student_predictions = student_predictor.predict(valid_df)
Code that works fine:
from autogluon.text.automm import AutoMMPredictor
from datasets import load_dataset
#
if __name__ == '__main__':
    dataset = load_dataset('glue', 'mnli')
    train_dataset = dataset['train']
    train_df = train_dataset.to_pandas()
    train_df = train_df.drop('idx', axis=1)
    valid_df = dataset['validation_matched'].to_pandas()
    test_df = dataset['test_matched'].to_pandas()
    predictor = AutoMMPredictor(label='label')
    predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)
    predictions = predictor.predict(valid_df)
    prediction_prob = predictor.predict_proba(valid_df)
    from autogluon.text.automm.constants import (
        MODEL,
        DATA,
        OPTIMIZATION,
        ENVIRONMENT,
        DISTILLER
    )
    student_predictor = AutoMMPredictor(label="label")
    config = {
        MODEL: "fusion_mlp_image_text_tabular",
        DATA: "default",
        DISTILLER: 'default',
        OPTIMIZATION: "adamw",
        ENVIRONMENT: "default",
    }
    student_predictor.fit(
        train_df,
        config=config,
        hyperparameters={'env.num_gpus': -1, 'optimization.max_epochs': 5,
                         'model.hf_text.checkpoint_name': 'huawei-noah/TinyBERT_General_4L_312D'},
        teacher_predictor=predictor,
        time_limit=10
    )
    student_predictions = student_predictor.predict(valid_df)
Screenshots Error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/media/code/autogluon/examples/automm/test_distillation.py", line 15, in <module>
predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)
File "/media/code/autogluon/text/src/autogluon/text/automm/predictor.py", line 469, in fit
self._fit(**_fit_args)
File "/media/code/autogluon/text/src/autogluon/text/automm/predictor.py", line 992, in _fit
trainer.fit(
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training
self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn
mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
process.start()
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
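The traceback shows the child process re-running the main script through runpy.run_path before any training starts. Below is a minimal sketch of that mechanism with no GPUs or AutoGluon involved. The script text, the file name train_script.py, and the run_as helper are made up for illustration. It relies on the fact that CPython's spawn start method executes the main file under the module name __mp_main__ rather than __main__.

```python
import os
import runpy
import tempfile

# Hypothetical stand-in for a training script: one unguarded statement
# and one statement behind the __main__ guard.
SCRIPT = '''
events = []
events.append("top-level code ran")
if __name__ == "__main__":
    events.append("guarded code ran")
'''


def run_as(run_name):
    """Execute SCRIPT from a temporary file under the given module name."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train_script.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        # run_path returns the globals of the executed module.
        module_globals = runpy.run_path(path, run_name=run_name)
    return module_globals["events"]


# What happens when the script is launched directly.
as_entry_point = run_as("__main__")
# What a spawned worker does when it re-imports the parent's main file.
as_spawned_child = run_as("__mp_main__")

print(as_entry_point)    # ['top-level code ran', 'guarded code ran']
print(as_spawned_child)  # ['top-level code ran']
```

In the failing script, predictor.fit sits at the top level, so every spawned worker reaches it again while importing the main file. multiprocessing detects the nested process start and raises the RuntimeError above. Code behind the guard is skipped in the worker, which matches the observation that the guarded version trains fine.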
Installed Versions 0.4.3
INSTALLED VERSIONS
------------------
date : 2022-06-14
time : 23:01:00.752658
python : 3.8.13.final.0
OS : Linux
OS-release : 5.4.0-1072-aws
Version : #77~18.04.1-Ubuntu SMP Thu Apr 7 21:38:47 UTC 2022
machine : x86_64
processor : x86_64
num_cores : 96
cpu_ram_mb : 765710
cuda version : 11.450.142.00
num_gpus : 8
gpu_ram_mb : [25846, 22929, 21363, 22455, 21429, 23105, 20799, 16888]
avail_disk_size_mb : 24049
autogluon.common : 0.4.3b20220614
autogluon.core : 0.4.3b20220614
autogluon.features : 0.4.3b20220614
autogluon.tabular : 0.4.3b20220614
autogluon.text : 0.4.3b20220614
autogluon.timeseries : 0.4.3b20220614
autogluon.vision : 0.4.3b20220614
autogluon_contrib_nlp: 0.0.1
boto3 : 1.24.7
catboost : 1.0.6
dask : 2021.11.2
distributed : 2021.11.2
fairscale : 0.4.6
fastai : 2.5.6
gluoncv : 0.11.0
gluonts : 0.9.4
hyperopt : 0.2.7
lightgbm : 3.3.2
matplotlib : 3.5.2
mxnet : 1.9.1
networkx : 2.8.3
nlpaug : 1.1.10
nltk : 3.7
nptyping : 1.4.4
numpy : 1.22.4
omegaconf : 2.1.2
pandas : 1.3.5
PIL : 9.0.1
protobuf : None
psutil : 5.8.0
pytorch_lightning : 1.5.10
ray : 1.12.1
ray-lightning : None
requests : 2.28.0
scipy : 1.7.3
sentencepiece : None
setuptools : 59.5.0
skimage : 0.19.3
sklearn : 1.0.2
smart_open : 5.2.1
timm : 0.5.4
torch : 1.10.2+cu102
torchmetrics : 0.7.3
tqdm : 4.64.0
transformers : 4.16.2
xgboost : 1.4.2
Issue Analytics
- State:
- Created a year ago
- Comments:8 (2 by maintainers)
Would you try to add 'env.strategy': 'ddp'? The key is defined in https://github.com/awslabs/autogluon/blob/1543041c1e924f2bdb697be64ab61cd839cb7838/text/src/autogluon/text/automm/configs/environment/default.yaml#L14
You can enable that by changing
predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)
to
predictor.fit(train_df, hyperparameters={'env.num_gpus': -1, 'env.strategy': 'ddp'}, time_limit=10)

Hello, I'm a beginner and have a similar problem when I run on multiple GPUs on Windows. With 'env.strategy': 'ddp' and 'env.num_gpus': -1, the error is "RuntimeError: Distributed package doesn't have NCCL built in". If you know the answer, please let me know. Thank you very much for your help.
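On the NCCL error: PyTorch builds for Windows do not ship the NCCL backend, so distributed training there generally has to use gloo instead. The sketch below only illustrates how to check which backend is usable. pick_distributed_backend is a hypothetical helper, not part of AutoGluon or PyTorch. The torch.distributed.is_available and torch.distributed.is_nccl_available calls are real PyTorch functions.

```python
import sys


def pick_distributed_backend(platform=None, nccl_available=None):
    """Return "nccl" when it is usable, otherwise fall back to "gloo".

    Hypothetical helper for illustration only. Both arguments can be
    supplied explicitly; when omitted they are detected from the
    current environment.
    """
    if platform is None:
        platform = sys.platform
    if nccl_available is None:
        try:
            import torch.distributed as dist
            nccl_available = dist.is_available() and dist.is_nccl_available()
        except ImportError:
            # PyTorch is not installed here, so NCCL cannot be used.
            nccl_available = False
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


# Explicit inputs, so the result does not depend on this machine.
windows_backend = pick_distributed_backend(platform="win32", nccl_available=False)
linux_backend = pick_distributed_backend(platform="linux", nccl_available=True)
print(windows_backend, linux_backend)

# Detection for the machine the script is running on.
print(pick_distributed_backend())
```

How to pass the chosen backend through AutoGluon is not confirmed in this thread. PyTorch Lightning 1.5 documents a PL_TORCH_DISTRIBUTED_BACKEND environment variable for selecting gloo, which may be worth trying, but treat that as an assumption to verify rather than a confirmed fix.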