[BUG] Unable to train on multiple GPUs in EC2 Terminal

  • I have checked that this bug exists on the latest stable version of AutoGluon
  • and/or I have checked that this bug exists on the latest mainline of AutoGluon via source installation

Describe the bug This error may have the same root cause as Issue #1650. With AutoGluon 0.4.3, training an AutoMMPredictor on a P3dn.24xlarge instance from an EC2 terminal with env.num_gpus set to -1/2/3/… raises an error while spawning the multiprocessing workers, but it works fine with env.num_gpus: 1. It also works fine if the code is placed under an if __name__ == "__main__" guard.
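
For context, here is a minimal sketch (my illustration, not code from the issue) of why the guard matters with PyTorch's spawn start method: each spawned child re-imports the main module, so a multi-GPU fit call at module level runs again before bootstrapping finishes.

    import torch.multiprocessing as mp

    def worker(rank):
        # Stand-in for the per-GPU process that pytorch_lightning's
        # ddp_spawn plugin launches; it only prints here.
        print(f"worker {rank} running")

    # With the "spawn" start method, each child re-imports the main module.
    # If mp.spawn() (or a multi-GPU predictor.fit()) sits at module level,
    # the re-import reaches it again before bootstrapping finishes and
    # Python raises the RuntimeError shown in the traceback below.
    if __name__ == "__main__":
        mp.spawn(worker, nprocs=2)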

Expected behavior Training with multiple GPUs should work in a script even without an if __name__ == "__main__" guard.

To Reproduce On an EC2 P3dn.24xlarge instance with Python 3.8:

    conda create --name <env_name> python=3.8
    bash full_install.sh
    python automm_distillation.py

Code that raises the RuntimeError:

    from autogluon.text.automm import AutoMMPredictor
    from datasets import load_dataset
     
    dataset = load_dataset('glue', 'mnli')
    train_dataset = dataset['train']
     
    train_df = train_dataset.to_pandas()
    train_df = train_df.drop('idx', axis=1)
     
    valid_df = dataset['validation_matched'].to_pandas()
     
    test_df = dataset['test_matched'].to_pandas()
     
    predictor = AutoMMPredictor(label='label')
    predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)
     
    predictions = predictor.predict(valid_df)
    prediction_prob = predictor.predict_proba(valid_df)
     
    from autogluon.text.automm.constants import (
        MODEL,
        DATA,
        OPTIMIZATION,
        ENVIRONMENT,
        DISTILLER
    )
     
    student_predictor = AutoMMPredictor(label="label")
     
    config = {
        MODEL: f"fusion_mlp_image_text_tabular",
        DATA: "default",
        DISTILLER: 'default',
        OPTIMIZATION: "adamw",
        ENVIRONMENT: "default",
    }
     
    student_predictor.fit(
        train_df,
        config=config,
        hyperparameters={'env.num_gpus': -1, 'optimization.max_epochs': 5,
                         'model.hf_text.checkpoint_name': 'huawei-noah/TinyBERT_General_4L_312D'},
        teacher_predictor=predictor,
        time_limit=10
    )
     
    student_predictions = student_predictor.predict(valid_df)

Code that works fine:

from autogluon.text.automm import AutoMMPredictor
from datasets import load_dataset
#
if __name__ == '__main__':
    dataset = load_dataset('glue', 'mnli')
    train_dataset = dataset['train']

    train_df = train_dataset.to_pandas()
    train_df = train_df.drop('idx', axis=1)

    valid_df = dataset['validation_matched'].to_pandas()

    test_df = dataset['test_matched'].to_pandas()

    predictor = AutoMMPredictor(label='label')
    predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)

    predictions = predictor.predict(valid_df)
    prediction_prob = predictor.predict_proba(valid_df)

    from autogluon.text.automm.constants import (
        MODEL,
        DATA,
        OPTIMIZATION,
        ENVIRONMENT,
        DISTILLER
    )

    student_predictor = AutoMMPredictor(label="label")

    config = {
        MODEL: f"fusion_mlp_image_text_tabular",
        DATA: "default",
        DISTILLER: 'default',
        OPTIMIZATION: "adamw",
        ENVIRONMENT: "default",
    }

    student_predictor.fit(
        train_df,
        config=config,
        hyperparameters={'env.num_gpus': -1, 'optimization.max_epochs': 5,
                         'model.hf_text.checkpoint_name': 'huawei-noah/TinyBERT_General_4L_312D'},
        teacher_predictor=predictor,
        time_limit=10
    )

    student_predictions = student_predictor.predict(valid_df)

Screenshots Error traceback:

Traceback (most recent call last):                                                                                                                                                        
  File "<string>", line 1, in <module>                                                                                                                                                    
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main                                                                            
    exitcode = _main(fd, parent_sentinel)                                                                                                                                                 
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 125, in _main                                                                                 
    prepare(preparation_data)                                                                                                                                                             
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare                                                                               
    _fixup_main_from_path(data['init_main_from_path'])                                                                                                                                    
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path                                                                 
    main_content = runpy.run_path(main_path,                                                                                                                                              
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 265, in run_path                                                                                              
    return _run_module_code(code, init_globals, run_name,                                                                                                                                 
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 97, in _run_module_code                                                                                       
    _run_code(code, mod_globals, init_globals,                                                                                                                                            
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/runpy.py", line 87, in _run_code                                                                                              
    exec(code, run_globals)                                                                                                                                                               
  File "/media/code/autogluon/examples/automm/test_distillation.py", line 15, in <module>                                                                                                 
    predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10)                                                                                                          
  File "/media/code/autogluon/text/src/autogluon/text/automm/predictor.py", line 469, in fit                                                                                              
    self._fit(**_fit_args)                                                                                                                                                                
  File "/media/code/autogluon/text/src/autogluon/text/automm/predictor.py", line 992, in _fit                                                                                             
    trainer.fit(                                                                                                                                                                          
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit                                                         
    self._call_and_handle_interrupt(                                                                                                                                                      
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt                                  
    return trainer_fn(*args, **kwargs)                                                                                                                                                    
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl                                                   
    self._run(model, ckpt_path=ckpt_path)                                                                                                                                                 
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run                                                       
    self._dispatch()                                                                                                                                                                      
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch                                                  
    self.training_type_plugin.start_training(self)                                                                                                                                        
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 173, in start_training                              
    self.spawn(self.new_process, trainer, self.mp_queue, return_result=False)                                                                                                             
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in spawn                                       
    mp.spawn(self._wrapped_function, args=(function, args, kwargs, return_queue), nprocs=self.num_processes)                                                                              
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn                                                             
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')                                                                                                          
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes                                                   
    process.start()                                                                                                                                                                       
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/process.py", line 121, in start                                                                               
    self._popen = self._Popen(self)                                                                                                                                                       
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/context.py", line 284, in _Popen                                                                              
    return Popen(process_obj)                                                                                                                                                             
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__                                                                   
    super().__init__(process_obj)                                                                                                                                                         
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__                                                                          
    self._launch(process_obj)                                                                                                                                                             
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch                                                                    
    prep_data = spawn.get_preparation_data(process_obj._name)                                                                                                                             
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data                                                                  
    _check_not_importing_main()                                                                                                                                                           
  File "/home/ubuntu/anaconda3/envs/autogluon/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main                                                             
    raise RuntimeError('''                                                                                                                                                                
RuntimeError:                                                                                                                                                                             
        An attempt has been made to start a new process before the                                                                                                                        
        current process has finished its bootstrapping phase.                                                                                                                             
                                                                                                                                                                                          
        This probably means that you are not using fork to start your                                                                                                                     
        child processes and you have forgotten to use the proper idiom                                                                                                                    
        in the main module:                                                                                                                                                               
                                                                                                                                                                                          
            if __name__ == '__main__':                                                                                                                                                    
                freeze_support()                                                                                                                                                          
                ...                                                                                                                                                                       
                                                                                                                                                                                          
        The "freeze_support()" line can be omitted if the program                                                                                                                         
        is not going to be frozen to produce an executable.

Installed Versions 0.4.3

INSTALLED VERSIONS
------------------
date                 : 2022-06-14
time                 : 23:01:00.752658
python               : 3.8.13.final.0
OS                   : Linux
OS-release           : 5.4.0-1072-aws
Version              : #77~18.04.1-Ubuntu SMP Thu Apr 7 21:38:47 UTC 2022
machine              : x86_64
processor            : x86_64
num_cores            : 96
cpu_ram_mb           : 765710
cuda version         : 11.450.142.00
num_gpus             : 8
gpu_ram_mb           : [25846, 22929, 21363, 22455, 21429, 23105, 20799, 16888]
avail_disk_size_mb   : 24049

autogluon.common     : 0.4.3b20220614
autogluon.core       : 0.4.3b20220614
autogluon.features   : 0.4.3b20220614
autogluon.tabular    : 0.4.3b20220614
autogluon.text       : 0.4.3b20220614
autogluon.timeseries : 0.4.3b20220614
autogluon.vision     : 0.4.3b20220614
autogluon_contrib_nlp: 0.0.1
boto3                : 1.24.7
catboost             : 1.0.6
dask                 : 2021.11.2
distributed          : 2021.11.2
fairscale            : 0.4.6
fastai               : 2.5.6
gluoncv              : 0.11.0
gluonts              : 0.9.4
hyperopt             : 0.2.7
lightgbm             : 3.3.2
matplotlib           : 3.5.2
mxnet                : 1.9.1
networkx             : 2.8.3
nlpaug               : 1.1.10
nltk                 : 3.7
nptyping             : 1.4.4
numpy                : 1.22.4
omegaconf            : 2.1.2
pandas               : 1.3.5
PIL                  : 9.0.1
protobuf             : None
psutil               : 5.8.0
pytorch_lightning    : 1.5.10
ray                  : 1.12.1
ray-lightning        : None
requests             : 2.28.0
scipy                : 1.7.3
sentencepiece        : None
setuptools           : 59.5.0
skimage              : 0.19.3
sklearn              : 1.0.2
smart_open           : 5.2.1
timm                 : 0.5.4
torch                : 1.10.2+cu102
torchmetrics         : 0.7.3
tqdm                 : 4.64.0
transformers         : 4.16.2
xgboost              : 1.4.2



Top GitHub Comments

2 reactions
sxjscience commented, Jun 15, 2022

Would you try adding 'env.strategy': 'ddp'? The key is defined at https://github.com/awslabs/autogluon/blob/1543041c1e924f2bdb697be64ab61cd839cb7838/text/src/autogluon/text/automm/configs/environment/default.yaml#L14

You can enable that by changing predictor.fit(train_df, hyperparameters={'env.num_gpus': -1}, time_limit=10) to

predictor.fit(train_df, hyperparameters={'env.num_gpus': -1, 'env.strategy': 'ddp'}, time_limit=10)
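
Presumably the same hyperparameter would also be passed to the student (distillation) fit from the script above; this is an extrapolation of the suggestion, not something stated in the comment:

    student_predictor.fit(
        train_df,
        config=config,
        hyperparameters={
            'env.num_gpus': -1,
            'env.strategy': 'ddp',  # assumed: same change as in the teacher fit
            'optimization.max_epochs': 5,
            'model.hf_text.checkpoint_name': 'huawei-noah/TinyBERT_General_4L_312D',
        },
        teacher_predictor=predictor,
        time_limit=10,
    )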

0 reactions
BaJiaoXiaoZi commented, Dec 14, 2022

Hello, I'm a beginner and I have a similar problem: when I run on multiple GPUs on Windows with 'env.strategy': 'ddp' and 'env.num_gpus': -1, the error is 'RuntimeError: Distributed package doesn't have NCCL built in'. If you know the answer, please teach me; thank you very much for your help.
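
For what it is worth, the Windows builds of PyTorch do not ship NCCL, so DDP there generally has to fall back to the gloo backend. A hedged sketch (a suggestion, not verified on this setup) of forcing gloo with the pytorch_lightning 1.5.x listed in the versions above:

    import os

    # Assumption: PL_TORCH_DISTRIBUTED_BACKEND is honored by pytorch_lightning
    # 1.5.x and switches the DDP process group from NCCL to gloo.
    # Set it before AutoGluon builds the Trainer.
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

    predictor.fit(
        train_df,
        hyperparameters={'env.num_gpus': -1, 'env.strategy': 'ddp'},
        time_limit=10,
    )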
