
[CLI] pytorch-lightning and wandb - Abnormal program exit

See original GitHub issue

Hello,

Description: I implemented a model with pytorch_lightning and WandbLogger, but it seems wandb makes the system crash before the start of training. The issue looks similar to this: https://github.com/wandb/client/issues/1293

Wandb features

I am using WandbLogger from pytorch_lightning.loggers. I was not sure whether I should use only WandbLogger or both WandbLogger and wandb (the Python library), so I only kept from pytorch_lightning.loggers import WandbLogger and removed import wandb.

I used to initialize the logger like this:

wandb_logger = WandbLogger(project='april_sanbox',
                           config={
                               "machine": socket.gethostname(),
                               "env": os.environ['CONDA_DEFAULT_ENV'],
                               "lr": float(args.lr),
                               "epsilon": float(args.eps),
                               "epochs": args.epochs,
                               "model": '',
                           })

But I just kept wandb_logger = WandbLogger(project='april_sanbox') in order to stay as close as possible to the examples given in:
https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch-lightning/Supercharge_your_Training_with_Pytorch_Lightning_%2B_Weights_%26_Biases.ipynb
https://colab.research.google.com/drive/16d1uctGaw2y9KhGBlINNTsWpmlXdJwRW?usp=sharing
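As an editorial aside, a minimal runnable sketch of the config-dict approach is shown below. The CLI arguments are a hypothetical stand-in for the script's own argparse results, and the sketch assumes (per the Lightning docs) that WandbLogger forwards extra keyword arguments such as config through to wandb.init(), so no separate wandb.init() call is needed:

```python
import os
import socket
from argparse import Namespace

# Hypothetical stand-in for the script's parsed CLI arguments.
args = Namespace(lr='1e-5', eps='1e-8', epochs=3)

# Build the run config once, then hand it to the logger.
config = {
    "machine": socket.gethostname(),
    "env": os.environ.get('CONDA_DEFAULT_ENV', ''),
    "lr": float(args.lr),
    "epsilon": float(args.eps),
    "epochs": args.epochs,
    "model": '',
}

# WandbLogger passes unrecognized keyword arguments through to
# wandb.init(), so the dict can go straight into the logger:
# from pytorch_lightning.loggers import WandbLogger
# wandb_logger = WandbLogger(project='april_sanbox', config=config)
```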

How to reproduce

The training loop :

wandb_logger = WandbLogger(project='april_sanbox')
'''                           
                           config={
                               "machine" : socket.gethostname(),
                               "env" : os.environ['CONDA_DEFAULT_ENV'],
                               "toy" : int(args.toy),

                               "doc_len" : args.max_seq_len_doc,
                               "parag_len" : args.max_seq_len_parag,
                               "parags_nbr" : args.max_nbr_parags,
                               "batch_size": args.bs,

                               "lr" : float(args.lr),
                               "epsilon" : float(args.eps),
                               "epochs": args.epochs,
                               "model" : '',

                           })

wandb.init(project= 'april_sanbox',
           config={
               "machine" : socket.gethostname(),
               "env" : os.environ['CONDA_DEFAULT_ENV'],
               "toy" : int(args.toy),

               "doc_len" : args.max_seq_len_doc,
               "parag_len" : args.max_seq_len_parag,
               "parags_nbr" : args.max_nbr_parags,
               "batch_size": args.bs,

               "lr" : float(args.lr),
               "epsilon" : float(args.eps),
               "epochs": args.epochs,
               "model" : '',

           }
           )
'''

# [...]


tokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')

train_valid_test_data_module = QHLD_DataModule(df_train=df_train, df_valid=df_valid, df_test=df_test,
                                               tokenizer=tokenizer,
                                               batch_size=args.bs,
                                               num_workers=cpu_count()-1)



# [...]

from aux_20F_train import VanillaCamembert
model = VanillaCamembert(nbr_targets=len(list_labels))

#wandb.config.update({"model": model.model_name}, allow_val_change=True)
#wandb.watch(model)

##########################################################
### TRAINING

trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=torch.cuda.device_count())
trainer.fit(model=model,
            datamodule=train_valid_test_data_module)

The model:

class VanillaCamembert(pl.LightningModule):
    def __init__(self, nbr_targets):
        super().__init__()
        self.nbr_targets = nbr_targets
        self.model_name = 'vanilla_camembert'

        self.bert = CamembertForSequenceClassification.from_pretrained('camembert-base',
                                                                       num_labels=self.nbr_targets)
        print(self.bert.config)


        self.save_hyperparameters()


    def forward(self, batch_input_ids, batch_att_masks):
        out = self.bert(batch_input_ids, batch_att_masks)

        return out.logits

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-5) 
        return optimizer

    def training_step(self, train_batch, batch_idx):
        batch_input_ids, batch_att_masks, _, _, _, y_true = train_batch

        y_pred = self(batch_input_ids, batch_att_masks)

        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
        self.log('train_loss', loss, on_epoch=True)

        return loss


    def validation_step(self, valid_batch, batch_idx):
        batch_input_ids, batch_att_masks, _, _, _, y_true = valid_batch

        y_pred = self(batch_input_ids, batch_att_masks)

        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
        self.log('val_loss', loss)

Environment

  • OS: Debian GNU/Linux 9 (stretch), kernel Linux 4.9.0-14-amd64, x86-64

  • Environment: Anaconda (running script in terminal) # packages in environment at /u/salaunol/anaconda3/envs/h21_cuda101: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 1_gnu conda-forge absl-py 0.12.0 pyhd8ed1ab_0 conda-forge aiohttp 3.7.4.post0 pypi_0 pypi argon2-cffi 20.1.0 py38h25fe258_2 conda-forge async-timeout 3.0.1 py_1000 conda-forge async_generator 1.10 py_0 conda-forge attrs 20.3.0 pyhd3deb0d_0 conda-forge backcall 0.2.0 pyh9f0ad1d_0 conda-forge backports 1.0 py_2 conda-forge backports.functools_lru_cache 1.6.1 py_0 conda-forge blas 1.0 mkl
    bleach 3.3.0 pyh44b312d_0 conda-forge blinker 1.4 py_1 conda-forge brotlipy 0.7.0 py38h27cfd23_1003
    c-ares 1.17.1 h7f98852_1 conda-forge ca-certificates 2021.1.19 h06a4308_1
    cachetools 4.2.1 pyhd8ed1ab_0 conda-forge certifi 2020.12.5 py38h06a4308_0
    cffi 1.14.5 py38h261ae71_0
    chardet 4.0.0 py38h06a4308_1003
    click 7.1.2 pyhd3eb1b0_0
    configparser 5.0.2 pypi_0 pypi cryptography 3.4.6 py38hd23ed53_0
    cudatoolkit 10.1.243 h6bb024c_0
    dataclasses 0.8 pyh6d0b6a4_7
    decorator 4.4.2 py_0 conda-forge defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge docker-pycreds 0.4.0 pypi_0 pypi emoji 1.2.0 pypi_0 pypi entrypoints 0.3 pyhd8ed1ab_1003 conda-forge filelock 3.0.12 pyhd3eb1b0_1
    freetype 2.10.4 h5ab3b9f_0
    fsspec 0.8.7 pyhd8ed1ab_0 conda-forge future 0.18.2 py38h578d9bd_3 conda-forge gitdb 4.0.5 pypi_0 pypi gitpython 3.1.14 pypi_0 pypi google-auth 1.28.0 pyh44b312d_0 conda-forge google-auth-oauthlib 0.4.4 pypi_0 pypi grpcio 1.36.1 pypi_0 pypi icu 58.2 hf484d3e_1000 conda-forge idna 2.10 pyhd3eb1b0_0
    importlib-metadata 3.7.3 py38h578d9bd_0 conda-forge intel-openmp 2020.2 254
    ipykernel 5.5.0 py38h81c977d_1 conda-forge ipython 7.21.0 py38h81c977d_0 conda-forge ipython_genutils 0.2.0 py_1 conda-forge jedi 0.18.0 py38h578d9bd_2 conda-forge jinja2 2.11.3 pyh44b312d_0 conda-forge joblib 1.0.1 pyhd3eb1b0_0
    jpeg 9b h024ee3a_2
    jsonschema 3.2.0 pyhd8ed1ab_3 conda-forge jupyter_client 6.1.12 pyhd8ed1ab_0 conda-forge jupyter_contrib_core 0.3.3 py_2 conda-forge jupyter_contrib_nbextensions 0.5.1 pyhd8ed1ab_2 conda-forge jupyter_core 4.7.1 py38h578d9bd_0 conda-forge jupyter_highlight_selected_word 0.2.0 py38h578d9bd_1002 conda-forge jupyter_latex_envs 1.4.6 pyhd8ed1ab_1002 conda-forge jupyter_nbextensions_configurator 0.4.1 py38h578d9bd_2 conda-forge jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge lcms2 2.11 h396b838_0
    ld_impl_linux-64 2.33.1 h53a641e_7
    libffi 3.3 he6710b0_2
    libgcc-ng 9.3.0 h2828fa1_18 conda-forge libgfortran-ng 7.3.0 hdf63c60_0
    libgomp 9.3.0 h2828fa1_18 conda-forge libpng 1.6.37 hbc83047_0
    libprotobuf 3.14.0 h8c45485_0
    libsodium 1.0.18 h36c2ea0_1 conda-forge libstdcxx-ng 9.1.0 hdf63c60_0
    libtiff 4.2.0 h3942068_0
    libuv 1.40.0 h7b6447c_0
    libwebp-base 1.2.0 h27cfd23_0
    libxml2 2.9.10 hb55368b_3
    libxslt 1.1.34 hc22bd24_0
    lxml 4.6.2 py38h9120a33_0
    lz4-c 1.9.3 h2531618_0
    markdown 3.3.4 pyhd8ed1ab_0 conda-forge markupsafe 1.1.1 py38h8df0ef7_2 conda-forge mistune 0.8.4 py38h25fe258_1002 conda-forge mkl 2020.2 256
    mkl-service 2.3.0 py38he904b0f_0
    mkl_fft 1.3.0 py38h54f3939_0
    mkl_random 1.1.1 py38h0573a6f_0
    multidict 5.1.0 py38h497a2fe_1 conda-forge nbclient 0.5.3 pyhd8ed1ab_0 conda-forge nbconvert 6.0.7 py38h578d9bd_3 conda-forge nbformat 5.1.2 pyhd8ed1ab_1 conda-forge ncurses 6.2 he6710b0_1
    nest-asyncio 1.4.3 pyhd8ed1ab_0 conda-forge ninja 1.10.2 py38hff7bd54_0
    notebook 6.2.0 py38h578d9bd_0 conda-forge numpy 1.19.2 py38h54aff64_0
    numpy-base 1.19.2 py38hfa32c7d_0
    oauthlib 3.1.0 pypi_0 pypi olefile 0.46 py_0
    openssl 1.1.1k h27cfd23_0
    packaging 20.9 pyhd3eb1b0_0
    pandas 1.2.3 py38ha9443f7_0
    pandoc 2.12 h7f98852_0 conda-forge pandocfilters 1.4.2 py_1 conda-forge parso 0.8.1 pyhd8ed1ab_0 conda-forge pathtools 0.1.2 pypi_0 pypi pexpect 4.8.0 pyh9f0ad1d_2 conda-forge pickleshare 0.7.5 py_1003 conda-forge pillow 8.1.2 py38he98fc37_0
    pip 21.0.1 py38h06a4308_0
    prometheus_client 0.9.0 pyhd3deb0d_0 conda-forge promise 2.3 pypi_0 pypi prompt-toolkit 3.0.17 pyha770c72_0 conda-forge protobuf 3.14.0 py38h2531618_1
    psutil 5.8.0 pypi_0 pypi ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge pyasn1 0.4.8 py_0 conda-forge pyasn1-modules 0.2.8 pypi_0 pypi pycparser 2.20 py_2
    pygments 2.8.1 pyhd8ed1ab_0 conda-forge pyjwt 2.0.1 pyhd8ed1ab_1 conda-forge pyopenssl 20.0.1 pyhd3eb1b0_1
    pyparsing 2.4.7 pyhd3eb1b0_0
    pyrsistent 0.17.3 py38h25fe258_1 conda-forge pysocks 1.7.1 py38h06a4308_0
    python 3.8.8 hdb3f193_4
    python-dateutil 2.8.1 py_0 conda-forge python_abi 3.8 1_cp38 huggingface pytorch 1.7.1 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch pytorch-lightning 1.2.6 pyhd8ed1ab_0 conda-forge pytz 2021.1 pyhd3eb1b0_0
    pyyaml 5.3.1 pypi_0 pypi pyzmq 19.0.2 py38ha71036d_2 conda-forge readline 8.1 h27cfd23_0
    regex 2020.11.13 py38h27cfd23_0
    requests 2.25.1 pyhd3eb1b0_0
    requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge rsa 4.7.2 pyh44b312d_0 conda-forge sacremoses master py_0 huggingface scikit-learn 0.24.1 py38ha9443f7_0
    scipy 1.6.2 py38h91f5cce_0
    send2trash 1.5.0 py_0 conda-forge sentry-sdk 1.0.0 pypi_0 pypi setuptools 52.0.0 py38h06a4308_0
    shortuuid 1.0.1 pypi_0 pypi six 1.15.0 py38h06a4308_0
    smmap 3.0.5 pypi_0 pypi sqlite 3.35.1 hdfb4753_0
    subprocess32 3.5.4 pypi_0 pypi tensorboard 2.4.1 pyhd8ed1ab_0 conda-forge tensorboard-plugin-wit 1.8.0 pyh44b312d_0 conda-forge terminado 0.9.2 py38h578d9bd_0 conda-forge testpath 0.4.4 py_0 conda-forge threadpoolctl 2.1.0 pyh5ca1d4c_0
    tk 8.6.10 hbc83047_0
    tokenizers 0.10.1 py38_0 huggingface torchaudio 0.7.2 py38 pytorch torchmetrics 0.2.0 pyhd8ed1ab_0 conda-forge torchvision 0.8.2 py38_cu101 pytorch tornado 6.1 py38h25fe258_0 conda-forge tqdm 4.59.0 pyhd3eb1b0_1
    traitlets 5.0.5 py_0 conda-forge transformers 4.4.2 py_0 huggingface typing-extensions 3.7.4.3 0 conda-forge typing_extensions 3.7.4.3 py_0 conda-forge urllib3 1.26.4 pyhd3eb1b0_0
    wandb 0.10.25 pypi_0 pypi wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge webencodings 0.5.1 py_1 conda-forge werkzeug 1.0.1 pyh9f0ad1d_0 conda-forge wheel 0.36.2 pyhd3eb1b0_0
    xz 5.2.5 h7b6447c_0
    yaml 0.2.5 h516909a_0 conda-forge yarl 1.6.3 py38h497a2fe_1 conda-forge zeromq 4.3.3 he6710b0_3
    zipp 3.4.1 pyhd8ed1ab_0 conda-forge zlib 1.2.11 h7b6447c_3
    zstd 1.4.5 h9ceee32_0

  • Python Version: Python 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0] :: Anaconda, Inc. on linux

Terminal output

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]                                                                                                                                          
wandb: Currently logged in as: osalaun (use `wandb login --relogin` to force relogin)                                                                                                
Problem at: /u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py 155 experiment                                                     
Traceback (most recent call last):                       
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 742, in init                                                               
    run = wi.init()                                                                                                                                                                  
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 421, in init                                                               
    backend.ensure_launched()                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 125, in ensure_launched                                               
    self.wandb_process.start()                                                                                                                                                       
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/process.py", line 121, in start                                                                         
    self._popen = self._Popen(self)                                                                                                                                                  
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/context.py", line 284, in _Popen                                                                        
    return Popen(process_obj)                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__                                                             
    super().__init__(process_obj)                                                                                                                                                    
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__                                                                    
    self._launch(process_obj)                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch                                                              
    prep_data = spawn.get_preparation_data(process_obj._name)                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data                                                            
    _check_not_importing_main()                                                                                                                                                      
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main                                                       
    raise RuntimeError('''                                                                                                                                                           
RuntimeError:                                                                                                                                                                        
        An attempt has been made to start a new process before the                                                                                                                   
        current process has finished its bootstrapping phase.                                                                                                                        
                                                                                                                                                                                     
        This probably means that you are not using fork to start your                                                                                                                
        child processes and you have forgotten to use the proper idiom                                                                                                               
        in the main module:                                                                                                                                                          
                                                                                                                                                                                     
            if __name__ == '__main__':                                                                                                                                               
                freeze_support()                                                                                                                                                     
                ...                                                                                                                                                                  
                                                                                                                                                                                     
        The "freeze_support()" line can be omitted if the program                                                                                                                    
        is not going to be frozen to produce an executable.                                                                                                                          
wandb: ERROR Abnormal program exit                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                                   
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 742, in init                                                               
    run = wi.init()                                                                                                                                                                  
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 421, in init                                                               
    backend.ensure_launched()                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 125, in ensure_launched                                               
    self.wandb_process.start()                                                                                                                                                       
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/process.py", line 121, in start                                                                         
    self._popen = self._Popen(self)                                                                                                                                                  
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/context.py", line 284, in _Popen                                                                        
    return Popen(process_obj)                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__                                                             
    super().__init__(process_obj)                                                                                                                                                    
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__                                                                    
    self._launch(process_obj)                                                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch                                                              
    prep_data = spawn.get_preparation_data(process_obj._name)                                                                                                                        
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data                                                            
    _check_not_importing_main()                                                                                                                                                      
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/u/salaunol/Documents/_2021_hiver/QHLD/20F_train.py", line 218, in <module>
    trainer.fit(model=model,
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 530, in pre_dispatch
    self.logger.log_hyperparams(self.lightning_module.hparams_initial)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 184, in log_hyperparams
    self.experiment.config.update(params, allow_val_change=True)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
    return get_experiment() or DummyExperiment()
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
    return fn(self)
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 155, in experiment
    self._experiment = wandb.init(
  File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 779, in init
    six.raise_from(Exception("problem"), error_seen)
  File "<string>", line 3, in raise_from
Exception: problem

Edit: added terminal output
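The RuntimeError in the traceback above points at the standard remedy for spawn-based multiprocessing: wandb (and any spawn-started DataLoader workers) re-import the main module in a child process, so the training entry point must sit behind a main guard. A minimal sketch of that script layout (the function name and its body are illustrative placeholders for the training code shown earlier):

```python
def main():
    # Build the logger, data module, model and trainer here, then:
    # trainer.fit(model=model, datamodule=train_valid_test_data_module)
    pass  # placeholder for the training code shown above

if __name__ == '__main__':
    # Only the parent process enters here; child processes that
    # re-import this module will not re-run the training code.
    main()
```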

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
oliviersalaun commented, Apr 10, 2021

@borisdayma I have not updated transformers yet, but I think I found the origin of the problem. When initializing the Lightning Trainer, I passed the following arguments:

trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=torch.cuda.device_count())

Although the machine has two GPUs, only one is actually suitable for training (the other one is just for the monitor). Still, torch.cuda.device_count() returns 2. If I manually set:

trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=1)

then the problem is solved.

Edit: @vanpelt I already double-checked for stray wandb.init calls; there were none.
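As a side note on this fix: an alternative to hard-coding gpus=1 is to hide the display GPU from PyTorch before CUDA is initialized, so that torch.cuda.device_count() only sees the training card. A minimal sketch (the device index 0 is an assumption about which card is the training GPU; the variable must be set before the first CUDA call):

```python
import os

# Expose only the first GPU to this process; torch.cuda.device_count()
# will then report 1, so gpus=torch.cuda.device_count() stays correct.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
```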

0 reactions
ariG23498 commented, May 3, 2021

Hey @oliviersalaun

Thanks for following up on the thread and sharing the solution. Feel free to close the thread if the issue has been resolved.
