[CLI] pytorch-lightning and wandb - Abnormal program exit
Hello,
Description
I implemented a model with pytorch_lightning and `WandbLogger`, but it seems wandb makes the system crash before the start of training. The issue looks similar to this one: https://github.com/wandb/client/issues/1293
Wandb features
I am using `WandbLogger` from `pytorch_lightning.loggers`. I was not sure whether I should use only `WandbLogger`, or both `WandbLogger` and `wandb` (the Python library). I only did:

```python
from pytorch_lightning.loggers import WandbLogger
```

and removed:

```python
import wandb
```

I used to initialize the logger like this:
```python
wandb_logger = WandbLogger(project='april_sanbox',
                           config={
                               "machine": socket.gethostname(),
                               "env": os.environ['CONDA_DEFAULT_ENV'],
                               "lr": float(args.lr),
                               "epsilon": float(args.eps),
                               "epochs": args.epochs,
                               "model": '',
                           })
```
But I just kept `wandb_logger = WandbLogger(project='april_sanbox')` in order to be as close as possible to the examples given in:
https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch-lightning/Supercharge_your_Training_with_Pytorch_Lightning_%2B_Weights_%26_Biases.ipynb
https://colab.research.google.com/drive/16d1uctGaw2y9KhGBlINNTsWpmlXdJwRW?usp=sharing
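As a side note, `WandbLogger` forwards extra keyword arguments to `wandb.init`, so a config dictionary can be attached through the logger alone, without importing `wandb`. A minimal sketch (the `args` namespace is assumed to come from `argparse`, as in the report):

```python
import os
import socket

from pytorch_lightning.loggers import WandbLogger

# Extra keyword arguments such as `config` are passed through to wandb.init,
# so no separate `import wandb` / `wandb.init(...)` call is needed.
wandb_logger = WandbLogger(
    project='april_sanbox',
    config={
        "machine": socket.gethostname(),
        "env": os.environ['CONDA_DEFAULT_ENV'],  # assumes a conda environment
        "lr": float(args.lr),                    # `args` assumed from argparse
        "epsilon": float(args.eps),
        "epochs": args.epochs,
    },
)
```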
How to reproduce
The training loop:
```python
wandb_logger = WandbLogger(project='april_sanbox')
'''
config={
    "machine": socket.gethostname(),
    "env": os.environ['CONDA_DEFAULT_ENV'],
    "toy": int(args.toy),
    "doc_len": args.max_seq_len_doc,
    "parag_len": args.max_seq_len_parag,
    "parags_nbr": args.max_nbr_parags,
    "batch_size": args.bs,
    "lr": float(args.lr),
    "epsilon": float(args.eps),
    "epochs": args.epochs,
    "model": '',
})
wandb.init(project='april_sanbox',
           config={
               "machine": socket.gethostname(),
               "env": os.environ['CONDA_DEFAULT_ENV'],
               "toy": int(args.toy),
               "doc_len": args.max_seq_len_doc,
               "parag_len": args.max_seq_len_parag,
               "parags_nbr": args.max_nbr_parags,
               "batch_size": args.bs,
               "lr": float(args.lr),
               "epsilon": float(args.eps),
               "epochs": args.epochs,
               "model": '',
           })
'''
# [...]
tokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')
train_valid_test_data_module = QHLD_DataModule(df_train=df_train, df_valid=df_valid, df_test=df_test,
                                               tokenizer=tokenizer,
                                               batch_size=args.bs,
                                               num_workers=cpu_count() - 1)
# [...]
from aux_20F_train import VanillaCamembert
model = VanillaCamembert(nbr_targets=len(list_labels))
# wandb.config.update({"model": model.model_name}, allow_val_change=True)
# wandb.watch(model)
##########################################################
### TRAINING
trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=torch.cuda.device_count())
trainer.fit(model=model,
            datamodule=train_valid_test_data_module)
```
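One detail worth flagging in the snippet above (offered as context on Lightning 1.2.x defaults, not something stated in the report): with `gpus=2` and no `accelerator` argument, the trainer falls back to `ddp_spawn`, which re-imports the main module in every child process. A hypothetical sketch of making the choice explicit:

```python
# Hypothetical variant: request plain ddp instead of the implicit ddp_spawn
# default, so child processes are launched by re-running the script rather
# than by spawn-reimporting the main module (the failure mode shown in the
# traceback below).
trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=torch.cuda.device_count(),
                     accelerator="ddp")
```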
The model:
```python
class VanillaCamembert(pl.LightningModule):
    def __init__(self, nbr_targets):
        super().__init__()
        self.nbr_targets = nbr_targets
        self.model_name = 'vanilla_camembert'
        self.bert = CamembertForSequenceClassification.from_pretrained('camembert-base',
                                                                       num_labels=self.nbr_targets)
        print(self.bert.config)
        self.save_hyperparameters()

    def forward(self, batch_input_ids, batch_att_masks):
        out = self.bert(batch_input_ids, batch_att_masks)
        return out.logits

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-5)
        return optimizer

    def training_step(self, train_batch, batch_idx):
        batch_input_ids, batch_att_masks, _, _, _, y_true = train_batch
        y_pred = self(batch_input_ids, batch_att_masks)
        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
        self.log('train_loss', loss, on_epoch=True)
        return loss

    def validation_step(self, valid_batch, batch_idx):
        batch_input_ids, batch_att_masks, _, _, _, y_true = valid_batch
        y_pred = self(batch_input_ids, batch_att_masks)
        loss = F.binary_cross_entropy_with_logits(y_pred, y_true.float())
        self.log('val_loss', loss)
```
Environment
- OS: Debian GNU/Linux 9 (stretch); Kernel: Linux 4.9.0-14-amd64; Architecture: x86-64
- Environment: Anaconda (running the script in a terminal)
```
# packages in environment at /u/salaunol/anaconda3/envs/h21_cuda101:
#
# Name  Version  Build  Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
absl-py 0.12.0 pyhd8ed1ab_0 conda-forge
aiohttp 3.7.4.post0 pypi_0 pypi
argon2-cffi 20.1.0 py38h25fe258_2 conda-forge
async-timeout 3.0.1 py_1000 conda-forge
async_generator 1.10 py_0 conda-forge
attrs 20.3.0 pyhd3deb0d_0 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.1 py_0 conda-forge
blas 1.0 mkl
bleach 3.3.0 pyh44b312d_0 conda-forge
blinker 1.4 py_1 conda-forge
brotlipy 0.7.0 py38h27cfd23_1003
c-ares 1.17.1 h7f98852_1 conda-forge
ca-certificates 2021.1.19 h06a4308_1
cachetools 4.2.1 pyhd8ed1ab_0 conda-forge
certifi 2020.12.5 py38h06a4308_0
cffi 1.14.5 py38h261ae71_0
chardet 4.0.0 py38h06a4308_1003
click 7.1.2 pyhd3eb1b0_0
configparser 5.0.2 pypi_0 pypi
cryptography 3.4.6 py38hd23ed53_0
cudatoolkit 10.1.243 h6bb024c_0
dataclasses 0.8 pyh6d0b6a4_7
decorator 4.4.2 py_0 conda-forge
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
docker-pycreds 0.4.0 pypi_0 pypi
emoji 1.2.0 pypi_0 pypi
entrypoints 0.3 pyhd8ed1ab_1003 conda-forge
filelock 3.0.12 pyhd3eb1b0_1
freetype 2.10.4 h5ab3b9f_0
fsspec 0.8.7 pyhd8ed1ab_0 conda-forge
future 0.18.2 py38h578d9bd_3 conda-forge
gitdb 4.0.5 pypi_0 pypi
gitpython 3.1.14 pypi_0 pypi
google-auth 1.28.0 pyh44b312d_0 conda-forge
google-auth-oauthlib 0.4.4 pypi_0 pypi
grpcio 1.36.1 pypi_0 pypi
icu 58.2 hf484d3e_1000 conda-forge
idna 2.10 pyhd3eb1b0_0
importlib-metadata 3.7.3 py38h578d9bd_0 conda-forge
intel-openmp 2020.2 254
ipykernel 5.5.0 py38h81c977d_1 conda-forge
ipython 7.21.0 py38h81c977d_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.18.0 py38h578d9bd_2 conda-forge
jinja2 2.11.3 pyh44b312d_0 conda-forge
joblib 1.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
jsonschema 3.2.0 pyhd8ed1ab_3 conda-forge
jupyter_client 6.1.12 pyhd8ed1ab_0 conda-forge
jupyter_contrib_core 0.3.3 py_2 conda-forge
jupyter_contrib_nbextensions 0.5.1 pyhd8ed1ab_2 conda-forge
jupyter_core 4.7.1 py38h578d9bd_0 conda-forge
jupyter_highlight_selected_word 0.2.0 py38h578d9bd_1002 conda-forge
jupyter_latex_envs 1.4.6 pyhd8ed1ab_1002 conda-forge
jupyter_nbextensions_configurator 0.4.1 py38h578d9bd_2 conda-forge
jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 7.3.0 hdf63c60_0
libgomp 9.3.0 h2828fa1_18 conda-forge
libpng 1.6.37 hbc83047_0
libprotobuf 3.14.0 h8c45485_0
libsodium 1.0.18 h36c2ea0_1 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.2.0 h3942068_0
libuv 1.40.0 h7b6447c_0
libwebp-base 1.2.0 h27cfd23_0
libxml2 2.9.10 hb55368b_3
libxslt 1.1.34 hc22bd24_0
lxml 4.6.2 py38h9120a33_0
lz4-c 1.9.3 h2531618_0
markdown 3.3.4 pyhd8ed1ab_0 conda-forge
markupsafe 1.1.1 py38h8df0ef7_2 conda-forge
mistune 0.8.4 py38h25fe258_1002 conda-forge
mkl 2020.2 256
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.3.0 py38h54f3939_0
mkl_random 1.1.1 py38h0573a6f_0
multidict 5.1.0 py38h497a2fe_1 conda-forge
nbclient 0.5.3 pyhd8ed1ab_0 conda-forge
nbconvert 6.0.7 py38h578d9bd_3 conda-forge
nbformat 5.1.2 pyhd8ed1ab_1 conda-forge
ncurses 6.2 he6710b0_1
nest-asyncio 1.4.3 pyhd8ed1ab_0 conda-forge
ninja 1.10.2 py38hff7bd54_0
notebook 6.2.0 py38h578d9bd_0 conda-forge
numpy 1.19.2 py38h54aff64_0
numpy-base 1.19.2 py38hfa32c7d_0
oauthlib 3.1.0 pypi_0 pypi
olefile 0.46 py_0
openssl 1.1.1k h27cfd23_0
packaging 20.9 pyhd3eb1b0_0
pandas 1.2.3 py38ha9443f7_0
pandoc 2.12 h7f98852_0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parso 0.8.1 pyhd8ed1ab_0 conda-forge
pathtools 0.1.2 pypi_0 pypi
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 8.1.2 py38he98fc37_0
pip 21.0.1 py38h06a4308_0
prometheus_client 0.9.0 pyhd3deb0d_0 conda-forge
promise 2.3 pypi_0 pypi
prompt-toolkit 3.0.17 pyha770c72_0 conda-forge
protobuf 3.14.0 py38h2531618_1
psutil 5.8.0 pypi_0 pypi
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.8 pypi_0 pypi
pycparser 2.20 py_2
pygments 2.8.1 pyhd8ed1ab_0 conda-forge
pyjwt 2.0.1 pyhd8ed1ab_1 conda-forge
pyopenssl 20.0.1 pyhd3eb1b0_1
pyparsing 2.4.7 pyhd3eb1b0_0
pyrsistent 0.17.3 py38h25fe258_1 conda-forge
pysocks 1.7.1 py38h06a4308_0
python 3.8.8 hdb3f193_4
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 huggingface
pytorch 1.7.1 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch
pytorch-lightning 1.2.6 pyhd8ed1ab_0 conda-forge
pytz 2021.1 pyhd3eb1b0_0
pyyaml 5.3.1 pypi_0 pypi
pyzmq 19.0.2 py38ha71036d_2 conda-forge
readline 8.1 h27cfd23_0
regex 2020.11.13 py38h27cfd23_0
requests 2.25.1 pyhd3eb1b0_0
requests-oauthlib 1.3.0 pyh9f0ad1d_0 conda-forge
rsa 4.7.2 pyh44b312d_0 conda-forge
sacremoses master py_0 huggingface
scikit-learn 0.24.1 py38ha9443f7_0
scipy 1.6.2 py38h91f5cce_0
send2trash 1.5.0 py_0 conda-forge
sentry-sdk 1.0.0 pypi_0 pypi
setuptools 52.0.0 py38h06a4308_0
shortuuid 1.0.1 pypi_0 pypi
six 1.15.0 py38h06a4308_0
smmap 3.0.5 pypi_0 pypi
sqlite 3.35.1 hdfb4753_0
subprocess32 3.5.4 pypi_0 pypi
tensorboard 2.4.1 pyhd8ed1ab_0 conda-forge
tensorboard-plugin-wit 1.8.0 pyh44b312d_0 conda-forge
terminado 0.9.2 py38h578d9bd_0 conda-forge
testpath 0.4.4 py_0 conda-forge
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
tokenizers 0.10.1 py38_0 huggingface
torchaudio 0.7.2 py38 pytorch
torchmetrics 0.2.0 pyhd8ed1ab_0 conda-forge
torchvision 0.8.2 py38_cu101 pytorch
tornado 6.1 py38h25fe258_0 conda-forge
tqdm 4.59.0 pyhd3eb1b0_1
traitlets 5.0.5 py_0 conda-forge
transformers 4.4.2 py_0 huggingface
typing-extensions 3.7.4.3 0 conda-forge
typing_extensions 3.7.4.3 py_0 conda-forge
urllib3 1.26.4 pyhd3eb1b0_0
wandb 0.10.25 pypi_0 pypi
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 py_1 conda-forge
werkzeug 1.0.1 pyh9f0ad1d_0 conda-forge
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h516909a_0 conda-forge
yarl 1.6.3 py38h497a2fe_1 conda-forge
zeromq 4.3.3 he6710b0_3
zipp 3.4.1 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
```
- Python Version: Python 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0] :: Anaconda, Inc. on linux
Terminal output
```
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: osalaun (use `wandb login --relogin` to force relogin)
Problem at: /u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py 155 experiment
Traceback (most recent call last):
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 742, in init
run = wi.init()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 421, in init
backend.ensure_launched()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 125, in ensure_launched
self.wandb_process.start()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 742, in init
run = wi.init()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 421, in init
backend.ensure_launched()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 125, in ensure_launched
self.wandb_process.start()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/u/salaunol/Documents/_2021_hiver/QHLD/20F_train.py", line 218, in <module>
trainer.fit(model=model,
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
self.pre_dispatch()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 530, in pre_dispatch
self.logger.log_hyperparams(self.lightning_module.hparams_initial)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 184, in log_hyperparams
self.experiment.config.update(params, allow_val_change=True)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 155, in experiment
self._experiment = wandb.init(
File "/u/salaunol/anaconda3/envs/h21_cuda101/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 779, in init
six.raise_from(Exception("problem"), error_seen)
File "<string>", line 3, in raise_from
Exception: problem
```
Edit: added terminal output
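For context, the RuntimeError above is Python's standard complaint when a process started with the `spawn` method re-imports a main module that launches work at import time. A minimal sketch of the guard suggested by the error message itself, assuming a script laid out like 20F_train.py (illustrative, not the reporter's actual file):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


def main():
    # Build the logger, data module, model, and trainer here instead of at
    # module top level, so a spawn re-import of this file does nothing.
    wandb_logger = WandbLogger(project='april_sanbox')
    trainer = pl.Trainer(logger=wandb_logger, max_epochs=1, gpus=1)  # assumed values
    # trainer.fit(model=model, datamodule=train_valid_test_data_module)


if __name__ == '__main__':
    main()
```

A related workaround mentioned on similar wandb threads is to change how the wandb backend starts its internal process, e.g. `wandb.init(settings=wandb.Settings(start_method="fork"))`, provided the installed wandb version supports the `start_method` setting.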
@borisdayma I have not updated `transformers` yet, but I think I found the origin of the problem. When initializing the Lightning `Trainer`, I passed `gpus=torch.cuda.device_count()` (as shown in the training loop above). Although the machine has two GPUs, only one is actually suitable for training (the other one is just for the monitor). Still, `torch.cuda.device_count()` returns `2`. If I manually set `gpus=1`, then the problem is solved.
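A minimal sketch of that fix, using the same names as the reproduction above; the `CUDA_VISIBLE_DEVICES` variant is a hypothetical alternative, not something the reporter stated:

```python
# Train on a single GPU so Lightning does not pick a multi-process backend.
trainer = pl.Trainer(logger=wandb_logger,
                     max_epochs=args.epochs,
                     gpus=1)

# Hypothetical alternative: expose only the training GPU to the process,
# e.g. `CUDA_VISIBLE_DEVICES=0 python 20F_train.py`, so that
# torch.cuda.device_count() itself returns 1.
```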
Edit: @vanpelt I already double-checked `wandb.init`; there was none of that.

Hey @oliviersalaun,
Thanks for following up with the thread and helping with the solution. Would you like to close the thread if the issue has been resolved?