WandbLogger causes the program to crash without an error
🐛 Bug
In my simple setup, I have been using TensorBoardLogger and other loggers, and everything worked perfectly. However, when I try to use WandbLogger, the program crashes without any error. It just prints a dictionary and stops.
To Reproduce
{'_identity': (1,), '_config': {'authkey': b'i\x01FT\x8b\xe0\xf6e}\xe7\xce\xe7\xa1\xb5\x9a\x9bF2A?\x95\x95\xe1\x85I\x82a\xa4\xef4\xc3=', 'semprefix': '/mp'}, '_parent_pid': 4007964, '_parent_name': 'MainProcess', '_popen': None, '_closed': False, '_target': <function wandb_internal at 0x7f4792fd4550>, '_args': (), '_kwargs': {'settings': {'_args': [], '_cli_only_mode': None, '_colab': False, '_config_dict': None, '_console': <SettingsConsole.REDIRECT: 2>, '_cuda': None, '_disable_meta': None, '_disable_stats': None, '_disable_viewer': None, '_except_exit': None, '_executable': '/home/mazen/miniconda3/envs/jdt/bin/python', '_internal_check_process': 8, '_internal_queue_timeout': 2, '_jupyter': False, '_jupyter_name': None, '_jupyter_path': None, '_jupyter_root': None, '_kaggle': False, '_noop': False, '_offline': False, '_os': 'Linux-5.13.0-28-generic-x86_64-with-glibc2.31', '_platform': 'linux', '_python': '3.9.7', '_require_service': None, '_runqueue_item_id': None, '_save_requirements': True, '_service_transport': None, '_start_datetime': datetime.datetime(2022, 3, 25, 19, 17, 2, 616686), '_start_time': 1648261022.616686, '_tmp_code_dir': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/tmp/code', '_tracelog': None, '_unsaved_keys': None, '_windows': False, 'allow_val_change': None, 'anonymous': None, 'api_key': None, 'base_url': 'https://api.wandb.ai', 'code_dir': None, 'config_paths': None, 'console': 'auto', 'deployment': 'cloud', 'disable_code': None, 'disable_git': False, 'disabled': False, 'docker': None, 'email': 'mazen.ota@gmail.com', 'entity': None, 'files_dir': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/files', 'force': None, 'git_remote': 'origin', 'heartbeat_seconds': 30, 'host': 'mazen-HP-Z640-Workstation', 'ignore_globs': (), 'is_local': False, 'label_disable': None, 'launch': None, 'launch_config_path': None, 'log_dir': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/logs', 
'log_internal': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/logs/debug-internal.log', 'log_symlink_internal': '/home/mazen/Projects/pl_jdt/output/wandb/debug-internal.log', 'log_symlink_user': '/home/mazen/Projects/pl_jdt/output/wandb/debug.log', 'log_user': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/logs/debug.log', 'login_timeout': None, 'magic': None, 'mode': 'online', 'notebook_name': None, 'problem': 'fatal', 'program': '/home/mazen/Projects/pl_jdt/scripts/train.py', 'program_relpath': 'scripts/train.py', 'project': 'mnist_training_test', 'project_url': '', 'quiet': None, 'reinit': None, 'relogin': None, 'resume': 'allow', 'resume_fname': '/home/mazen/Projects/pl_jdt/output/wandb/wandb-resume.json', 'resumed': False, 'root_dir': '/home/mazen/Projects/pl_jdt/output', 'run_group': None, 'run_id': '1wdniopy', 'run_job_type': None, 'run_mode': 'run', 'run_name': None, 'run_notes': None, 'run_tags': None, 'run_url': '', 'sagemaker_disable': None, 'save_code': True, 'settings_system': '/home/mazen/.config/wandb/settings', 'settings_workspace': '/home/mazen/Projects/pl_jdt/output/wandb/settings', 'show_colors': None, 'show_emoji': None, 'show_errors': True, 'show_info': True, 'show_warnings': True, 'silent': False, 'start_method': None, 'strict': None, 'summary_errors': None, 'summary_warnings': 5, 'sweep_id': None, 'sweep_param_path': None, 'sweep_url': '', 'symlink': True, 'sync_dir': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy', 'sync_file': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/run-1wdniopy.wandb', 'sync_symlink_latest': '/home/mazen/Projects/pl_jdt/output/wandb/latest-run', 'system_sample': 15, 'system_sample_seconds': 2, 'timespec': '20220325_191702', 'tmp_dir': '/home/mazen/Projects/pl_jdt/output/wandb/run-20220325_191702-1wdniopy/tmp', 'username': 'mazen', 'wandb_dir': '/home/mazen/Projects/pl_jdt/output/wandb/', '_log_level': 10}, 'record_q': 
<multiprocessing.queues.Queue object at 0x7f478fceb430>, 'result_q': <multiprocessing.queues.Queue object at 0x7f478fcfb100>, 'user_pid': 4007964}, '_name': 'wandb_internal'} <_io.BytesIO object at 0x7f478f477e50>
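The keys in the printed dictionary (_identity, _config, _parent_pid, _popen, _target, _args, _kwargs, _name) resemble the instance attributes of a multiprocessing.Process object rather than anything a logger would print on purpose. The sketch below is one way to compare them. It assumes CPython, where these private attribute names are an implementation detail that may differ between versions. The _noop target is hypothetical and the process is never started.

```python
import multiprocessing as mp


def _noop():
    """Hypothetical placeholder target; the process is never started here."""


# Construct (but do not start) a process, reusing the name seen in the dump.
proc = mp.Process(target=_noop, name="wandb_internal")

# vars() exposes the instance attributes of the Process object.
# These are private CPython details, shown only for comparison with the dump.
for key in sorted(vars(proc)):
    print(key)
```

If the keys line up, the output most likely comes from code that prints a Process object's attributes somewhere along the process-launch path.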
Expected behavior
I expect WandbLogger to work the same way the other loggers do.
Environment
* CUDA:
- GPU:
- NVIDIA TITAN X (Pascal)
- NVIDIA TITAN X (Pascal)
- available: True
- version: 11.3
* Packages:
- numpy: 1.21.2
- pyTorch_debug: False
- pyTorch_version: 1.11.0
- pytorch-lightning: 1.5.10
- tqdm: 4.62.3
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.7
- version: #31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022
- PyTorch Lightning Version (e.g., 1.5.0): 1.5.10
- PyTorch Version (e.g., 1.10): 1.11.0
- Python version (e.g., 3.9): 3.9.7
- OS (e.g., Linux): Linux (#31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022)
- CUDA/cuDNN version: 11.3
- GPU models and configuration: NVIDIA TITAN X (Pascal)
- How you installed PyTorch (conda, pip, source): conda
- Any other relevant information: My script works on a single GPU and with DDP. I have been trying to get WandbLogger to work with both, but I get the same error in each case.
Additional context
I am refactoring my code, so I have updated PyTorch to 1.11 and PyTorch Lightning to 1.5.10. I have been developing the setup step by step so I can ensure that everything works as expected. I am using Hydra and Rich as the only external libraries (besides PyTorch, PyTorch Lightning, torchvision, and torchmetrics).
cc @awaelchli @morganmcg1 @AyushExel @borisdayma @scottire @manangoel99
Awesome! Thanks for updating us on this.
Same issue. I removed the pip version and installed the conda version, but the issue persists.
I have found the problem. While debugging the DDP approach (which I was doing manually), I left some debugging code behind in python3.9/multiprocessing/popen_spawn_posix.py, line 46 (Popen._launch).
This isn't a bug in wandb or in PyTorch Lightning; I introduced it myself while debugging DDP to build my custom version. It is interesting that neither the other loggers nor PL's DDP ran into this issue.
Now everything is working! 😄
I do apologize for the confusion. Thanks for your help @morganmcg1 @manangoel99 @akihironitta (I will close the issue).
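For anyone who sees similar unexplained output, one way to check for leftover debugging statements in the standard-library file named above is to scan its source for print() calls. This is a minimal sketch, not part of the original thread. The helper find_print_calls is hypothetical, and it only detects plain print(...) calls, not other kinds of edits.

```python
import ast
import inspect
import multiprocessing.popen_spawn_posix as spawn_posix
from typing import List, Tuple


def find_print_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each print(...) call in source."""
    lines = source.splitlines()
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "print"
        ):
            hits.append((node.lineno, lines[node.lineno - 1].strip()))
    return sorted(hits)


if __name__ == "__main__":
    # Show which file the interpreter actually loads, then list any print calls.
    print("Inspecting:", inspect.getsourcefile(spawn_posix))
    for lineno, text in find_print_calls(inspect.getsource(spawn_posix)):
        print(f"  line {lineno}: {text}")
```

Any hits are worth a closer look. If the file has been edited, reinstalling the Python environment restores the original.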