Failed or Offline run stuck indefinitely while syncing
See original GitHub issueI had a gpu lost error and even in offline mode wandb is stuck at syncing and exiting. Happened more than once, I had to close the terminal entirely each time to stop the process.
Exception in thread Thread-4: Traceback (most recent call last): File “/usr/lib/python3.6/threading.py”, line 916, in _bootstrap_inner self.run() File “/usr/lib/python3.6/threading.py”, line 864, in run self._target(*self._args, **self._kwargs) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/internal/stats.py”, line 98, in _thread_body stats = self.stats() File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/internal/stats.py”, line 140, in stats handle = pynvml.nvmlDeviceGetHandleByIndex(i) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/vendor/pynvml/pynvml.py”, line 819, in nvmlDeviceGetHandleByIndex _nvmlCheckReturn(ret) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/vendor/pynvml/pynvml.py”, line 310, in _nvmlCheckReturn raise NVMLError(ret) wandb.vendor.pynvml.pynvml.NVMLError_GpuIsLost: GPU is lost
wandb: Offline run mode, not syncing to the cloud.
wandb: Tracking run with wandb version 0.10.2
wandb: W&B is disabled in this directory. Run wandb on
to enable cloud syncing.
wandb: Run data is saved locally in wandb/offline-run-20200925_154531-23nxnmk1
Traceback (most recent call last): File “/home/saurav/Sowmen_DL/image_manipulation/train_attn.py”, line 416, in <module> patch_size=patch_size, File “/home/saurav/Sowmen_DL/image_manipulation/train_attn.py”, line 69, in train model = EfficientNet(‘tf_efficientnet_b4_ns’).to(device) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 607, in to return self._apply(convert) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 354, in _apply module._apply(fn) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 354, in _apply module._apply(fn) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 354, in _apply module._apply(fn) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 376, in _apply param_applied = fn(param) File “/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py”, line 605, in convert return t.to(device, dtype if t.is_floating_point() else None, non_blocking) RuntimeError: CUDA error: out of memory
wandb: Waiting for W&B process to finish, PID 2033 wandb: Program failed with code 1. wandb: Find user logs for this run at: wandb/offline-run-20200925_154531-23nxnmk1/logs/debug.log wandb: Find internal logs for this run at: wandb/offline-run-20200925_154531-23nxnmk1/logs/debug-internal.log wandb: You can sync this run to the cloud by running: wandb: wandb sync wandb/offline-run-20200925_154531-23nxnmk1
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Issue-Label Bot is automatically applying the label
bug
to this issue, with a confidence of 0.98. Please mark this comment with 👍 or 👎 to give our bot feedback!Links: app homepage, dashboard and code for this bot.
Hey @adarshbhandaryp,
A “GPU is lost” error, most usually has to do with your system’s settings. I would recommend checking your BIOS settings, making sure your computer is not overclocked and that your GPU is seated in your PCIe slot well.
Thanks, Ramit