Failed or Offline run stuck indefinitely while syncing #1280

I hit a “GPU is lost” error, and even in offline mode wandb gets stuck at the syncing-and-exiting step. This has happened more than once, and each time I had to close the terminal entirely to stop the process.

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/internal/stats.py", line 98, in _thread_body
    stats = self.stats()
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/internal/stats.py", line 140, in stats
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/vendor/pynvml/pynvml.py", line 819, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/wandb/vendor/pynvml/pynvml.py", line 310, in _nvmlCheckReturn
    raise NVMLError(ret)
wandb.vendor.pynvml.pynvml.NVMLError_GpuIsLost: GPU is lost

wandb: Offline run mode, not syncing to the cloud.
wandb: Tracking run with wandb version 0.10.2
wandb: W&B is disabled in this directory. Run wandb on to enable cloud syncing.
wandb: Run data is saved locally in wandb/offline-run-20200925_154531-23nxnmk1

Traceback (most recent call last):
  File "/home/saurav/Sowmen_DL/image_manipulation/train_attn.py", line 416, in <module>
    patch_size=patch_size,
  File "/home/saurav/Sowmen_DL/image_manipulation/train_attn.py", line 69, in train
    model = EfficientNet('tf_efficientnet_b4_ns').to(device)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 607, in to
    return self._apply(convert)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 354, in _apply
    module._apply(fn)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 376, in _apply
    param_applied = fn(param)
  File "/home/saurav/Sowmen_DL/dfdc-env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 605, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory

wandb: Waiting for W&B process to finish, PID 2033
wandb: Program failed with code 1.
wandb: Find user logs for this run at: wandb/offline-run-20200925_154531-23nxnmk1/logs/debug.log
wandb: Find internal logs for this run at: wandb/offline-run-20200925_154531-23nxnmk1/logs/debug-internal.log
wandb: You can sync this run to the cloud by running:
wandb: wandb sync wandb/offline-run-20200925_154531-23nxnmk1
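
One possible workaround for a hang at the "Waiting for W&B process to finish" step is to bound the shutdown call with a timeout and force the process to exit if it does not return. The sketch below does not come from the issue thread and is not wandb's own mechanism. `run_with_timeout` is a hypothetical helper built only on the Python standard library, and whether it avoids this particular hang is untested.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Returns True if fn returned (or raised) within the limit, False if it
    was still running when the timeout expired. The worker is a daemon
    thread, so a stuck call does not keep the interpreter alive at exit.
    """
    done = threading.Event()

    def worker():
        try:
            fn()
        finally:
            # Signal completion even if fn raised an exception.
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    # Event.wait returns True if the flag was set before the timeout.
    return done.wait(timeout_s)
```

For instance, a training script could call `run_with_timeout(wandb.finish, 60.0)` in a `finally` block (assuming the installed wandb version exposes `wandb.finish`) and fall back to `os._exit(1)` when the helper returns False. The 60-second limit is an arbitrary choice, and `os._exit` skips normal cleanup, so it should stay a last resort.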

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
issue-label-bot[bot] commented, Sep 25, 2020

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.98. Please mark this comment with 👍 or 👎 to give our bot feedback!

0 reactions
ramit-wandb commented, Jan 4, 2022

Hey @adarshbhandaryp,

A “GPU is lost” error usually has to do with your system’s setup. I would recommend checking your BIOS settings, making sure your computer is not overclocked, and confirming that your GPU is seated properly in its PCIe slot.

Thanks, Ramit
