Python: Trying to resume sweep after crashing due to CUDA memory error
I am running sweeps on a project with simpletransformers. After running some sweeps, the program fails with this error:
wandb: WARNING Ignored wandb.init() arg project when running a sweep
Running Epoch 0 of 3: 0%| | 0/1739 [00:10<?, ?it/s]
Epoch 1 of 3: 0%| | 0/3 [00:11<?, ?it/s]
wandb: Waiting for W&B process to finish, PID 53857
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: /root/wandb/run-20201116_145639-vsy1bnpn/logs/debug.log
wandb: Find internal logs for this run at: /root/wandb/run-20201116_145639-vsy1bnpn/logs/debug-internal.log
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced ethereal-sweep-5: https://wandb.ai/odrec/Classification%20Model%20Comparison/runs/vsy1bnpn
wandb: ERROR Run vsy1bnpn errored: RuntimeError('Caught RuntimeError in replica 0 on device 0.\nOriginal Traceback (most recent call last):\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker\n output = module(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl\n result = self.forward(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/simpletransformers/classification/transformer_models/bert_model.py", line 57, in forward\n outputs = self.bert(\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl\n result = self.forward(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 833, in forward\n encoder_outputs = self.encoder(\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl\n result = self.forward(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 476, in forward\n layer_outputs = layer_module(\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl\n result = self.forward(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 422, in forward\n layer_output = apply_chunking_to_forward(\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1696, in apply_chunking_to_forward\n return forward_fn(*input_tensors)\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 429, in feed_forward_chunk\n intermediate_output = self.intermediate(attention_output)\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in 
_call_impl\n result = self.forward(*input, **kwargs)\n File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 357, in forward\n hidden_states = self.intermediate_act_fn(hidden_states)\n File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/functional.py", line 1369, in gelu\n return torch._C._nn.gelu(input)\nRuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 10.92 GiB total capacity; 9.11 GiB already allocated; 18.00 MiB free; 9.30 GiB reserved in total by PyTorch)\n')
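As an aside, the out-of-memory error itself can often be avoided by shrinking the per-step memory footprint before retrying the sweep. A minimal sketch in the simpletransformers style follows; the exact key names are an assumption on my part, so check them against your installed version:

```python
# Sketch of memory-reducing training args in the simpletransformers style.
# The key names below are assumptions based on common simpletransformers options.
train_args = {
    "train_batch_size": 8,             # smaller batches allocate less activation memory
    "gradient_accumulation_steps": 2,  # preserves an effective batch size of 16
    "fp16": True,                      # mixed precision roughly halves activation memory
    "max_seq_length": 128,             # shorter sequences sharply cut attention memory
}

# Effective batch size seen by the optimizer stays the same despite the
# smaller per-step allocation:
effective_batch = train_args["train_batch_size"] * train_args["gradient_accumulation_steps"]
```

Halving `train_batch_size` again (and doubling the accumulation steps) trades speed for memory without changing the optimization behaviour.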
So I am trying to resume by loading the previous sweep id and calling the function again:
wandb.agent(sweep_id, function=lambda: self.train(train_args))
This is how the sweep is configured:
sweep_config = {
    "name": self.model_name,
    "method": "bayes",
    "metric": {"name": "mcc", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 10},
        "learning_rate": {"min": 0, "max": 4e-4},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 6},
}
sweep_id = wandb.sweep(sweep_config, project="Classification Model Comparison")
return sweep_config, sweep_id
But when I try to resume using the id, I get this error:
Traceback (most recent call last):
  File "main.py", line 217, in <module>
    SM.main()
  File "/root/semantic_models.py", line 186, in main
    wandb.agent(sweep_id, function=lambda: self.train(train_args))
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/wandb_agent.py", line 556, in agent
    return pyagent(sweep_id, function, entity, project, count)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 343, in pyagent
    agent.run()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 319, in run
    self._setup()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 140, in _setup
    self._register()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 117, in _register
    agent = self._api.register_agent(socket.gethostname(), sweep_id=self._sweep_id)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/internal.py", line 89, in register_agent
    return self.api.register_agent(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/normalize.py", line 62, in wrapper
    six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
  File "/root/.local/lib/python3.8/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1274, in register_agent
    response = self.gql(
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/old/retry.py", line 105, in __call__
    if not check_retry_fn(e):
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1272, in no_retry_4xx
    raise UsageError(body["errors"][0]["message"])
wandb.errors.error.CommError: could not find sweep odrec/uncategorized/o0gdpg7u during createAgent
As you can see, the program is looking for the sweep at odrec/uncategorized/o0gdpg7u, but the sweep actually lives at odrec/Classification Model Comparison/sweeps/o0gdpg7u.
How can I tell it to look for the sweep somewhere other than the uncategorized project?
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
You can specify the "project" keyword in your call to wandb.agent.
Hey @RGring, when using the function form of the agent, you must pass in only the id instead of the full path, i.e.
Calling .agent will never create a new sweep. Sweeps can only be created via the UI or via wandb sweep or wandb.sweep. If you can provide this ticket with an exact set of steps you're trying to follow that is causing an error for you, we can get to the bottom of what's up.
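Putting those two comments together, resuming might look like the sketch below. The bare_sweep_id and resume_sweep helpers are my own illustration, not part of wandb; the point is that wandb.agent's function form takes the short sweep id plus an explicit project keyword:

```python
def bare_sweep_id(sweep_path: str) -> str:
    """Reduce a full sweep path such as 'entity/project/sweeps/sweep_id'
    to the bare id that wandb.agent's function form expects."""
    return sweep_path.rsplit("/", 1)[-1]


def resume_sweep(project: str, sweep_path: str, train_fn):
    """Re-attach an agent to an existing sweep in the right project."""
    import wandb  # imported here so bare_sweep_id stays dependency-free

    wandb.agent(
        bare_sweep_id(sweep_path),
        function=train_fn,
        project=project,  # keeps the agent from defaulting to "uncategorized"
    )
```

Usage in the context of this issue would be something like resume_sweep("Classification Model Comparison", "odrec/Classification Model Comparison/sweeps/o0gdpg7u", lambda: self.train(train_args)).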