
Python: Trying to resume sweep after crashing due to CUDA memory error

See original GitHub issue

I am running sweeps on a project with simpletransformers. After running some sweeps, the program fails with this error:

wandb: WARNING Ignored wandb.init() arg project when running a sweep
Running Epoch 0 of 3:   0%|                                                                                                                                                | 0/1739 [00:10<?, ?it/s]
Epoch 1 of 3:   0%|                                                                                                                                                           | 0/3 [00:11<?, ?it/s]

wandb: Waiting for W&B process to finish, PID 53857
wandb: Program failed with code 1.  Press ctrl-c to abort syncing.
wandb:                                                                                
wandb: Find user logs for this run at: /root/wandb/run-20201116_145639-vsy1bnpn/logs/debug.log
wandb: Find internal logs for this run at: /root/wandb/run-20201116_145639-vsy1bnpn/logs/debug-internal.log
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: 
wandb: Synced ethereal-sweep-5: https://wandb.ai/odrec/Classification%20Model%20Comparison/runs/vsy1bnpn
wandb: ERROR Run vsy1bnpn errored: RuntimeError('Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/simpletransformers/classification/transformer_models/bert_model.py", line 57, in forward
    outputs = self.bert(
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 833, in forward
    encoder_outputs = self.encoder(
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 476, in forward
    layer_outputs = layer_module(
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 422, in forward
    layer_output = apply_chunking_to_forward(
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1696, in apply_chunking_to_forward
    return forward_fn(*input_tensors)
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 429, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/transformers/modeling_bert.py", line 357, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/root/complex_semantics/lib/python3.8/site-packages/torch/nn/functional.py", line 1369, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 10.92 GiB total capacity; 9.11 GiB already allocated; 18.00 MiB free; 9.30 GiB reserved in total by PyTorch)')

So I am trying to resume by loading the previous sweep ID and calling the function again:

wandb.agent(sweep_id, function=lambda: self.train(train_args))
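To be concrete, this is roughly what I mean by loading the previous sweep ID (the sweep_id.txt file name is just an illustration, not my exact code):

# Illustrative sketch: save the ID returned by wandb.sweep() on the first run
sweep_id = wandb.sweep(sweep_config, project="Classification Model Comparison")
with open("sweep_id.txt", "w") as f:
    f.write(sweep_id)

# ...and on resume, read it back and re-attach an agent to the same sweep
with open("sweep_id.txt") as f:
    sweep_id = f.read().strip()
wandb.agent(sweep_id, function=lambda: self.train(train_args))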

This is how the sweep is configured:

sweep_config = {
    "name": self.model_name,
    "method": "bayes",
    "metric": {"name": "mcc", "goal": "maximize"},
    "parameters": {
        "num_train_epochs": {"min": 1, "max": 10},
        "learning_rate": {"min": 0, "max": 4e-4},
    },
    "early_terminate": {"type": "hyperband", "min_iter": 6},
}
sweep_id = wandb.sweep(sweep_config, project="Classification Model Comparison")
return sweep_config, sweep_id

But when I try to resume using the ID, I get this error:

Traceback (most recent call last):
  File "main.py", line 217, in <module>
    SM.main()
  File "/root/semantic_models.py", line 186, in main
    wandb.agent(sweep_id, function=lambda: self.train(train_args))
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/wandb_agent.py", line 556, in agent
    return pyagent(sweep_id, function, entity, project, count)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 343, in pyagent
    agent.run()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 319, in run
    self._setup()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 140, in _setup
    self._register()
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/agents/pyagent.py", line 117, in _register
    agent = self._api.register_agent(socket.gethostname(), sweep_id=self._sweep_id)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/internal.py", line 89, in register_agent
    return self.api.register_agent(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/normalize.py", line 62, in wrapper
    six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
  File "/root/.local/lib/python3.8/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1274, in register_agent
    response = self.gql(
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/old/retry.py", line 105, in __call__
    if not check_retry_fn(e):
  File "/root/complex_semantics/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1272, in no_retry_4xx
    raise UsageError(body["errors"][0]["message"])
wandb.errors.error.CommError: could not find sweep odrec/uncategorized/o0gdpg7u during createAgent

As you can see, the program is looking for the sweep at odrec/uncategorized/o0gdpg7u, but the sweep actually lives at odrec/Classification Model Comparison/sweeps/o0gdpg7u.

How can I tell it to look for the sweep under the right project instead of the uncategorized URL?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

4 reactions
vanpelt commented on Nov 16, 2020

You can specify the “project” keyword in your call to wandb.agent
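Applied to the snippet from the question, that would look roughly like this (the project name is taken from the sweep config above):

wandb.agent(
    sweep_id,                                   # the bare sweep ID, e.g. "o0gdpg7u"
    project="Classification Model Comparison",  # project the sweep was created under
    function=lambda: self.train(train_args),
)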

1 reaction
vanpelt commented on Feb 4, 2021

Hey @RGring, when using the function form of the agent, you must pass in only the id instead of the full path, i.e.

wandb.agent("aiehwvid", project="retry-deadlock", function=train)

Calling .agent will never create a new sweep. Sweeps can only be created via the UI, the wandb sweep CLI command, or wandb.sweep. If you can provide this ticket with the exact set of steps you’re trying to follow that causes the error, we can get to the bottom of what’s up.
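Putting both points together, the intended flow looks something like the following sketch (reusing the names from the example above):

# Create the sweep exactly once -- via the UI, `wandb sweep`, or wandb.sweep()
sweep_id = wandb.sweep(sweep_config, project="retry-deadlock")

# Any later wandb.agent() call only attaches an agent to the existing sweep;
# pass the bare ID plus the project so it isn't resolved under "uncategorized"
wandb.agent(sweep_id, project="retry-deadlock", function=train)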

