Multi-GPU sweep fails to create a second agent with 400 error
See original GitHub issueDescribe the bug
Multi-GPU sweep fails to create a second agent with error attached. The first agent starts as expected.
The sweep ID is qnjyxjtu
.
To Reproduce Steps to reproduce the behavior: CUDA_VISIBLE_DEVICES=0 wandb agent kaiwenw/CartPole-v0/qnjyxjtu CUDA_VISIBLE_DEVICES=1 wandb agent kaiwenw/CartPole-v0/qnjyxjtu
Expected behavior Expect two agents to start, and both log to the same sweep.
Operating System Distributor ID: Ubuntu Description: Ubuntu 18.04.4 LTS Release: 18.04 Codename: bionic With 2 Nvidia GPUs on CUDA 10.2 running PyTorch-Lightning v1.1, PyTorch v1.8 Note the issue persists even without using GPU.
Additional context Full trace:
(torch) ubuntu@ip-172-31-10-97:~/oppg$ CUDA_VISIBLE_DEVICES=0 wandb agent kaiwenw/CartPole-v0/qnjyxjtu
wandb: Starting wandb agent 🕵️
2020-12-14 09:13:58,062 - wandb.wandb_agent - INFO - Running runs: []
400 response executing GraphQL.
{"errors":[{"message":"ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() (<Response [400]>)
wandb: Terminating and syncing runs. Press ctrl-c to kill.
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1317, in agent_heartbeat
response = self.gql(
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/old/retry.py", line 96, in __call__
result = self._call_fn(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 136, in execute
six.reraise(*sys.exc_info())
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/six.py", line 703, in reraise
raise value
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 130, in execute
return self.client.execute(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
request.raise_for_status()
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/requests/models.py", line 943, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.wandb.ai/graphql
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/torch/bin/wandb", line 8, in <module>
sys.exit(cli())
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/cli/cli.py", line 86, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/cli/cli.py", line 806, in agent
wandb_agent.agent(sweep_id, entity=entity, project=project, count=count)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/wandb_agent.py", line 558, in agent
return run_agent(
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/wandb_agent.py", line 518, in run_agent
agent.run()
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/wandb_agent.py", line 252, in run
commands = self._api.agent_heartbeat(agent_id, {}, run_status)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/apis/internal.py", line 92, in agent_heartbeat
return self.api.agent_heartbeat(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 1328, in agent_heartbeat
message = ast.literal_eval(e.args[0])["message"]
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/ast.py", line 59, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/home/ubuntu/miniconda3/envs/torch/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
400 Client Error: Bad Request for url: https://api.wandb.ai/graphql
^
SyntaxError: invalid syntax
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (1 by maintainers)
Top Results From Across the Web
Multi-GPU sweep fails to create a second agent with 400 error
Describe the bug Multi-GPU sweep fails to create a second agent with error attached. The first agent starts as expected.
Read more >FAQ - Documentation - Weights & Biases - Wandb
Finally, start new W&B Sweep agents with the new Sweep ID. Parameter combinations with completed W&B Runs are not re-executed.
Read more >Problems with multi-gpus - MATLAB Answers - MathWorks
Error using Composite/subsasgn (line 103). An invalid indexing request was made. Struct contents reference from a non-struct array object.
Read more >How do I set GPU affinity of workers - RLlib - Ray.io
Lets say I am using IMPALA with several workers and I have 4 GPUs. I want to be able to say GPUs 0,1,2...
Read more >CUDA_ERROR_OUT_OF_MEM...
It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Experiencing the same issue here while doing sweep - first run ok, after that it throws the error above. I have something like the following in my sweep
The doc link above does not provide any clue on dealing with using list as value.
From the error it seems like the issue lies with how the call to W&B API parses the value - perhaps a guard in the API / API call that guards for list values? Or if the API could accept list?
A temporary workaround is to put the list values as strings, then in the trainer process parse the string into list again. This is definitely not ideal tho.
In the past year we’ve majorly reworked the CLI and UI for Weights & Biases. We’re closing issues older than 6 months. Please comment to reopen.