Sweeps erroring out after first run with Anaconda 400 and GraphQL 500 codes
I’ve seen a lot of issues related to variants of this problem, but none of those with answers fixed my problem. Here’s what I’m doing:
- wandb sweep lstm_wandb_sweep.yml
- wandb agent <SWEEPID>
- Errors out after first run is finished
Here’s the traceback I’m seeing:
2020-10-06 21:23:10,589 - wandb.wandb_agent - INFO - Cleaning up finished run: lpt97394
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR (<Response [500]>)
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR (<Response [500]>)
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR (<Response [500]>)
Retry attempt failed:
Traceback (most recent call last):
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/old/retry.py", line 96, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/internal/internal_api.py", line 134, in execute
    six.reraise(*sys.exc_info())
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/internal/internal_api.py", line 128, in execute
    return self.client.execute(*args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql
wandb: Network error (HTTPError), entering retry loop. See wandb/debug-internal.log for full traceback.
500 response executing GraphQL.
It then essentially keeps repeating those same 400-then-500 error messages until it just locks up (keyboard interrupt will break it out of the cycle). My YAML config file content:
program: train_lstm_rnn_lightning.py
method: bayes # grid, random, or bayes
metric:
  goal: minimize # maximize, minimize
  name: val_loss #epoch?
  # can include ``target`` value if you know it too
early_terminate: # not super-intuitive what these do, but sounds good!
  type: hyperband
  max_iter: 27
  s: 2
  eta: 3
parameters:
  gradient_clip_val:
    max: 10.0
    min: 0.1
    distribution: uniform
  dropout_probability:
    max: 0.5
    min: 0.1
    distribution: uniform
  learning_rate:
    max: 0.004
    min: 0.0005
    distribution: uniform
  bidirectional:
    values:
      - "true"
      - "false"
    distribution: categorical
  batch_size:
    values: [128,256]
  hidden_dim:
    distribution: q_normal # q=1 just makes it select integers
    #mu: 250 # roughly end up with it selecting between 100 and 400
    #sigma: 50
    mu: 25 # remove when done testing sweeps for debugging purposes
    sigma: 10 # remove when done testing sweeps for debugging purposes
    q: 1
  n_layers:
    values: [2,3]
I’m definitely logging the loss metric as val_loss in my PyTorch-Lightning object, so the answer in the FAQ that is related to this does not appear to be relevant. Currently using wandb=0.10.4, pytorch-lightning=0.4.1, and pytorch=1.4.0 (all conda-installed). Any thoughts on what could be the problem?
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (3 by maintainers)

@emigre459: This was indeed the problem. If you go to the overview tab of the run that finished and click on the Raw config, you can see it in JSON form:
Notice that false is not in quotes, which means it is the boolean value false, yet your sweep config is using the string “false”. I think this is a side effect of either your script or the PyTorch Lightning integration overwriting the config and changing the type from string to boolean.
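The wording of the server error matches what Python itself reports when a boolean is looked up in a list of strings. The sketch below only reproduces that message locally. It assumes (this is not confirmed by the thread) that the sweep backend does a similar lookup of the reported value in the configured list; `allowed_values` and `reported_value` are hypothetical names.

```python
# Assumption: the backend looks up the run's reported config value in the
# list of values declared in the sweep config. Names here are illustrative.
allowed_values = ["true", "false"]  # strings, as written in the sweep YAML
reported_value = False              # boolean, as stored in the run's raw config

try:
    allowed_values.index(reported_value)
    error_message = None
except ValueError as exc:
    # The boolean False is not equal to the string "false", so the lookup fails.
    error_message = str(exc)

print(error_message)  # False is not in list
```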
To resolve this, you should be able to specify boolean values in your sweep config:
Here is an example:
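A minimal sketch of what such a config entry could look like, written as the Python dict form of the `bidirectional` parameter from the question. In the YAML file the equivalent change would be to write the two values unquoted, as `true` and `false`, so they are parsed as booleans. The name `parameters_fragment` is illustrative only.

```python
# Hypothetical fragment mirroring the `bidirectional` entry of the sweep config,
# with real booleans instead of the strings "true" / "false".
parameters_fragment = {
    "bidirectional": {
        "values": [True, False],
        "distribution": "categorical",
    }
}

# A boolean reported by the run now matches one of the configured values.
reported_value = False
is_listed = reported_value in parameters_fragment["bidirectional"]["values"]
print(is_listed)  # True
```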
And if you are using argparse to parse values, you can use:
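One way to do this with argparse is sketched below; it is not necessarily the snippet from the original comment. Plain `type=bool` is not enough, because `bool("false")` is `True` for any non-empty string, so a small converter is needed. The helper name `str_to_bool` and the accepted spellings are assumptions.

```python
import argparse


def str_to_bool(text):
    """Convert common textual spellings of a boolean to a real bool."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


parser = argparse.ArgumentParser()
# Hypothetical flag name, matching the `bidirectional` sweep parameter.
parser.add_argument("--bidirectional", type=str_to_bool, default=False)

# Simulates a command-line value arriving as text.
args = parser.parse_args(["--bidirectional=False"])
print(args.bidirectional)  # False
```

The converter is case-insensitive, so it handles both `true`/`false` and `True`/`False` spellings of the value.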
Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.90.