question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Sweeps erroring out after first run with Anaconda 400 and Graphql 500 codes

See original GitHub issue

I’ve seen a lot of issues related to variants of this problem, but none of those with answers fixed my problem. Here’s what I’m doing:

  1. wandb sweep lstm_wandb_sweep.yml
  2. wandb agent <SWEEPID>
  3. Errors out after first run is finished

Here’s the traceback I’m seeing:

2020-10-06 21:23:10,589 - wandb.wandb_agent - INFO - Cleaning up finished run: lpt97394
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR  (<Response [500]>)
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR  (<Response [500]>)
500 response executing GraphQL.
{"errors":[{"message":"anaconda 400 error: {\"code\":400,\"message\":\"ValueError: False is not in list\"}\n","path":["agentHeartbeat"]}],"data":{"agentHeartbeat":null}}
wandb: ERROR Error while calling W&B API: anaconda 400 error: {"code":400,"message":"ValueError: False is not in list"}
wandb: ERROR  (<Response [500]>)
Retry attempt failed:
Traceback (most recent call last):
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/old/retry.py", line 96, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/internal/internal_api.py", line 134, in execute
    six.reraise(*sys.exc_info())
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/internal/internal_api.py", line 128, in execute
    return self.client.execute(*args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/opt/tmp/anaconda3/envs/DIU_NORAD/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql
wandb: Network error (HTTPError), entering retry loop. See wandb/debug-internal.log for full traceback.
500 response executing GraphQL.

It then essentially keeps repeating those same 400-then-500 error messages until it just locks up (keyboard interrupt will break it out of the cycle). My YAML config file content:

program: train_lstm_rnn_lightning.py
method: bayes # grid, random, or bayes
metric:
  goal: minimize # maximize, minimize
  name: val_loss #epoch?
  # can include ``target`` value if you know it too
early_terminate: # not super-intuitive what these do, but sounds good!
  type: hyperband
  max_iter: 27
  s: 2
  eta: 3
parameters:
  gradient_clip_val:
    max: 10.0
    min: 0.1
    distribution: uniform
  dropout_probability:
    max: 0.5
    min: 0.1
    distribution: uniform
  learning_rate:
    max: 0.004
    min: 0.0005
    distribution: uniform
  bidirectional:
    values:
      - "true"
      - "false"
    distribution: categorical
  batch_size:
    values: [128,256]
  hidden_dim:
    distribution: q_normal # q=1 just makes it select integers
    #mu: 250 # roughly end up with it selecting between 100 and 400
    #sigma: 50
    mu: 25 # remove when done testing sweeps for debugging purposes
    sigma: 10 # remove when done testing sweeps for debugging purposes
    q: 1
  n_layers:
    values: [2,3]

I’m definitely logging the loss metric as val_loss in my PyTorch-Lightning object, so the answer in the FAQ that is related to this does not appear to be relevant. Currently using wandb=0.10.4, pytorch-lightning=0.4.1, and pytorch=1.4.0 (all conda-installed). Any thoughts on what could be the problem?

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:7 (3 by maintainers)

github_iconTop GitHub Comments

3reactions
raubitsjcommented, Oct 6, 2020

@emigre459: This was indeed the problem, if you go to the overview tab of the run that finished and click on the Raw config you can see it in json form:

    "bidirectional": {
        "desc": null,
        "value": false
    },

Notice that false is not in quotes, this means it is the boolean value false. yet your sweep config is using the string “false”.

I think this is a side effect of either your script or the pytorch lightning integration overwriting the config and changing the type from string to boolean.

To resolve this, you should be able to specify boolean values in your sweep config:

Here is an example:

program: train-bool.py
method: grid
# Parameters to search over
parameters:
  optimizer:
    values: ["sgd", "adam"]
  flag1:
    values:
    - false
    - true
  flag2:
    values:
    - false
    - true

And if you are using argparse to parse values, you can use:

#!/usr/bin/env python
import wandb
import argparse
import time
parser = argparse.ArgumentParser()
parser.add_argument("--optimizer", default=None, type=str)
parser.add_argument("--flag1", default=False, type=lambda s: s.lower() == 'true')
parser.add_argument("--flag2", default=True, type=lambda s: s.lower() == 'true')
args = parser.parse_args()
run = wandb.init(config=args)
2reactions
issue-label-bot[bot]commented, Oct 6, 2020

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.90. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Sweeps erroring out after first run with Anaconda 400 and ...
I've seen a lot of issues related to variants of this problem, but none of those with answers fixed my problem. Here's what...
Read more >
Troubleshooting
Navigator error on start up; Issues launching or initializing; PermissionError on macOS Catalina; Access denied error; Navigator buttons are missing ...
Read more >
Apollo-client returns "400 (Bad Request) Error" on sending ...
400 errors generally mean there's something off with the query itself. In this instance, you've defined (and you're passing in) a variable ...
Read more >
FAQ - Documentation - Weights & Biases - Wandb
Delete the W&B Runs ones you want to re-execute, then choose the Resume button on the sweep control page. Finally, start new W&B...
Read more >
Untitled
G celsis, 330 kit for 300ex, F1 after race concert 2014 abu dhabi, ... Prezzo voucher code on phone, Interfleet group, 20129 1st...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found