Network error (ReadTimeout), entering retry loop during sweep

  • Weights and Biases version: 0.9.1
  • Python version: 3.7.7
  • Operating System: MacOS X / Debian

Description

I started a wandb sweep (grid search) and I am getting a lot of "Network error (ReadTimeout), entering retry loop." messages on various kinds of machines in different kinds of networks (this happens both at home on my laptop and on larger remote computers). Sometimes runs get through and sync, but then at seemingly random times the network error shows up. Sometimes it resolves itself within a minute or two; sometimes it stays down for half an hour or longer. I've posted the log below.

What I Did

2020-08-04 16:35:53,246 ERROR   MainThread:13897 [retry.py:__call__():108] Retry attempt failed:
Traceback (most recent call last):
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 725, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/wandb/retry.py", line 95, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/wandb/apis/internal.py", line 108, in execute
    return self.client.execute(*args, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/gql/transport/requests.py", line 38, in execute
    request = requests.post(self.url, **post_args)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/Users/forduniver/anaconda3/envs/test_repo/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (6 by maintainers)

Top GitHub Comments

8 reactions
issue-label-bot[bot] commented, Aug 4, 2020

Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.85. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

1 reaction
raubitsj commented, Apr 30, 2021

@XikunZhang, I'm sorry about the error you experienced today. We had a database outage for ~40 minutes, which is likely when you saw this error. https://status.wandb.com/incidents/s70l1pfsyz6c

During this time, any running sweeps were likely reporting errors, but a sweep should continue to run and recover at the end of the outage.

Any training jobs already started by the sweep should also continue running during the outage. They will report messages about failures, but they will recover once the outage ends.
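
The recover-after-outage behaviour described here can be sketched as a generic retry loop with exponential backoff. This is only an illustration and not wandb's actual retry code: `call_with_retry`, `FlakyBackend`, and the delay values are hypothetical names and numbers chosen for the example, and the backend is simulated so the sketch runs without a network.

```python
import socket
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call fn(); on a read timeout, wait and try again with exponential backoff.

    Hypothetical helper for illustration only; not wandb's retry logic.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except socket.timeout as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            print(f"Network error ({exc}), retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap


class FlakyBackend:
    """Simulated backend that times out for the first `outage` requests."""

    def __init__(self, outage):
        self.outage = outage
        self.calls = 0

    def request(self):
        self.calls += 1
        if self.calls <= self.outage:
            raise socket.timeout("The read operation timed out")
        return "ok"


backend = FlakyBackend(outage=3)
waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(backend.request, sleep=waits.append)
```

Passing `sleep=waits.append` keeps the example instant; in real use the default `time.sleep` would pause between attempts.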

When the backend is not responding, by default we do not let new training jobs start because we want to make sure the data is saved. If you want to run offline and sync the data later, you can choose that with wandb offline, WANDB_MODE="offline", or wandb.init(mode="offline").
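
As a rough sketch of the offline options mentioned in this comment, the snippet below sets the environment variable and wraps the `wandb.init(mode="offline")` call in a small function. It assumes a wandb release that accepts the `mode` argument (the 0.9.1 version from the original report may predate it); `start_offline_run` and the project name in the usage note are hypothetical.

```python
import os

# Must be set before wandb.init() is called so the library picks it up.
os.environ["WANDB_MODE"] = "offline"


def start_offline_run(project):
    """Start a run that only writes to local disk.

    Hypothetical helper; assumes wandb is installed and supports mode="offline".
    """
    import wandb  # imported lazily so this sketch loads without wandb installed

    # mode="offline" is the per-run equivalent of the environment variable above.
    return wandb.init(project=project, mode="offline")
```

A training script could then call `start_offline_run("my-sweep-project")` in place of a plain `wandb.init(...)`. Locally stored runs can be uploaded later with the `wandb sync` command once the backend is reachable again.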

We have a project to improve the error messages so they are more readable. We included this in the 0.10.28 release for rate-limited requests and will do the same for other types of network errors in the future.


