wandb hangs experiment (10 min+): Internal Server Error for url: https://api.wandb.ai/graphql
wandb --version && python --version && uname
- Weights and Biases version: 0.8.35
- Python version: 3.7
- Operating System: Linux
Description
For the past few days, I have noticed experiments hanging during wandb logging. Sometimes I even saw crashes.
So far, downgrading to 0.8.33 seems to help. Will report if the problem arises again.
What I Did
2020-05-05 09:25:36,056 ERROR Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:36,057 ERROR Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}
2020-05-05 09:25:36,058 ERROR Thread-18 :22373 [retry.py:__call__():108] Retry attempt failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/wandb/retry.py", line 95, in __call__
result = self._call_fn(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/wandb/apis/internal.py", line 116, in execute
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/wandb/apis/internal.py", line 110, in execute
return self.client.execute(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/gql/transport/requests.py", line 39, in execute
request.raise_for_status()
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql
2020-05-05 09:25:39,928 INFO Thread-3 :22373 [run_manager.py:_on_file_modified():691] file/dir modified: <redacted>/run-20200505_072213-camelyon-16384-full-correct-loss/wandb-metadata.json
2020-05-05 09:25:40,883 ERROR Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:40,884 ERROR Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}
2020-05-05 09:25:50,866 ERROR Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:50,867 ERROR Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}
2020-05-05 09:25:51,143 WARNING Thread-7 :22373 [util.py:request_with_retry():614] requests_with_retry encountered retryable exception: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/files/hanspinckaers/camelyon/camelyon-16384-full-correct-loss/file_stream. args: ('https://api.wandb.ai/files/hanspinckaers/camelyon/camelyon-16384-full-correct-loss/file_stream',), kwargs: {'json': {'files': {'output.log':
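The log above shows the client retrying the failing GraphQL call several times while the server keeps returning 500. As a rough illustration of how a retry loop can be bounded so that it gives up instead of stalling a run, here is a minimal generic sketch. It is not wandb's `retry.py`; the names `call_with_deadline` and `RetryDeadlineExceeded` and all parameter values are hypothetical and chosen only for this example.

```python
import time


class RetryDeadlineExceeded(Exception):
    """Raised when retries are abandoned because the time budget ran out."""


def call_with_deadline(fn, max_seconds=60.0, base_delay=1.0, max_delay=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call fn(), retrying with exponential backoff until max_seconds is used up.

    Hypothetical helper for illustration; not part of the wandb API.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    delay = base_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn()
        except Exception as exc:
            elapsed = clock() - start
            # Stop if the next wait would push us past the time budget.
            if elapsed + delay > max_seconds:
                raise RetryDeadlineExceeded(
                    "gave up after %d attempts in %.1fs" % (attempts, elapsed)
                ) from exc
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

A training loop could wrap a non-critical call in `call_with_deadline` and catch `RetryDeadlineExceeded` to skip that step, so a server outage costs a bounded amount of time rather than blocking the experiment.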
Issue Analytics
- State:
- Created 3 years ago
- Comments:9 (3 by maintainers)
wandb hangs experiment (10 min+): Internal Server Error for url: https://api.wandb.ai/graphql #1016. Closed. HansPinckaers opened this issue on ...
Top GitHub Comments
@richardrl we had an outage last night that caused these errors. Everything should be functioning properly now.
Hi, I’m experiencing this issue in a Google Colab environment. To reproduce:
then
output: