[CLI]: sweep agent crashes during wandb.init()
Describe the bug
The sweep agent crashes during wandb.init(), most likely because of a server-side change. This afternoon (13:00 - 15:00 CET) I ran a sweep as usual, but when I tried to start a very similar sweep later (after 18:30 CET), it no longer worked. Even an exact copy of the previous sweep, with the same sweep.yaml, the same code, and no package updates, crashed. I therefore strongly suspect a server-side change.
From the traceback below, I investigated sender.py and found that history is already a list of dictionaries, so its last element does not need to be parsed again:
history = json.loads(resume_status["historyTail"])
if history:
    history = json.loads(history[-1])
I don’t know why this started today or what the correct behavior would be, but removing the second json.loads call (the one on history[-1]) resolved the issue for me as a workaround. I don’t have code to reproduce the issue, but I would assume that it occurs with any project that has a run history. Just to be sure, my minimal sweep.yaml looks like this, though I guess it won’t change much:
method: grid
metric:
  goal: maximize
  name: probe/student.max
parameters:
  enc:
    value: vgg11
program: dino.py
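For illustration, the workaround described above can be sketched as a small helper that tolerates both payload shapes. This is a hypothetical standalone function, not the actual code in sender.py. It assumes that the server previously returned each history row as a JSON-encoded string and now returns plain objects, which is inferred only from the double json.loads and the TypeError. The name parse_last_history_row and the sample fields are made up for the example.

```python
import json


def parse_last_history_row(history_tail):
    """Return the last history row as a dict, or None if there are no rows.

    Hypothetical helper (not part of wandb). Accepts rows that are either
    JSON-encoded strings or already-decoded dicts.
    """
    rows = json.loads(history_tail)
    if not rows:
        return None
    last = rows[-1]
    # Decode a second time only if the row is still a JSON string;
    # calling json.loads on a dict raises the TypeError seen in the traceback.
    if isinstance(last, (str, bytes, bytearray)):
        last = json.loads(last)
    return last


# Assumed old shape: a JSON list of JSON-encoded strings.
old_tail = json.dumps([json.dumps({"_step": 3, "loss": 0.5})])
# Assumed new shape: a JSON list of plain objects.
new_tail = json.dumps([{"_step": 3, "loss": 0.5}])

print(parse_last_history_row(old_tail))
print(parse_last_history_row(new_tail))
```

Checking the type before decoding keeps the code working regardless of which shape the server sends, whereas simply deleting the second json.loads would break again if the old format came back.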
Thanks in advance and best wishes, Felix
Thread SenderThread: wandb.init()...
Traceback (most recent call last):
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 49, in run
self._run()
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 100, in _run
self._process(record)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/internal.py", line 309, in _process
self._sm.send(record)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 305, in send
send_handler(record)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 770, in send_run
error = self._maybe_setup_resume(run)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 629, in _maybe_setup_resume
history = json.loads(history[-1]) # ORIGINAL
File "/usr/lib/python3.8/json/__init__.py", line 341, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
2022-12-12 21:00:37,743 - wandb.wandb_agent - INFO - Running runs: ['w2ptq40k']
wandb: ERROR Internal wandb error: file data was not synced
Problem at: /local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py 357 experiment
Traceback (most recent call last):
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 697, in init
result = handle.wait(
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 259, in wait
raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
run = wi.init()
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 697, in init
result = handle.wait(
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 259, in wait
raise MailboxError("transport failed")
wandb.errors.MailboxError: transport failed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dino.py", line 254, in <module>
main(config)
File "dino.py", line 38, in main
wandb_logger = WandbLogger(
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 311, in __init__
_ = self.experiment
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
return fn(*args, **kwargs)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 357, in experiment
self._experiment = wandb.init(**self._wandb_init)
File "/local/home/safelix/venv/dino/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1115, in init
raise Exception("problem") from error_seen
Exception: problem
Additional Files
No response
Environment
WandB version: 0.13.6
OS: Ubuntu 20.04.2 LTS
Python version: Python 3.8.5
Versions of relevant libraries: json.__version__ = '2.0.9'
Additional Context
No response
Hi @safelix, this stemmed from a recent server-side change. The issue has now been resolved. Please let us know if you encounter any other problems.
Thanks for creating the issue. I have the same problem!