[CLI]: Wandb causes training to error - Dropped streaming file chunk
Describe the bug
After training for several hours and capturing a lot of data points, wandb errors out.
import wandb

wandb.init(project="continous de rl", entity="samholt")
wandb.config.update({'learning_rate': 1e-3})
total_epochs = 10000000000
# Create model
model = get_model()
wandb.config.update({"model_dyn_parameters": scalar_count_of_model_parameters})
# Training loop
for epoch_i in range(total_epochs):
    wandb.log({"loss": track_loss, "epoch": epoch_i})
    wandb.watch(model)
The model has roughly 18,000 to 250,000 parameters, depending on the experiment. Every experiment hits the same issue: after training for a few hours (around 3), wandb errors out with the traceback below.
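For context on where all that traffic comes from: as the traceback shows, wandb.watch registers backward hooks on the model's parameters, and those hooks periodically publish tensor statistics to the separate wandb service process over a local socket (the sock_client frames in the traceback). A minimal sketch of that hook pattern in plain PyTorch, with log_stats standing in for wandb's internal callback; this is an illustration under those assumptions, not the actual wandb code:

import torch
import torch.nn as nn

def log_stats(name, grad):
    # Stand-in for wandb's callback: compute stats for the gradient
    # and hand them to the logging backend.
    print(name, grad.abs().mean().item())

model = nn.Linear(128, 128)  # stand-in for get_model()

# Roughly the hook pattern wandb.watch(model) sets up: one backward hook per
# parameter tensor (wandb additionally throttles how often the callback logs).
for name, param in model.named_parameters():
    param.register_hook(lambda grad, name=name: log_stats(name, grad))

x = torch.randn(32, 128)
loss = model(x).sum()
loss.backward()  # fires the hooks; with per-iteration logging this runs every step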
07:50:48,303 root INFO [epoch=0402|iter=0501] train_loss=0.04047 | s/it=0.11040
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)
Traceback (most recent call last):
File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 223, in <module>
logger.info(get_NL_worldmodel(retrain=True))
File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 166, in get_NL_worldmodel
loss_dyn_.backward()
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 264, in <lambda>
handle = var.register_hook(lambda grad: _callback(grad, log_track))
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 262, in _callback
self.log_tensor_stats(grad.data, name)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 238, in log_tensor_stats
wandb.run._log(
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1375, in _log
self._partial_history_callback(data, step, commit)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1259, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 553, in publish_partial_history
self._publish_partial_history(partial_history)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 67, in _publish_partial_history
self._publish(rec)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 150, in send_record_publish
self.send_server_request(server_req)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 84, in send_server_request
self._send_message(msg)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
self._sendall_with_error_handle(header + data)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle
sent = self._sock.send(data[total_sent:])
ConnectionResetError: [Errno 104] Connection reset by peer
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
self._sendall_with_error_handle(header + data)
File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle
sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe
Additional Files
An excerpt of debug-internal.log shows:
2022-08-23 07:50:51,832 DEBUG HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,833 DEBUG HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,834 DEBUG HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:52,180 DEBUG SenderThread:462244 [sender.py:send():302] send: history
2022-08-23 07:50:53,904 INFO Thread-12 :462244 [dir_watcher.py:_on_file_modified():292] file/dir modified: /home/sam/code/continous_control/continousderl/wandb/run-20220823_040512-26dgjkcf/files/output.log
2022-08-23 07:50:55,665 DEBUG SenderThread:462244 [sender.py:send():302] send: summary
2022-08-23 07:50:55,804 DEBUG HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:55,923 DEBUG HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
Environment
WandB version: 0.13.1
OS: Linux Mint 20.3 (codename una)
Python version: Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux
Versions of relevant libraries:
Additional Context
Please let me know how I can solve this problem; if wandb causes training runs to crash, it becomes unusable for me.
Thank you so much for your help and time.
Best, Sam

Hi @samholt!
That does not seem to be the whole debug-internal.log; would it be possible for you to share the complete log for me to have a look through? I also wanted to know if you are doing anything special or unique with multiprocessing. I'll help investigate this and get a solution going for you.
Thanks, Ramit
Dear @ramit-wandb,
I fixed it by not logging so much data all the time: I removed the line wandb.watch(model). Thank you so much for the help as well.
I love Weights & Biases and consistently use it for all my work!
Keep up the great work,
Best, Sam
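
For anyone else who hits this, a middle ground between watching everything and removing wandb.watch entirely is to throttle how often data is streamed. A minimal sketch, assuming a small stand-in model and optimizer in place of the original get_model() / training code: wandb.watch's log_freq argument sets how many batches pass between gradient logs, and scalar metrics can be logged every N iterations instead of every step.

import torch
import torch.nn as nn
import wandb

# Hypothetical stand-ins for the model and loss in the original snippet.
model = nn.Linear(64, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

wandb.init(project="continous de rl", entity="samholt")
wandb.config.update({"learning_rate": 1e-3})

# Call watch once, before the loop, and throttle gradient logging:
# log_freq is the number of batches between gradient logs.
wandb.watch(model, log="gradients", log_freq=500)

log_every = 100  # stream scalar metrics every 100 iterations, not every step
total_epochs = 100_000

for epoch_i in range(total_epochs):
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch_i % log_every == 0:
        wandb.log({"loss": loss.item(), "epoch": epoch_i})

This keeps gradient and loss curves in the run without pushing a payload to the wandb service process on every single iteration.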