
[CLI]: Wandb causes training to error - Dropped streaming file chunk

See original GitHub issue

Describe the bug

After training for several hours, capturing allot of data points wandb errors


import wandb

wandb.init(project="continous de rl", entity="samholt")
wandb.config.update({'learning_rate': 1e-3})

total_epochs = 10000000000

# Create model (get_model, track_loss, and scalar_count_of_model_parameters
# are defined elsewhere in the training script)
model = get_model()

wandb.config.update({"model_dyn_parameters": scalar_count_of_model_parameters})

# Training loop
for epoch_i in range(total_epochs):
  wandb.log({"loss": track_loss, "epoch": epoch_i})
  wandb.watch(model)  # note: called on every iteration

The above model has roughly 18,000–250,000 parameters, depending on the experiment. Every experiment hits the same failure: wandb errors out after a few hours (about 3) of training.
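A back-of-envelope estimate of the logging volume, using the figures reported above (s/it = 0.11040 from the training log, crash after roughly 3 hours), suggests why the stream is so heavy:

```python
# Rough estimate from the numbers in this report: at 0.11040 s/it,
# a run that dies after ~3 hours has completed about this many iterations:
seconds = 3 * 3600
iters = int(seconds / 0.11040)
print(iters)  # 97826

# Each iteration issues one wandb.log call, and the wandb.watch gradient
# hooks log per-tensor statistics on every backward pass -- so the stream
# carries on the order of 10^5 history rows plus gradient stats for a
# model with up to ~250,000 parameters.
```

This is only an order-of-magnitude sketch, but it matches the traceback: the failure surfaces inside the gradient-hook logging path (`wandb_torch.py`) when the socket to the internal wandb service process is reset.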

07:50:48,303 root INFO [epoch=0402|iter=0501] train_loss=0.04047        | s/it=0.11040
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)
Traceback (most recent call last):
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 223, in <module>
    logger.info(get_NL_worldmodel(retrain=True))
  File "/home/sam/code/continous_control/continousderl/worldmodel_nl.py", line 166, in get_NL_worldmodel
    loss_dyn_.backward()
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 264, in <lambda>
    handle = var.register_hook(lambda grad: _callback(grad, log_track))
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 262, in _callback
    self.log_tensor_stats(grad.data, name)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/wandb_torch.py", line 238, in log_tensor_stats
    wandb.run._log(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1375, in _log
    self._partial_history_callback(data, step, commit)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1259, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 553, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_shared.py", line 67, in _publish_partial_history
    self._publish(rec)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 150, in send_record_publish
    self.send_server_request(server_req)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 84, in send_server_request
    self._send_message(msg)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
ConnectionResetError: [Errno 104] Connection reset by peer
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 81, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/sam/anaconda3/envs/cderl/lib/python3.9/site-packages/wandb/sdk/lib/sock_client.py", line 61, in _sendall_with_error_handle 
    sent = self._sock.send(data[total_sent:])
BrokenPipeError: [Errno 32] Broken pipe

Additional Files

debug-internal.log shows

2022-08-23 07:50:51,832 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,833 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:51,834 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:52,180 DEBUG   SenderThread:462244 [sender.py:send():302] send: history
2022-08-23 07:50:53,904 INFO    Thread-12 :462244 [dir_watcher.py:_on_file_modified():292] file/dir modified: /home/sam/code/continous_control/continousderl/wandb/run-20220823_040512-26dgjkcf/files/output.log
2022-08-23 07:50:55,665 DEBUG   SenderThread:462244 [sender.py:send():302] send: summary
2022-08-23 07:50:55,804 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history
2022-08-23 07:50:55,923 DEBUG   HandlerThread:462244 [handler.py:handle_request():141] handle_request: partial_history

Environment

WandB version: 0.13.1

OS: Linux Mint 20.3 (una)

Python version: Python 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] :: Anaconda, Inc. on linux

Versions of relevant libraries:

Additional Context

Please let me know how I can solve this problem — if wandb causes training runs to crash, it is unusable for me.

Thank you so much for helping and your time.

Best, Sam

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
ramit-wandb commented, Aug 23, 2022

Hi @samholt!

That does not seem to be the whole debug-internal.log — would it be possible for you to share the complete log so I can look through it? I also wanted to know whether you are doing anything special or unique with multiprocessing.

I’ll help investigate this and get a solution going for you.

Thanks, Ramit

0 reactions
samholt commented, Sep 17, 2022

Dear @ramit-wandb,

I fixed it by logging less data: I removed the line wandb.watch(model).

Thank you so much for the help as well.

I love Weights & Biases and consistently use it for all my work!

Keep up the great work,

Best, Sam
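Editor's note: beyond deleting the wandb.watch call outright, a lighter-touch variant is to call watch once, before the loop, and throttle how often anything is logged (wandb.watch does accept a log_freq parameter; log_every below is a hypothetical helper, not part of the wandb API, and the wandb calls are shown as comments so the sketch runs without wandb installed):

```python
# Sketch of a throttled logging pattern, assuming the training loop
# from the original report.

def log_every(n, step, payload, sink):
    """Forward payload to sink only on every n-th step."""
    if step % n == 0:
        sink(payload)

# In the real script this would be, roughly:
#   wandb.watch(model, log="gradients", log_freq=1000)  # once, before the loop
#   for epoch_i in range(total_epochs):
#       log_every(100, epoch_i, {"loss": track_loss, "epoch": epoch_i}, wandb.log)

# Self-contained demonstration of the gating logic:
rows = []
for step in range(10_000):
    log_every(100, step, {"epoch": step}, rows.append)
print(len(rows))  # 100 rows instead of 10,000
```

The point is simply to bound the number of history rows and gradient histograms pushed through the socket to the internal wandb service process, which is what was being overwhelmed here.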
