🐛[bug] TensorBoard event file upload failed.

Describe the bug

After upgrading to 0.19.4, TensorBoard event files fail to upload to the S3 checkpoint bucket after several epochs. The error log is attached in the Screenshot section.

Reproduction Steps

  1. Upgrade determined from 0.18.4 to 0.19.4.
  2. Save images and scalars with TorchWriter:

     from determined.tensorboard.metric_writers.pytorch import TorchWriter

     logger = TorchWriter()

     # Write to the TorchWriter every 200 iterations.
     logger.add_image(name, image, global_step)
     logger.add_scalar(name, value, global_step)

Expected Behavior

No error.

Screenshot

[2022-09-29 16:42:27] [c193c38a] [rank=0] Traceback (most recent call last):
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return _run_code(code, main_globals, None,
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2022-09-29 16:42:27] [c193c38a] [rank=0]     exec(code, run_globals)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 132, in <module>
[2022-09-29 16:42:27] [c193c38a] [rank=0]     sys.exit(main(args.train_entrypoint))
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 123, in main
[2022-09-29 16:42:27] [c193c38a] [rank=0]     controller.run()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 274, in run
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self._run()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 342, in _run
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.upload_tb_files()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/_trial_controller.py", line 117, in upload_tb_files
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.context._core.train.upload_tensorboard_files(
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 127, in upload_tensorboard_files
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self._tensorboard_manager.sync(selector, mangler, self._distributed.rank)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/util.py", line 80, in wrapped
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return fn(*arg, **kwarg)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/tensorboard/s3.py", line 64, in sync
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.client.upload_file(str(path), self.bucket, key_name)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/site-packages/boto3/s3/inject.py", line 131, in upload_file
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return transfer.upload_file(
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/site-packages/boto3/s3/transfer.py", line 293, in upload_file
[2022-09-29 16:42:27] [c193c38a] [rank=0]     raise S3UploadFailedError(
[2022-09-29 16:42:27] [c193c38a] [rank=0] boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tensorboard-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6-0/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0 to ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0: An error occurred (NoSuchUpload) when calling the UploadPart operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
[2022-09-29 16:42:29] [c193c38a] Process 0 exit with status code 1.
[2022-09-29 16:42:29] [c193c38a] Terminating remaining workers after failure of Process 0.
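
The last line of the traceback above follows the usual botocore client-error wording, "An error occurred (<code>) when calling the <operation> operation: <message>". When triaging many trial logs it can help to pull out the error code and the operation mechanically. The following is a minimal sketch using only the standard library; `parse_s3_error` and `S3ErrorInfo` are hypothetical names introduced for illustration, not part of boto3 or Determined, and the sample log line uses placeholders instead of the real paths.

```python
import re
from typing import NamedTuple, Optional

# Wording used by botocore client errors, as seen in the traceback:
#   "An error occurred (<Code>) when calling the <Operation> operation: <Message>"
# An optional parenthesised retry note may appear right after "operation".
_CLIENT_ERROR = re.compile(
    r"An error occurred \((?P<code>[^)]+)\) when calling the "
    r"(?P<operation>\w+) operation(?: \([^)]*\))?: (?P<message>.*)",
    re.DOTALL,
)


class S3ErrorInfo(NamedTuple):
    """Hypothetical container for the parsed fields (illustration only)."""
    code: str
    operation: str
    message: str


def parse_s3_error(text: str) -> Optional[S3ErrorInfo]:
    """Return the error code, operation and message, or None if not found."""
    match = _CLIENT_ERROR.search(text)
    if match is None:
        return None
    return S3ErrorInfo(
        match["code"], match["operation"], match["message"].strip()
    )


# Shortened sample; <local file> and <bucket key> are placeholders.
log_line = (
    "boto3.exceptions.S3UploadFailedError: Failed to upload <local file> "
    "to <bucket key>: An error occurred (NoSuchUpload) when calling the "
    "UploadPart operation: The specified multipart upload does not exist."
)
info = parse_s3_error(log_line)
print(info.code, info.operation)
```

Here the code is `NoSuchUpload` and the operation is `UploadPart`, meaning the failure happened partway through a multipart upload and not when the upload was started.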

Environment

  • Device or hardware: Nvidia A100 * 10
  • Environment: Kubernetes

Additional Context

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 14 (8 by maintainers)

Top GitHub Comments

1 reaction
mpkouznetsov commented, Nov 19, 2022

@sijin-dm, thank you for reporting this. Yes, the partial fix is on master and was released this week (it closes the instance of TorchWriter used by our Trial API to report metrics). However, we cannot close the instances of TorchWriter that you create yourself, so the workaround I suggested is the only solution for now.

We will revisit our TensorBoard support in the near future but have not scheduled this work yet.
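
The workaround referred to in this comment is not quoted in this excerpt, so the sketch below is not that workaround. It only illustrates the general idea the comment touches on: a writer that user code creates should also be closed by user code. `StandInWriter` and `train` are hypothetical stand-ins written for this example; whether the real TorchWriter exposes a `close()` method is an assumption that is not confirmed here.

```python
from contextlib import closing


class StandInWriter:
    """Hypothetical stand-in for a TensorBoard writer (not the Determined API)."""

    def __init__(self):
        self.events = []
        self.closed = False

    def add_scalar(self, name, value, global_step):
        if self.closed:
            raise RuntimeError("writer is closed")
        self.events.append((name, value, global_step))

    def close(self):
        # A real writer would flush and release its event file here.
        self.closed = True


def train(num_steps, log_every=200):
    """Log a scalar every `log_every` steps and always close the writer."""
    # closing() calls writer.close() on exit, even if the loop raises.
    with closing(StandInWriter()) as writer:
        for step in range(1, num_steps + 1):
            if step % log_every == 0:
                writer.add_scalar("loss", 1.0 / step, step)
    return writer


writer = train(1000)
print(writer.closed, len(writer.events))
```

Using `contextlib.closing` keeps the open and close of the writer in one place, so an exception in the training loop does not leave the writer open.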

1 reaction
sijin-dm commented, Nov 8, 2022

Did you get a chance to try the workaround?

We started trying it about 3 days ago, and so far so good. By the way, the same error had been occurring intermittently in every experiment (different tasks and different repos). We now use this workaround for every experiment, even those that may not use TorchWriter. We want to test it for longer, and I will post the results here in one or two weeks if everything goes well.
