
Large files get truncated on dvc push to HTTP remote


Bug Report

Description

When using an HTTP or HTTPS remote (e.g. Artifactory) and dvc push-ing a large file (one that takes more than a minute to upload), the uploaded file is silently truncated.

Reproduce

I believe this affects all HTTP remotes, but I encountered it with an Artifactory server, so I’ll describe that setup.

Create a dvc-tracked repository with the following .dvc/config:

[core]
    remote = artifactory
    analytics = false
    check_update = false
['remote "artifactory"']
    url = https://<artifactory host>/artifactory/datasets/
    method = PUT
    ask_password = true
    auth = basic

Create a large file large_file (large enough that it will take more than 1 minute to upload to the remote). Then do

dvc add large_file
dvc push large_file.dvc

What happens: dvc push does not report an error, but the uploaded version of the file is truncated. When somebody else tries to download it with dvc pull, they get a truncated version of the file.

Expected

Either the file should be uploaded correctly or, at the very least, dvc should report an error when pushing.

Environment information

Output of dvc doctor:

Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.8),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.8)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: https
Workspace directory: NTFS on D:\
Repo: dvc, git

More details:

After seeing this error, I modified my dvc install slightly to use aiohttp’s tracing functionality and logged everything that happened during the dvc push large_file.dvc command, on a 2.2 GB large_file. Here is the log (slightly anonymized):

aiohttp_log_anonymized.txt
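
For context, this kind of call log can be produced with aiohttp’s TraceConfig hooks. Here is a minimal sketch of the approach; the hook selection and log format are my own assumptions, not necessarily what produced the log above:

import aiohttp
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("aiohttp.trace")

async def on_request_chunk_sent(session, ctx, params):
    # Fired once per body chunk written to the socket.
    log.info("chunk sent: %d bytes", len(params.chunk))

async def on_request_exception(session, ctx, params):
    # Fired when a request fails, e.g. with ServerTimeoutError.
    log.info("request exception: %r", params.exception)

trace_config = aiohttp.TraceConfig()
trace_config.on_request_chunk_sent.append(on_request_chunk_sent)
trace_config.on_request_exception.append(on_request_exception)
# Pass trace_configs=[trace_config] when constructing the ClientSession
# that performs the upload.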

What happens is:

  • The file is broken into chunks of 5 MiB and uploaded using Transfer-Encoding: chunked.
  • This all goes well until one minute in, when a ServerTimeoutError('Timeout on reading data from socket') is raised.
  • After this happens, a retry is triggered and the transfer is restarted.
  • BUT: the already-transferred chunks are not included in the retry – it just continues with the chunks that haven’t been transferred yet (see the sketch after this list).
  • The result is that the first batch of chunks is lost, and the uploaded file is truncated.
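
The truncation can be reproduced without any networking: any retry loop that reuses a one-shot chunk generator will resume mid-stream on the next attempt. A minimal sketch of the failure mode (all names here are hypothetical, for illustration only):

import io

def chunk_reader(buf, chunk_size=4):
    # One-shot generator over the "file", like the 5 MiB chunker above.
    while chunk := buf.read(chunk_size):
        yield chunk

def fake_upload(chunks, fail_at=2):
    # Consumes some chunks, then fails -- stands in for the socket timeout.
    sent = []
    for i, chunk in enumerate(chunks):
        if i == fail_at:
            raise TimeoutError("Timeout on reading data from socket")
        sent.append(chunk)
    return sent

def send_with_retry(chunks, retries=2):
    # Naive retry loop that reuses the same generator on each attempt.
    for _ in range(retries):
        try:
            return fake_upload(chunks)
        except TimeoutError:
            continue  # BUG: the generator keeps its position, so the
                      # chunks consumed by the failed attempt are lost
    raise RuntimeError("out of retries")

data = io.BytesIO(b"0123456789ABCDEF")
print(send_with_retry(chunk_reader(data)))
# Prints [b'89AB', b'CDEF'] -- the first two chunks never make it, and the
# "successful" upload is truncated.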

I believe the timeout behavior comes from this line in dvc_objects. If I change that to sock_read=None, then the ServerTimeoutError doesn’t happen, and everything works.

That line was changed to the current behavior in this pull request.

I think the problem is that aiohttp’s sock_read timeout timer starts ticking at the beginning of the request, which means that it will trigger when you try to upload large files.
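
To make the mechanism concrete, here is roughly what such a timeout configuration looks like in aiohttp. I don’t know the exact dvc_objects code, so the values below are illustrative:

import aiohttp

# sock_read=60 makes aiohttp raise ServerTimeoutError if nothing has been
# read from the socket for 60 seconds. During a long chunked upload the
# client is almost exclusively writing, so the read timer can expire
# before the server ever sends its response.
strict_timeout = aiohttp.ClientTimeout(total=None, sock_connect=30, sock_read=60)

# The workaround described above: disable the per-read timeout entirely.
relaxed_timeout = aiohttp.ClientTimeout(total=None, sock_connect=30, sock_read=None)

async def upload(url, data):
    async with aiohttp.ClientSession(timeout=relaxed_timeout) as session:
        async with session.put(url, data=data) as resp:
            resp.raise_for_status()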

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

4 reactions
guysmoilov commented, Aug 11, 2022

I narrowed it down further: the problem is the aiohttp_retry RetryClient, which gets passed a data-chunk generator in DVC’s implementation of HTTPFileSystem. The generator doesn’t seek back to the start of the file on a request retry, and retries should really be unsupported entirely when the data isn’t seekable.

I think we have to seek in the file on retries using TraceConfig: https://docs.aiohttp.org/en/stable/tracing_reference.html

But it still feels dangerous, since this doesn’t seem like something that was well thought out in the retry client – it’s just a workaround.
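
For illustration, a minimal sketch of that seek-on-retry idea, with the file handle captured in a closure. This is not DVC’s actual fix, and it assumes RetryClient forwards trace_configs to the underlying ClientSession:

import aiohttp
from aiohttp_retry import ExponentialRetry, RetryClient

async def push(url, path):
    with open(path, "rb") as f:
        trace_config = aiohttp.TraceConfig()

        async def on_request_start(session, ctx, params):
            # aiohttp fires this once per attempt, so every retry
            # resends the body from byte 0 instead of mid-file.
            f.seek(0)

        trace_config.on_request_start.append(on_request_start)

        async with RetryClient(
            retry_options=ExponentialRetry(attempts=3),
            trace_configs=[trace_config],
        ) as client:
            async with client.put(url, data=f) as resp:
                resp.raise_for_status()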

2 reactions
guysmoilov commented, Aug 11, 2022

I’m working on debugging the client in DVC to find the root cause and fix it ASAP, but no luck so far.
