`dvc push` fails with "Timeout on reading data from socket" with HTTP storage
Bug Report
Description
When pushing a moderately large amount of data (e.g. 1 GB) to a GitLab generic packages repository (via the HTTP remote storage type), the upload fails after a few minutes (e.g. 5) with the following timeout error:
$ dvc push -v
2022-07-28 17:33:07,919 DEBUG: Preparing to transfer data from '/home/user/dev/dvc-test/.dvc/cache' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Preparing to collect status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Collecting status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,920 DEBUG: Querying 2 hashes via object_exists
2022-07-28 17:33:08,249 DEBUG: Preparing to collect status from '/home/user/dev/dvc-test/.dvc/cache'
2022-07-28 17:33:08,250 DEBUG: Collecting status from '/home/user/dev/dvc-test/.dvc/cache'
2022-07-28 17:33:08,331 DEBUG: Uploading '/home/user/dev/dvc-test/.dvc/cache/1c/90d4b78d043e2a58e0746231c9024f' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/1c/90d4b78d043e2a58e0746231c9024f'
2022-07-28 17:38:09,183 ERROR: failed to transfer 'md5: 1c90d4b78d043e2a58e0746231c9024f' - : Timeout on reading data from socket
------------------------------------------------------------
Traceback (most recent call last):
File "<site-packages>/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "<site-packages>/fsspec/implementations/http.py", line 294, in _put_file
async with meth(rpath, data=gen_chunks(), **kw) as resp:
File "<site-packages>/aiohttp_retry/client.py", line 98, in __aenter__
return await self._do_request()
File "<site-packages>/aiohttp_retry/client.py", line 86, in _do_request
raise e
File "<site-packages>/aiohttp_retry/client.py", line 71, in _do_request
response: ClientResponse = await self._request(
File "<site-packages>/aiohttp/client.py", line 559, in _request
await resp.start(conn)
File "<site-packages>/aiohttp/client_reqrep.py", line 898, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "<site-packages>/aiohttp/streams.py", line 616, in read
await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<site-packages>/dvc/data/transfer.py", line 25, in wrapper
func(fs_path, *args, **kwargs)
File "<site-packages>/dvc/data/transfer.py", line 162, in func
return dest.add(
File "<site-packages>/dvc/objects/db.py", line 126, in add
self._add_file(
File "<site-packages>/dvc/objects/db.py", line 98, in _add_file
return fs.utils.transfer(
File "<site-packages>/dvc/fs/utils.py", line 96, in transfer
_try_links(links, from_fs, from_path, to_fs, to_path)
File "<site-packages>/dvc/fs/utils.py", line 66, in _try_links
return _copy(from_fs, from_path, to_fs, to_path)
File "<site-packages>/dvc/fs/utils.py", line 44, in _copy
return to_fs.upload(from_path, to_path)
File "<site-packages>/dvc/fs/base.py", line 386, in upload
return self.put_file(from_info, to_info, callback=cb, size=size)
File "<site-packages>/dvc/fs/http.py", line 151, in put_file
super().put_file(
File "<site-packages>/dvc/fs/base.py", line 329, in put_file
self.fs.put_file(
File "<site-packages>/fsspec/asyn.py", line 85, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "<site-packages>/fsspec/asyn.py", line 63, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
------------------------------------------------------------
2022-07-28 17:38:09,197 ERROR: failed to push data to the cloud - 1 files failed to upload
------------------------------------------------------------
Traceback (most recent call last):
File "<site-packages>/dvc/repo/push.py", line 68, in push
pushed += self.cloud.push(
File "<site-packages>/dvc/data_cloud.py", line 85, in push
return transfer(
File "<site-packages>/dvc/data/transfer.py", line 176, in transfer
_do_transfer(
File "<site-packages>/dvc/data/transfer.py", line 116, in _do_transfer
raise FileTransferError(total_fails)
dvc.exceptions.FileTransferError: 1 files failed to transfer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<site-packages>/dvc/commands/data_sync.py", line 58, in run
processed_files_count = self.repo.push(
File "<site-packages>/dvc/repo/__init__.py", line 48, in wrapper
return f(repo, *args, **kwargs)
File "<site-packages>/dvc/repo/push.py", line 72, in push
raise UploadError(exc.amount)
dvc.exceptions.UploadError: 1 files failed to upload
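The time between the start of the upload and the failure can be read off the log timestamps above. The following is a minimal sketch of that arithmetic; the two timestamps are copied from the `Uploading` and `ERROR: failed to transfer` lines, while the constant and function names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the "Uploading ..." and "ERROR: failed to transfer" log lines.
UPLOAD_STARTED = "2022-07-28 17:33:08,331"
UPLOAD_FAILED = "2022-07-28 17:38:09,183"

# DVC's verbose log uses a comma before the milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end, fmt=LOG_TIME_FORMAT):
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


print(f"{elapsed_seconds(UPLOAD_STARTED, UPLOAD_FAILED):.3f} s")
```

The result is just over 300 seconds, so the "e.g. 5 minutes" in the description is almost exactly five minutes. That regularity suggests a fixed timeout budget rather than a random network drop, although the report itself does not confirm this.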
Reproduce
- Create a new project on gitlab.com and take note of the project ID.
- Create a deploy token under Settings > Repository > Deploy tokens with scope `write_package_registry`.
- Clone the project from gitlab.com locally and change the current working directory to the cloned project.
- Initialize DVC and configure the remote storage:
  dvc init
  dvc remote add -d gitlab https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data
  dvc remote modify gitlab method PUT
  dvc remote modify gitlab auth custom
  dvc remote modify --local gitlab custom_auth_header 'DEPLOY-TOKEN'
  dvc remote modify --local gitlab password <deploy_token>
- Create some data and track it with DVC:
  mkdir data
  head -c 1G /dev/urandom > data/data.txt
  dvc add data/data.txt
- Push the data to the DVC remote storage:
  dvc push -v
Expected
Successful upload of the data.
When I upload the same data to the GitLab generic packages repository via cURL, it works fine:
curl \
--header "DEPLOY-TOKEN: <deploy_token>" \
--upload-file data/data.txt \
https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt
Environment information
Output of `dvc doctor`:
$ dvc doctor
DVC version: 2.10.2 (snap)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-48-generic-x86_64-with-glibc2.29
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2022.3.0),
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
ssh (sshfs = 2022.3.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https, https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Additional Information (if any):
n/a
Issue Analytics
- State:
- Created a year ago
- Reactions: 2
- Comments: 28 (2 by maintainers)
Top GitHub Comments
We should probably get to implementing the configuration of timeout.
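To illustrate why a configurable read timeout matters here: the traceback shows the client waiting for the server's response when the timeout fires. The sketch below reproduces that situation with plain standard-library sockets. It is not DVC or aiohttp code; `wait_for_response`, its parameters, and the `b"done"` reply are hypothetical names chosen for the illustration.

```python
import socket
import threading


def wait_for_response(read_timeout, server_delay):
    """Return the reply bytes, or None if nothing arrives within read_timeout."""
    client, server = socket.socketpair()
    # Hypothetical "server" that replies only after server_delay seconds,
    # standing in for a server that answers once a long upload has finished.
    timer = threading.Timer(server_delay, server.sendall, args=(b"done",))
    timer.start()
    try:
        # The read timeout limits how long recv() may block without data.
        client.settimeout(read_timeout)
        return client.recv(1024)
    except socket.timeout:
        # No bytes arrived in time: the analogue of
        # "Timeout on reading data from socket".
        return None
    finally:
        # Let the delayed reply fire before closing both ends.
        timer.join()
        client.close()
        server.close()


# A fixed, short timeout fails; a longer (configurable) one succeeds.
print(wait_for_response(read_timeout=0.05, server_delay=0.3))
print(wait_for_response(read_timeout=2.0, server_delay=0.05))
```

With a hard-coded timeout the first case cannot be avoided no matter how healthy the connection is, which is the argument for exposing the value as a remote setting.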
Original report from @michuhu: https://discord.com/channels/485586884165107732/563406153334128681/1035466365659779093
DVC over http works like a charm, but only for small files. When I try to push a bigger file (say 20MB), I get "Timeout on reading data from socket". I can see there was an issue on GitHub and it should be fixed for `dvc-objects==0.1.7`; however, I have `dvc_objects = 0.7.0` and the problem still occurs.