git: timeout when cloning a large git repo
Bug Report
Issue name
dvc.api.read: RuntimeError when reading file from a large repo
Description
I am getting the following exception, when trying to read a file from the large repo:
File "/Users/radion/ANNA/anna-evidence-doc-classifier/dashboard/onboarding_model.py", line 120, in get_train_data
dvc_buffer = dvc.api.read(
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/api.py", line 88, in read
with open(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/api.py", line 75, in _open
with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 32, in external_repo
path = _cached_clone(url, rev, for_write=for_write)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 152, in _cached_clone
clone_path, shallow = _clone_default_branch(url, rev, for_write=for_write)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/flow.py", line 274, in wrap_with
return call()
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 216, in _clone_default_branch
git = Git.clone(url, clone_path)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/scm/git/__init__.py", line 117, in clone
backend.clone(url, to_path, **kwargs)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/scm/git/backend/gitpython.py", line 181, in clone
tmp_repo = clone_from()
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/repo/base.py", line 1148, in clone_from
return cls._clone(git, url, to_path, GitCmdObjectDB, progress, multi_options, **kwargs)
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/repo/base.py", line 1078, in _clone
handle_process_output(proc, None, to_progress_instance(progress).new_message_handler(),
File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/cmd.py", line 151, in handle_process_output
raise RuntimeError(f"Thread join() timed out in cmd.handle_process_output(). Timeout={timeout} seconds")
RuntimeError: Thread join() timed out in cmd.handle_process_output(). Timeout=10.0 seconds
I had done something similar before with a much slimmer repo and there were no problems. It looks like the function exceeds the 10-second timeout during the cloning stage.
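For readers unfamiliar with this failure mode, here is a minimal sketch of the general pattern that the last traceback frame suggests: a helper thread is joined with a fixed timeout, and a RuntimeError is raised if the thread is still running afterwards. This is not GitPython's actual code; the names `join_or_raise` and `run_with_join_timeout` are hypothetical, and `time.sleep` stands in for the thread that drains the output of a slow clone.

```python
import threading
import time


def join_or_raise(thread, timeout):
    """Wait up to `timeout` seconds for `thread`; raise if it is still running."""
    thread.join(timeout=timeout)
    if thread.is_alive():
        # join() itself never raises on timeout, so the caller has to check.
        raise RuntimeError(f"Thread join() timed out. Timeout={timeout} seconds")


def run_with_join_timeout(work_seconds, timeout):
    """Start a worker that takes `work_seconds` and wait at most `timeout` seconds."""
    # Hypothetical stand-in for a reader thread attached to a long-running process.
    worker = threading.Thread(target=time.sleep, args=(work_seconds,), daemon=True)
    worker.start()
    join_or_raise(worker, timeout)
    return "finished"
```

With this pattern, any operation that outlasts the fixed timeout fails even though the underlying work would have completed, which would be consistent with a small repo working and a large one failing.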
Is it possible to customize the timeout or avoid cloning the entire repo?
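One possible workaround, sketched here under stated assumptions rather than as a confirmed fix, is to make a shallow clone yourself with a timeout you control and then work from the local copy. The helper names `build_shallow_clone_cmd` and `shallow_clone` and the default timeout are hypothetical; only the standard library's `subprocess.run` and the standard `git clone` options `--depth`, `--single-branch`, and `--branch` are used.

```python
import subprocess


def build_shallow_clone_cmd(url, dest, branch=None, depth=1):
    """Build a `git clone` command that fetches only the most recent history."""
    cmd = ["git", "clone", "--depth", str(depth), "--single-branch"]
    if branch is not None:
        cmd += ["--branch", branch]
    cmd += [url, dest]
    return cmd


def shallow_clone(url, dest, branch=None, timeout=120.0):
    """Run the shallow clone with a caller-chosen timeout in seconds.

    Raises subprocess.TimeoutExpired if the clone outlasts `timeout`,
    or subprocess.CalledProcessError if git exits with an error.
    """
    subprocess.run(build_shallow_clone_cmd(url, dest, branch), check=True, timeout=timeout)
    return dest
```

Passing the resulting local path as the `repo` argument of `dvc.api.read` might then avoid DVC's own clone step, though that has not been verified here, and whether DVC handles a shallow clone correctly is an assumption.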
Environment information
Output of dvc doctor:
DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.6 on macOS-10.16-x86_64-i386-64bit
Supports:
gs (gcsfs = 2021.8.1),
hdfs (pyarrow = 5.0.0),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:6 (4 by maintainers)

@RadionBik Do you still use those? If not, have you considered cleaning those blobs out (e.g. with https://github.com/rtyley/bfg-repo-cleaner)? We definitely need to support timeout on our side, but just wondering what you plan on doing with that git repo.
@efiop here is the output of time cmd:
We used to store small datasets there, but that is no longer the case on the master branch (although the history is already polluted). As you can see, it takes about 40 seconds to clone the repo.