git: timeout when cloning a large git repo

Bug Report

Issue name

dvc.api.read: RuntimeError when reading file from a large repo

Description

I am getting the following exception when trying to read a file from a large repo:

 File "/Users/radion/ANNA/anna-evidence-doc-classifier/dashboard/onboarding_model.py", line 120, in get_train_data
    dvc_buffer = dvc.api.read(
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/api.py", line 88, in read
    with open(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/api.py", line 75, in _open
    with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 32, in external_repo
    path = _cached_clone(url, rev, for_write=for_write)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 152, in _cached_clone
    clone_path, shallow = _clone_default_branch(url, rev, for_write=for_write)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/flow.py", line 274, in wrap_with
    return call()
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/external_repo.py", line 216, in _clone_default_branch
    git = Git.clone(url, clone_path)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/scm/git/__init__.py", line 117, in clone
    backend.clone(url, to_path, **kwargs)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/dvc/scm/git/backend/gitpython.py", line 181, in clone
    tmp_repo = clone_from()
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/repo/base.py", line 1148, in clone_from
    return cls._clone(git, url, to_path, GitCmdObjectDB, progress, multi_options, **kwargs)
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/repo/base.py", line 1078, in _clone
    handle_process_output(proc, None, to_progress_instance(progress).new_message_handler(),
  File "/Users/radion/ANNA/anna-evidence-doc-classifier/venv/lib/python3.8/site-packages/git/cmd.py", line 151, in handle_process_output
    raise RuntimeError(f"Thread join() timed out in cmd.handle_process_output(). Timeout={timeout} seconds")
RuntimeError: Thread join() timed out in cmd.handle_process_output(). Timeout=10.0 seconds

I did something similar before with a much smaller repo and had no problems. It looks like the clone step exceeds the 10-second timeout in GitPython's cmd.handle_process_output().

Is it possible to customize the timeout or avoid cloning the entire repo?
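
One possible workaround, sketched here under assumptions rather than taken from the thread: clone the repo yourself with a shallow, single-branch git clone and a timeout you choose, then point DVC at the local copy. The helper names build_shallow_clone_cmd and shallow_clone and the 300-second default are hypothetical; the git flags (--depth, --single-branch, --branch) and subprocess.run(timeout=...) are standard.

```python
import subprocess
from typing import List, Optional


def build_shallow_clone_cmd(
    url: str, dest: str, branch: Optional[str] = None
) -> List[str]:
    """Build a `git clone` command that fetches only the tip commit of one branch."""
    cmd = ["git", "clone", "--depth", "1", "--single-branch"]
    if branch is not None:
        cmd += ["--branch", branch]
    cmd += [url, dest]
    return cmd


def shallow_clone(
    url: str, dest: str, branch: Optional[str] = None, timeout_s: float = 300.0
) -> None:
    """Run the clone with a caller-chosen timeout.

    Raises subprocess.TimeoutExpired if the clone takes longer than timeout_s,
    and subprocess.CalledProcessError if git exits with a non-zero status.
    """
    subprocess.run(
        build_shallow_clone_cmd(url, dest, branch),
        check=True,
        timeout=timeout_s,
    )
```

After shallow_clone(url, "local_copy") succeeds, passing repo="local_copy" to dvc.api.read should let DVC open the existing directory instead of cloning again. That last step is an assumption about DVC's handling of local paths and is worth verifying on your version.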

Environment information

Output of dvc doctor:

DVC version: 2.7.3 (pip)
---------------------------------
Platform: Python 3.8.6 on macOS-10.16-x86_64-i386-64bit
Supports:
        gs (gcsfs = 2021.8.1),
        hdfs (pyarrow = 5.0.0),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

efiop commented, Sep 22, 2021 (1 reaction)

@RadionBik Do you still use those? If not, have you considered cleaning those blobs out (e.g. with https://github.com/rtyley/bfg-repo-cleaner)? We definitely need to support timeout on our side, but just wondering what you plan on doing with that git repo.

RadionBik commented, Sep 22, 2021 (1 reaction)

@efiop here is the output of the time command:

time git clone git@some_repo.git
Cloning into 'some_repo'...
remote: Enumerating objects: 6402, done.
remote: Counting objects: 100% (2121/2121), done.
remote: Compressing objects: 100% (782/782), done.
remote: Total 6402 (delta 1509), reused 1795 (delta 1328), pack-reused 4281
Receiving objects: 100% (6402/6402), 209.00 MiB | 6.11 MiB/s, done.
Resolving deltas: 100% (3840/3840), done.
git clone git@github.com:some_repo.git  9.02s user 2.50s system 28% cpu 41.114 total

We used to store small datasets in the repo. The master branch no longer contains them, but the history is already polluted. As you can see, it takes about 40 seconds to clone the repo.
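
As a rough sanity check on those numbers (a sketch, assuming the 10-second limit from the RuntimeError covers the whole clone), the download alone already overruns the timeout:

```python
# Figures copied from the `time git clone` output and the RuntimeError above.
TIMEOUT_S = 10.0          # "Timeout=10.0 seconds" in the traceback
RECEIVED_MIB = 209.00     # "Receiving objects: ... 209.00 MiB"
RATE_MIB_PER_S = 6.11     # "... | 6.11 MiB/s"
WALL_CLOCK_S = 41.114     # "41.114 total"


def transfer_seconds(size_mib: float, rate_mib_per_s: float) -> float:
    """Estimate the download time from the pack size and the average rate."""
    return size_mib / rate_mib_per_s


estimated_s = transfer_seconds(RECEIVED_MIB, RATE_MIB_PER_S)
print(f"estimated transfer time: {estimated_s:.1f} s")
print(f"measured wall clock is {WALL_CLOCK_S / TIMEOUT_S:.1f}x the timeout")
```

The estimate comes out at roughly 34 seconds for the transfer, so a 10-second limit cannot be met at this connection speed unless the repo is shrunk or only part of its history is fetched.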
