
`exp show`: lockless issues


Bug Report

Description

After the recent changes to make exp show lockless I am seeing intermittent issues with the data returned during experiment runs.

The issues that I have seen so far are as follows:

Running checkpoint experiments from the queue

exp show fails when running experiments from the queue with dulwich.porcelain.DivergedBranches errors.
@pmrowla tried to patch the above but the change led to this behaviour.

Please take a look at the above behaviour and LMK what you think. I do not anticipate there being an easy fix. IMO we should consider “dropping support” for live tracking experiments run from the queue until the DVC mechanics have been updated.

Running checkpoint experiment in the workspace

exp show returns a single dict for a set of checkpoints during an experiment. This happens intermittently and breaks our “live experiment” tracking.
exp show shows running experiments as not running mid-run.

Reproduce

Run a checkpoint experiment and monitor the output of exp show.
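One way to monitor the output is to poll the command on an interval and record each result, so that intermittent failures are captured. A minimal sketch, assuming the JSON flag is `--show-json` in this DVC version (check `dvc exp show --help`); the helper name `poll_json` and its parameters are hypothetical and not part of DVC:

```python
import json
import subprocess
import time


def poll_json(cmd, interval=1.0, count=5):
    """Run `cmd` `count` times, `interval` seconds apart.

    Returns a list of (ok, payload) tuples: payload is the parsed JSON on
    success, or the error text when the command fails or prints non-JSON.
    """
    results = []
    for i in range(count):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            results.append((False, proc.stderr.strip()))
        else:
            try:
                results.append((True, json.loads(proc.stdout)))
            except json.JSONDecodeError as exc:
                results.append((False, f"invalid JSON: {exc}"))
        if i < count - 1:
            time.sleep(interval)
    return results


# Example call (assumption: run inside a DVC repo while an experiment runs):
#   poll_json(["dvc", "exp", "show", "--show-json"], interval=2.0, count=30)
```

Comparing consecutive payloads should make it easier to spot the runs where a set of checkpoints collapses into a single dict or a running experiment is reported as not running.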

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

Additional Information (if any):

I’ll continue to add issues here as I find them. I have discussed with @pmrowla already. Wanted to raise the visibility by raising an issue to discuss possible mitigation and the priority for fixes.

Thanks

Issue Analytics

Created a year ago
Reactions: 2
Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction

karajan1001 commented, Nov 21, 2022

The problem occurs when we gather experiments while the executor is creating new commits at the same time. In that case the checkpoint top/head returned by get_running_exps, get_ref and branch_revs is not the same commit, which causes the problem described above.
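The race can be illustrated with a toy model. This is only a sketch, not DVC's code: `FakeExecutor`, `gather_unsafe` and `gather_snapshot` are hypothetical names, and the executor is simulated by adding a commit after every read.

```python
def gather_unsafe(read_ref):
    """Read the same ref twice, as separate lookups would during gathering."""
    head_for_status = read_ref()   # e.g. used to decide "is it running?"
    head_for_history = read_ref()  # e.g. used to walk the checkpoints
    return head_for_status, head_for_history


def gather_snapshot(read_ref):
    """Read the ref once and reuse that value for every later step."""
    head = read_ref()
    return head, head


class FakeExecutor:
    """Toy stand-in for an executor that commits between every read."""

    def __init__(self):
        self.commits = ["c0"]

    def read_ref(self):
        head = self.commits[-1]
        # Simulate the executor creating a new checkpoint right after the read.
        self.commits.append(f"c{len(self.commits)}")
        return head
```

With `gather_unsafe` the two values differ because a commit lands between the reads. With `gather_snapshot` they agree, at the cost of possibly showing a slightly stale head.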

In addition, I found that if we kill the process in the middle of this operation, there is a chance of ending up in a corrupted Git state, which means the dulwich fetch/commit operation is not atomic. If that happens, it causes the following:

If we kill dvc exp run, EXEC_CHECKPOINT might get corrupted, which makes the resume operation fail. (We can manually delete .git/refs/exps/exec/EXEC_CHECKPOINT to recover.)
If we kill dvc exp run --temp, that experiment branch gets corrupted and is not shown in the exp show table.
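The manual recovery for the first case could be scripted roughly as follows. This is a sketch that assumes the ref is stored as a loose file at the path quoted above (it does not handle packed refs); `remove_exec_checkpoint` is a hypothetical helper, not a DVC command.

```python
from pathlib import Path

# Ref path taken from the comment above; assumed to be a loose ref file.
EXEC_CHECKPOINT = Path(".git") / "refs" / "exps" / "exec" / "EXEC_CHECKPOINT"


def remove_exec_checkpoint(repo_root="."):
    """Delete the loose EXEC_CHECKPOINT ref file if present.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    ref_file = Path(repo_root) / EXEC_CHECKPOINT
    if ref_file.is_file():
        ref_file.unlink()
        return True
    return False
```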

1 reaction

karajan1001 commented, Nov 4, 2022

Regarding this problem:

Only a couple of checkpoints are shown for each experiment until they are both finished and all are then copied to the workspace.

I did some investigation into this problem yesterday. The refs downloaded by this code

https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L658-L668

        try:
            fetch_result = client.fetch(
                path,
                self.repo,
                progress=DulwichProgressReporter(progress)
                if progress
                else None,
                determine_wants=determine_wants,
            )
        except NotGitRepository as exc:
            raise SCMError(f"Git failed to fetch ref from '{url}'") from exc

will cause

https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L671-L681

        for (lh, rh, _) in fetch_refs:
            refname = os.fsdecode(rh)
            if rh in self.repo.refs:
                if self.repo.refs[rh] == fetch_result.refs[lh]:
                    result[refname] = SyncStatus.UP_TO_DATE
                    continue
                try:
                    check_diverged(
                        self.repo, self.repo.refs[rh], fetch_result.refs[lh]
                    )

to raise DivergedBranches. The underlying problem is that the SHA value fetch_result.refs[lh] does not exist in the repo, which causes a KeyError. As a result every update (fetch) fails, and fetching can only succeed after training has finished. This might be because new commits arrive too fast in the temp workspace: if we add a time gap to the training, for example time.sleep(5) in each training epoch, the fetch succeeds and the training progress is shown live.
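To make the failure mode concrete, here is a toy model of a ref update that defers refs whose target commit is not yet in the local object store instead of failing the whole update. It uses plain sets and dicts rather than dulwich objects, and `update_refs` and its status strings are hypothetical; it is a sketch of the idea, not a patch for scmrepo.

```python
def update_refs(local_objects, local_refs, fetched_refs):
    """Apply fetched refs, deferring any whose target object is missing.

    local_objects: set of SHAs present in the local object store (toy model)
    local_refs:    dict of ref name -> SHA, updated in place
    fetched_refs:  dict of ref name -> SHA advertised by the other side
    Returns a dict of ref name -> "up_to_date" | "updated" | "deferred".
    """
    status = {}
    for name, sha in fetched_refs.items():
        if local_refs.get(name) == sha:
            status[name] = "up_to_date"
        elif sha not in local_objects:
            # The advertised commit has not arrived yet; retry on a later
            # poll instead of failing the whole update.
            status[name] = "deferred"
        else:
            local_refs[name] = sha
            status[name] = "updated"
    return status
```

A deferred ref keeps its previous value, so a live view would lag by one poll rather than stall until training ends.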

So if we want to solve this problem completely, we might need to dig into dulwich's fetch process and make sure the revision it returns has also been properly downloaded and exists. And if I use a force update here,

https://github.com/iterative/dvc/blob/43a8eab2e053b072b93d7b399ce0678cb00f138e/dvc/repo/experiments/queue/utils.py#L41-L42

it raises an Invalid commit exception.