`exp show`: lockless issues

Bug Report

Description

After the recent changes to make exp show lockless I am seeing intermittent issues with the data returned during experiment runs.

The issues that I have seen so far are as follows:

Running checkpoint experiments from the queue

  1. exp show fails when running experiments from the queue with dulwich.porcelain.DivergedBranches errors.
  2. @pmrowla tried to patch the above but the change led to this behaviour.

Please take a look at the above behaviour and LMK what you think. I do not anticipate there being an easy fix. IMO we should consider “dropping support” for live tracking experiments run from the queue until the DVC mechanics have been updated.

Running checkpoint experiment in the workspace

  1. exp show returns a single dict for a set of checkpoints during an experiment. This happens intermittently and breaks our “live experiment” tracking.
  2. exp show shows running experiments as not running mid-run.

Reproduce

Run a checkpoint experiment and monitor the output of exp show.

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git

Additional Information (if any):

I’ll continue to add issues here as I find them. I have already discussed this with @pmrowla, and I am opening this issue to raise visibility and to discuss possible mitigations and the priority of fixes.

Thanks

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
karajan1001 commented, Nov 21, 2022

The problem occurs when we gather experiments while the executor is creating new commits at the same time. The checkpoint tip/head returned by get_running_exps/get_ref/branch_revs is then not the same in each call, which causes the problem described above.
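
The race described above can be illustrated with a small sketch. It is not how DVC gathers experiments: `read_head` and `read_revs` are hypothetical callables standing in for the ref lookups, and the sketch assumes the head only changes when the executor creates a new commit. The idea is to re-read the head after collecting the revisions and to retry if it moved in between.

```python
from typing import Callable, List, Tuple


def read_consistent(
    read_head: Callable[[], str],
    read_revs: Callable[[], List[str]],
    max_attempts: int = 5,
) -> Tuple[str, List[str]]:
    """Return a (head, revs) pair that was read while the head did not move.

    Both callables are hypothetical stand-ins for the real ref lookups.
    """
    for _ in range(max_attempts):
        head_before = read_head()
        revs = read_revs()
        # If the head is unchanged, no new commit landed while we read revs.
        if read_head() == head_before:
            return head_before, revs
    raise RuntimeError("head kept moving; could not get a consistent view")
```

A caller would treat the `RuntimeError` as "try again on the next poll" rather than as a fatal error, since the executor may simply be committing faster than the reader can keep up.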

In addition, I found that if we kill the process in the middle of this operation, there is some chance of ending up in a corrupted Git state, which means that the dulwich fetch/commit operation is not atomic. When this happens, it has the following consequences:

  1. If we kill dvc exp run, our EXEC_CHECKPOINT may be corrupted, which makes the resume operation fail. (We can recover by manually deleting .git/refs/exps/exec/EXEC_CHECKPOINT.)
  2. If we kill dvc exp run --temp, that experiment's branch is corrupted and it will not be shown in the exp show table.
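
The manual recovery from item 1 can be scripted. The following is a minimal sketch, assuming the ref is stored as a loose file at the path quoted above (it does nothing for packed refs); the function name `remove_stale_exec_checkpoint` is made up for this example and is not a DVC API.

```python
from pathlib import Path

# Ref path relative to the repository root, as quoted in the comment above.
EXEC_CHECKPOINT = Path(".git") / "refs" / "exps" / "exec" / "EXEC_CHECKPOINT"


def remove_stale_exec_checkpoint(repo_root: str) -> bool:
    """Delete the EXEC_CHECKPOINT ref file if present.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    ref_path = Path(repo_root) / EXEC_CHECKPOINT
    if ref_path.is_file():
        ref_path.unlink()
        return True
    return False
```

This should only be run when no experiment is executing, because a running executor relies on the same ref.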
1 reaction
karajan1001 commented, Nov 4, 2022

Regarding this problem:

  1. Only a couple of checkpoints are shown for each experiment until they are both finished and all are then copied to the workspace.

I investigated this problem yesterday. The refs downloaded by

https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L658-L668

        try:
            fetch_result = client.fetch(
                path,
                self.repo,
                progress=DulwichProgressReporter(progress)
                if progress
                else None,
                determine_wants=determine_wants,
            )
        except NotGitRepository as exc:
            raise SCMError(f"Git failed to fetch ref from '{url}'") from exc

will cause

https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L671-L681

        for (lh, rh, _) in fetch_refs:
            refname = os.fsdecode(rh)
            if rh in self.repo.refs:
                if self.repo.refs[rh] == fetch_result.refs[lh]:
                    result[refname] = SyncStatus.UP_TO_DATE
                    continue
                try:
                    check_diverged(
                        self.repo, self.repo.refs[rh], fetch_result.refs[lh]
                    )

to raise DivergedBranches. The underlying cause of this error is that the SHA in fetch_result.refs[lh] does not exist in the repo, which raises a KeyError. As a result, every update (fetch) fails, and fetching only succeeds after the training process has finished. This may be because new commits arrive too fast in the temp workspace: if we add a time gap to the training process, for example time.sleep(5) in each training epoch, fetching succeeds and the training progress is shown live.
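
A guard for the missing-object case could look like the sketch below. It uses plain mappings and a set as stand-ins for self.repo.refs, fetch_result.refs and the repository's object store, and the names `classify_fetched_ref` and `FetchOutcome` are hypothetical; they are not part of scmrepo or dulwich. The sketch only shows the order of checks: compare SHAs first, then confirm that the fetched SHA is actually present before running any divergence check.

```python
from enum import Enum
from typing import Container, Mapping


class FetchOutcome(Enum):
    UP_TO_DATE = "up_to_date"
    READY_TO_CHECK = "ready_to_check"  # safe to run a divergence check
    MISSING_OBJECT = "missing_object"  # fetched SHA is not in the repo yet


def classify_fetched_ref(
    local_refs: Mapping[bytes, bytes],
    fetched_refs: Mapping[bytes, bytes],
    known_objects: Container[bytes],
    lh: bytes,
    rh: bytes,
) -> FetchOutcome:
    """Classify one fetched ref before any divergence check is attempted."""
    fetched_sha = fetched_refs[lh]
    if local_refs.get(rh) == fetched_sha:
        return FetchOutcome.UP_TO_DATE
    # Assumption: a SHA advertised by the fetch may not have been stored yet.
    if fetched_sha not in known_objects:
        return FetchOutcome.MISSING_OBJECT
    return FetchOutcome.READY_TO_CHECK
```

With a result like `MISSING_OBJECT`, a caller could skip the ref for this round and retry on the next poll instead of reporting diverged branches.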

So if we want to solve this problem completely, we might need to dig into dulwich’s fetch process and make sure that the returned revision has also been properly downloaded and exists. And if I use a force update here,

https://github.com/iterative/dvc/blob/43a8eab2e053b072b93d7b399ce0678cb00f138e/dvc/repo/experiments/queue/utils.py#L41-L42

it raises an Invalid commit exception.
