`exp show`: lockless issues
Bug Report
Description
After the recent changes to make `exp show` lockless, I am seeing intermittent issues with the data returned during experiment runs.
The issues that I have seen so far are as follows:
Running checkpoint experiments from the queue
- `exp show` fails when running experiments from the queue with `dulwich.porcelain.DivergedBranches` errors.
- @pmrowla tried to patch the above but the change led to this behaviour.
Please take a look at the above behaviour and LMK what you think. I do not anticipate there being an easy fix. IMO we should consider “dropping support” for live tracking experiments run from the queue until the DVC mechanics have been updated.
Running checkpoint experiment in the workspace
- `exp show` returns a single dict for a set of checkpoints during an experiment. This happens intermittently and breaks our “live experiment” tracking.
- `exp show` shows running experiments as not running mid-run.
Reproduce
Run a checkpoint experiment and monitor the output of `exp show`.
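One way to monitor the output during a run is a small polling loop that prints only when the output changes. This is a minimal sketch, not part of DVC: `run_exp_show` and `watch` are hypothetical helper names, and the `--show-json` flag is assumed to be available in the installed DVC version.

```python
import subprocess
import time
from typing import Callable, Iterator, Tuple


def run_exp_show() -> str:
    """Run `dvc exp show --show-json` in the current repo and return raw stdout.

    Assumption: the installed DVC version supports the `--show-json` flag.
    """
    result = subprocess.run(
        ["dvc", "exp", "show", "--show-json"],
        capture_output=True,
        text=True,
        check=False,  # a failing call is itself worth observing, so do not raise
    )
    return result.stdout


def watch(
    poll: Callable[[], str], polls: int, interval: float = 1.0
) -> Iterator[Tuple[int, str]]:
    """Call `poll` repeatedly and yield (poll_index, output) whenever it changes."""
    previous = None
    for index in range(polls):
        current = poll()
        if current != previous:
            yield index, current
            previous = current
        if index < polls - 1:
            time.sleep(interval)
```

Iterating over `watch(run_exp_show, polls=60)` in a second terminal while the checkpoint experiment runs gives a log of every change in the reported data, which makes intermittent drops easier to spot.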
Expected
Environment information
Output of `dvc doctor`:
$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.9.9 on macOS-12.3.1-x86_64-i386-64bit
Supports:
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc (subdir), git
Additional Information (if any):
I’ll continue to add issues here as I find them. I have discussed with @pmrowla already. Wanted to raise the visibility by raising an issue to discuss possible mitigation and the priority for fixes.
Thanks
Top GitHub Comments
The problem occurs when we gather experiments while, at the same time, the executor is creating new commits. This makes the checkpoint top/head returned by `get_running_exps`/`get_ref`/`branch_revs` differ from one another, which causes the problem described above.
Besides, I found that if we kill the process in the middle of this, there is some probability of ending up in a corrupted Git state, which means the `dulwich` fetch/commit operation is not atomic. If that happens, it will cause the following:
- With `dvc exp run`, our `EXEC_CHECKPOINT` might get corrupted, making the resume operation fail. (We can manually delete `.git/refs/exps/exec/EXEC_CHECKPOINT` to recover it.)
- With `dvc exp run --temp`, that experiment branch gets corrupted and will not be shown in the `exp show` table.
For the problem.
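The race between gathering experiments and the executor committing can be illustrated with a toy model. This is only a sketch under stated assumptions: `FakeExecutor`, `read_head_twice`, and `read_head_once` are hypothetical names, not DVC or scmrepo code, and the fake executor advances the checkpoint ref on every read to mimic commits landing between two lookups.

```python
from typing import Callable, Tuple

# Ref name taken from the recovery path mentioned in the comment.
EXEC_CHECKPOINT = "refs/exps/exec/EXEC_CHECKPOINT"


class FakeExecutor:
    """Stand-in for a running executor: each lookup sees one more checkpoint commit."""

    def __init__(self) -> None:
        self.commits = 0

    def get_ref(self, name: str) -> str:
        # Hypothetical behaviour: a new commit lands before every read.
        self.commits += 1
        return f"commit-{self.commits}"


def read_head_twice(get_ref: Callable[[str], str]) -> Tuple[str, str]:
    """Racy: two independent lookups of the same ref can disagree."""
    running_head = get_ref(EXEC_CHECKPOINT)  # e.g. the "is it running?" lookup
    branch_head = get_ref(EXEC_CHECKPOINT)   # e.g. the later branch walk
    return running_head, branch_head


def read_head_once(get_ref: Callable[[str], str]) -> Tuple[str, str]:
    """Pinned: resolve the ref once and reuse that value for every later step."""
    head = get_ref(EXEC_CHECKPOINT)
    return head, head


print(read_head_twice(FakeExecutor().get_ref))
print(read_head_once(FakeExecutor().get_ref))
```

Pinning the head once per gather keeps the collected data self-consistent, although it does not address the non-atomic fetch/commit problem described in the same comment.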
I did some investigation on this problem yesterday. The problem comes from the refs downloaded from
https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L658-L668
which cause
https://github.com/iterative/scmrepo/blob/d175b923c76494c9023ee3581b349191ea2c8a6f/src/scmrepo/git/backend/dulwich/__init__.py#L671-L681
to raise `DivergedBranches`. The problem behind this error is that the SHA value `fetch_result.refs[lh]` does not exist in the repo, which causes a `KeyError`. This makes every update (fetch) fail; fetching can succeed only after the training process has finished. This might be because new commits arrive too fast in the temp workspace: if we add a time gap to the training process, for example by adding `time.sleep(5)` in each training epoch, the fetch can succeed and the training can be shown live.
So if we want to solve this problem completely, we might need to dig deep into `dulwich`’s fetching process and make sure the revision returned is also properly downloaded and exists. And if I use a force update here,
https://github.com/iterative/dvc/blob/43a8eab2e053b072b93d7b399ce0678cb00f138e/dvc/repo/experiments/queue/utils.py#L41-L42
it raises an `Invalid commit` exception.
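The idea of making sure a fetched revision actually exists before a ref is moved to it can be sketched without any Git library. This is an illustration only: `apply_fetched_refs` is a hypothetical helper, and plain sets and dicts stand in for the object store and ref tables, so it does not reflect the dulwich or scmrepo APIs.

```python
from typing import Dict, Set, Tuple


def apply_fetched_refs(
    local_objects: Set[str],
    local_refs: Dict[str, str],
    fetched_refs: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (updated_refs, skipped).

    A ref is moved only when the SHA it should point to is already present
    locally; otherwise it is left untouched and reported in `skipped`, so the
    caller can retry on the next poll instead of failing the whole update.
    """
    updated = dict(local_refs)
    skipped: Dict[str, str] = {}
    for name, sha in fetched_refs.items():
        if sha in local_objects:
            updated[name] = sha
        else:
            skipped[name] = sha
    return updated, skipped


objects = {"aaa", "bbb"}
refs = {"refs/exps/x": "aaa"}
fetched = {"refs/exps/x": "ccc", "refs/exps/y": "bbb"}  # "ccc" was not downloaded
new_refs, missing = apply_fetched_refs(objects, refs, fetched)
print(new_refs)
print(missing)
```

Skipping refs whose commit is missing trades freshness for consistency: live tracking lags by one poll, but a ref never points at an object that is not in the repo.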