`dvc queue`: unexpected behaviour
Bug Report
Description
Whilst checking out the new `dvc queue` command I have run into some unexpected behaviour. I won’t duplicate the steps to reproduce here, but after queueing and running experiments I have run into two different issues:
- VS Code demo project: `dvc queue status` returning `ERROR: Invalid experiment '{entry.stash_rev[:7]}'.` (produced when running with the extension)
- `example-get-started`: `dvc queue status` returning

  ```
  Task     Name    Created     Status
  f3d69ee          02:17 PM    Success
  08ccb05          02:17 PM    Success
  ERROR: unexpected error - Extra data: line 1 column 56 (char 55)
  ```

  (produced without the extension involved).
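For what it's worth, both messages look like two familiar failure shapes. A minimal, self-contained sketch of the likely mechanics (none of this is DVC source; `Entry` is a hypothetical stand-in):

```python
import json

class Entry:  # hypothetical stand-in for DVC's queue entry
    stash_rev = "f3d69eedda6b1c051b115523cf5c6c210490d0ea"

entry = Entry()

# 1. A literal "{entry.stash_rev[:7]}" in user-facing output usually means
#    the message template was built without its f-prefix:
print("Invalid experiment '{entry.stash_rev[:7]}'.")   # braces printed verbatim
print(f"Invalid experiment '{entry.stash_rev[:7]}'.")  # Invalid experiment 'f3d69ee'.

# 2. "Extra data: line 1 column N (char N-1)" is what json.loads() raises
#    when a stream contains more than one JSON document, e.g. two workers
#    writing status records to the same file concurrently:
try:
    json.loads('{"rev": "f3d69ee"}{"rev": "08ccb05"}')
except json.JSONDecodeError as exc:
    print(exc)  # Extra data: line 1 column 19 (char 18)
```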
In both instances this resulted in the HEAD baseline entry being dropped from the `dvc exp show` data:
`example-get-started` example:

```
❯ dvc exp show --show-json
{
"workspace": {
"baseline": {
"data": {
"timestamp": null,
"params": {
"params.yaml": {
"data": {
"prepare": {
"split": 0.21,
"seed": 20170428
},
"featurize": {
"max_features": 200,
"ngrams": 2
},
"train": {
"seed": 20170428,
"n_est": 50,
"min_split": 0.01
}
}
}
},
"deps": {
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null
},
"src/prepare.py": {
"hash": "f09ea0c15980b43010257ccb9f0055e2",
"size": 1576,
"nfiles": null
},
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2
},
"src/featurization.py": {
"hash": "e0265fc22f056a4b86d85c3056bc2894",
"size": 2490,
"nfiles": null
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2
},
"src/train.py": {
"hash": "c3961d777cfbd7727f9fde4851896006",
"size": 967,
"nfiles": null
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null
},
"src/evaluate.py": {
"hash": "44e714021a65edf881b1716e791d7f59",
"size": 2346,
"nfiles": null
}
},
"outs": {
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null,
"use_cache": true,
"is_data_source": false
},
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null,
"use_cache": true,
"is_data_source": true
}
},
"queued": false,
"running": false,
"executor": null,
"metrics": {
"evaluation.json": {
"data": {
"avg_prec": 0.9249974999612706,
"roc_auc": 0.9460213440787918
}
}
}
}
}
},
"f3d69eedda6b1c051b115523cf5c6c210490d0ea": {
"baseline": {
"data": {
"timestamp": "2022-07-13T14:17:20",
"params": {
"params.yaml": {
"data": {
"prepare": {
"split": 0.21,
"seed": 20170428
},
"featurize": {
"max_features": 200,
"ngrams": 2
},
"train": {
"seed": 20170428,
"n_est": 50,
"min_split": 0.01
}
}
}
},
"deps": {
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null
},
"src/prepare.py": {
"hash": "f09ea0c15980b43010257ccb9f0055e2",
"size": 1576,
"nfiles": null
},
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2
},
"src/featurization.py": {
"hash": "e0265fc22f056a4b86d85c3056bc2894",
"size": 2490,
"nfiles": null
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2
},
"src/train.py": {
"hash": "c3961d777cfbd7727f9fde4851896006",
"size": 967,
"nfiles": null
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null
},
"src/evaluate.py": {
"hash": "44e714021a65edf881b1716e791d7f59",
"size": 2346,
"nfiles": null
}
},
"outs": {
"data/prepared": {
"hash": "153aad06d376b6595932470e459ef42a.dir",
"size": 8437363,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"data/features": {
"hash": "f35d4cc2c552ac959ae602162b8543f3.dir",
"size": 2232588,
"nfiles": 2,
"use_cache": true,
"is_data_source": false
},
"model.pkl": {
"hash": "46865edbf3d62fc5c039dd9d2b0567a4",
"size": 1763725,
"nfiles": null,
"use_cache": true,
"is_data_source": false
},
"data/data.xml": {
"hash": "22a1a2931c8370d3aeedd7183606fd7f",
"size": 14445097,
"nfiles": null,
"use_cache": true,
"is_data_source": true
}
},
"queued": false,
"running": false,
"executor": null,
"metrics": {
"evaluation.json": {
"data": {
"avg_prec": 0.9249974999612706,
"roc_auc": 0.9460213440787918
}
}
}
}
}
}
}
```
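For context on why the dropped HEAD entry matters downstream, a consumer such as the VS Code extension reads this table roughly like the following (illustrative only, not the extension's actual code):

```python
# Expects a top-level entry for "workspace" plus one per baseline commit,
# so a dropped HEAD entry surfaces at the lookup below.
import json
import subprocess

table = json.loads(
    subprocess.run(
        ["dvc", "exp", "show", "--show-json"],
        capture_output=True, text=True, check=True,
    ).stdout
)

head_rev = subprocess.run(
    ["git", "rev-parse", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# In the broken runs above this lookup fails for HEAD.
print("HEAD baseline present:", head_rev in table)
```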
Reproduce
- clone `example-get-started`
- add `git+https://github.com/iterative/dvc` to `src/requirements.txt`
- create a venv, source the activate script, and install requirements
- `dvc pull`
- change `params.yaml` and queue x2 with `dvc exp run --queue`
- `dvc queue start -j 2`
- `dvc exp show`
- `dvc queue status`
- `dvc exp show`
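Condensed, the same session looks roughly like this (the venv path and the `-S` param overrides are illustrative; the steps above edit `params.yaml` by hand between queuings):

```bash
git clone https://github.com/iterative/example-get-started
cd example-get-started
echo 'git+https://github.com/iterative/dvc' >> src/requirements.txt
python -m venv .venv && source .venv/bin/activate
pip install -r src/requirements.txt
dvc pull

dvc exp run --queue -S train.n_est=100   # queue experiment 1
dvc exp run --queue -S train.n_est=150   # queue experiment 2

dvc queue start -j 2
dvc exp show                             # while tasks are executing
dvc queue status
dvc exp show
```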
When recreating this I can see that both experiments were successful in `dvc queue status`, but the second one has not made it into the table. Final results:
```
❯ dvc queue status
Task     Name    Created     Status
9d22751          02:50 PM    Success
962c834          02:50 PM    Success

Worker status: 0 active, 0 idle
```
First column of `dvc exp show`:

```
workspace
bigrams-experiment
└── 65584bd [exp-c88e8]
```

and the SHAs don’t match the ones reported by `dvc queue status`?
Expected
It should be possible to run `dvc exp show` and `dvc queue status` in parallel with the execution of tasks from the queue.
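A quick way to exercise that expectation is to hammer the read-side commands while the workers execute. A sketch, assuming the status column reads `Running` for in-flight tasks:

```bash
# Poll the table and queue state while tasks are executing (illustrative).
dvc queue start -j 2
while dvc queue status | grep -q 'Running'; do
    dvc exp show --show-json > /dev/null || echo "exp show failed"
    dvc queue status > /dev/null         || echo "queue status failed"
    sleep 1
done
```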
Environment information
Output of `dvc doctor`:
```
$ dvc doctor
DVC version: 2.13.1.dev87+gc2668110
---------------------------------
Platform: Python 3.8.9 on macOS-12.2.1-arm64-arm-64bit
Supports:
	webhdfs (fsspec = 2022.5.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
```
Additional Information (if any):
Please let me know if you need anything else from me. Thank you.
Top GitHub Comments
Sounds like it is related to https://github.com/iterative/dvc-task/issues/73. I tried several times but didn’t hit this. I guess it is not related to experiments from old versions, but rather to 1. concurrency and 2. checkpoints. I can repair the error message `'{entry.stash_rev[:7]}'` first to see what `stash_rev` value it is.

Tl;dr - I can recreate the issue by using `dvc queue start -j 2`. As `j > 1` is currently experimental we can probably close this.

I can definitely recreate it. I just ran into it again:
When trying to clean up experiments after getting that warning:

This will be an issue in the extension because errors generate a popup that the user sees.
Deleting `.dvc/tmp/exps` gets rid of the error altogether.

Repro steps:

- 2.11.0
- 2.15.0
- `dvc queue start -j 2`
- `dvc exp show --show-json` almost immediately after starting the queue (as per extension)
- `dvc queue status` returns `ERROR: Invalid experiment '{entry.stash_rev[:7]}'.`
Even these repro steps are a bit hit or miss: out of 3 attempts I hit the error, along with a missing experiment, 2 times.
I can also recreate just by using steps 4-8 (no upgrade needed).
The error is probably caused by steps 5+6. As `j > 1` is a known issue we can probably close this.