dvc exp run: experiment metrics are not reported when metric files are on another device than training code
Bug Report
Issue name
dvc exp run runs but does not store metrics.
Description
I’m running my training script on /dev/mapper/system-home and it outputs data (model checkpoints, metrics) in /data/.cache, located on another partition (/dev/sdb1). /dev/sdb1 is a purposely large partition where we are supposed to store large files. Running dvc exp run works fine, but after completion dvc exp show does not show any metrics (nor does dvc metrics show).
When outputting metrics to a folder on the same partition as the training script (/dev/mapper/system-home), dvc exp show works perfectly and shows metrics.
When using verbose mode, I get the following errors:
2022-06-08 17:49:08,234 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
func(from_path, to_path)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
return self.fs.reflink(from_info, to_info)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
return System.reflink(path1, path2)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
System._reflink_linux(source, link_name)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
return _link(link, from_fs, from_path, to_fs, to_path)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
_try_links([link], from_fs, from_file, to_fs, to_file)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
raise OSError(
OSError: [Errno 95] no more link types left to try out
The full traceback can be found here: trace.Log
The "Invalid cross-device link" part seems to show that dvc cannot handle cross-device operations.
Reproduce
- Create a default project on partition /sda1/foo1 that trains and evaluates a model, writing metrics to another device, /sdb1/foo2
# train.py on /sda1/foo1
from dvclive import Live

live = Live("/data/metrics")  # /data mounted on /sdb1/foo2
for epoch in epochs:
    metrics = ...
    for metric_name, value in metrics.items():
        live.log(metric_name, value)
    live.next_step()
Example of /data/metrics.json:
{
  "step": 1,
  "loss": 0.7107148170471191,
  "directed_f1_weighed": 0.0,
  "undirected_f1_weighed": 0.0,
  "oriented_acc": 0.8346456692913385,
  "officical_f1_macro": 0.0
}
Example of /data/metrics/scalar/loss.tsv:
timestamp step loss
1654703111346 0 0.8031530231237411
1654703334339 1 0.7107148170471191
dvc exp show doesn’t show any metrics column
Expected
dvc metrics show actually shows metrics columns.
Environment information
Python 3.8.13
Description: Ubuntu 20.04.3 LTS
Release: 20.04
dvclive 0.8.2
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.13 on Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Supports:
hdfs (fsspec = 2022.5.0, pyarrow = 3.0.0),
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-home
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/system-home
Repo: dvc, git
Additional Information (if any):
I think the error comes from missing support for cross-device copying (see https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link). Do you have any ideas? Thanks for this nice piece of software 👍
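For context, errno 18 (EXDEV) is what the OS returns when a hardlink is attempted between two filesystems. A common general pattern is to catch that specific error and fall back to a plain copy. The sketch below illustrates that pattern only; link_or_copy is a hypothetical helper, not DVC's link logic and not a fix for this issue.

```python
import errno
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Hardlink src to dst, falling back to a copy across devices.

    Returns "hardlink" or "copy" to indicate which strategy was used.
    """
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError as exc:
        # errno.EXDEV is the "Invalid cross-device link" error (errno 18 on Linux).
        if exc.errno != errno.EXDEV:
            raise
    # Different device: copy file contents and metadata instead.
    shutil.copy2(src, dst)
    return "copy"
```

Any other OSError is re-raised on purpose, so that unrelated failures such as a missing source file are not silently masked by the fallback.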
Top GitHub Comments
Right, normally the output paths are validated in commands like dvc add or dvc stage add, and DVC will error out if the output is outside the repo. We probably need to add similar checks in dvclive. cc @daavoo
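As a rough illustration of the kind of check meant here, the sketch below tests whether an output path resolves to somewhere under the repository root. It is an assumption about what such a check could look like; is_inside_repo is a hypothetical helper and does not reflect the actual validation code in DVC or dvclive.

```python
from pathlib import Path


def is_inside_repo(output_path: str, repo_root: str) -> bool:
    """Return True if output_path resolves to a location under repo_root."""
    # resolve() normalizes ".." segments and follows symlinks where possible.
    out = Path(output_path).resolve()
    root = Path(repo_root).resolve()
    try:
        out.relative_to(root)
    except ValueError:
        # relative_to raises ValueError when out is not under root.
        return False
    return True
```

In the setup from this issue, a metrics directory such as /data/metrics would fail this check for a repo located under the home partition, which is the situation a validation step could report up front.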
This will be fixed when we work on https://github.com/iterative/dvc/issues/3920. Closing in favour of that issue.