dvc exp run: experiment metrics are not reported when metric files are on another device than training code

Bug Report

Issue name

dvc exp run runs but does not store metrics.

Description

I’m running my training script on /dev/mapper/system-home and it writes its outputs (model checkpoints, metrics) to /data/.cache, which is located on another partition (/dev/sdb1). /dev/sdb1 is a deliberately large partition where we are supposed to store large files. Running dvc exp run works fine, but after completion neither dvc exp show nor dvc metrics show displays any metrics.

When outputting metrics to a folder on the same partition as the training script (/dev/mapper/system-home), dvc exp show works perfectly and shows metrics.

When using verbose mode, I get the following errors:

2022-06-08 17:49:08,234 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
    return self.fs.reflink(from_info, to_info)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
    return System.reflink(path1, path2)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out

The full traceback can be found here: trace.Log

The "Invalid cross-device link" part suggests that dvc cannot handle cross-device operations.
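To check whether two paths really sit on different devices, one rough approach is to compare the device IDs reported by os.stat. This is only a diagnostic sketch: the helper name same_device is hypothetical and not part of DVC or dvclive, and the paths in the commented usage line are the ones from this report.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths report the same device ID.

    Hard links and reflinks only work within one filesystem, so a
    False result here is consistent with an EXDEV
    ("Invalid cross-device link") error when linking between the paths.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Usage with the layout described in this report (assumed paths):
# same_device(".", "/data/metrics")
```

If this returns False for the workspace and the metrics directory, the link failures in the traceback above are expected.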

Reproduce

  1. Create a default project on partition /sda1/foo1 that trains and evaluates a model and writes its metrics to another device, /sdb1/foo2:
# train.py on /sda1/foo1
from dvclive import Live

live = Live("/data/metrics")  # /data is mounted on /sdb1/foo2
for epoch in epochs:  # epochs is defined elsewhere in the training script
    metrics = ...  # dict mapping metric names to values for this epoch
    for metric_name, value in metrics.items():
        live.log(metric_name, value)
    live.next_step()

Example of /data/metrics.json:

{
    "step": 1,
    "loss": 0.7107148170471191,
    "directed_f1_weighed": 0.0,
    "undirected_f1_weighed": 0.0,
    "oriented_acc": 0.8346456692913385,
    "officical_f1_macro": 0.0
}

Example of /data/metrics/scalar/loss.tsv:

timestamp	step	loss
1654703111346	0	0.8031530231237411
1654703334339	1	0.7107148170471191
  2. Run dvc exp show: it doesn’t show any metrics column.

Expected

dvc exp show and dvc metrics show should display the metrics columns.

Environment information

Python 3.8.13

Description: Ubuntu 20.04.3 LTS
Release: 20.04
dvclive 0.8.2

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.13 on Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Supports:
        hdfs (fsspec = 2022.5.0, pyarrow = 3.0.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-home
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/system-home
Repo: dvc, git

Additional Information (if any):

I think the error comes from missing support for cross-device copying (see https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link). Do you have any ideas? Thanks for this nice piece of software 👍
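For illustration, the usual way to cope with errno 18 is to attempt a link and fall back to a plain copy when the link fails across devices. The sketch below is not DVC's implementation; link_or_copy is a hypothetical helper that only shows the general pattern with the standard library.

```python
import errno
import os
import shutil
import tempfile


def link_or_copy(src: str, dst: str) -> str:
    """Hard-link src to dst, copying instead if linking is not possible.

    Returns "hardlink" or "copy" so the caller can see which path ran.
    """
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError as exc:
        # EXDEV: src and dst are on different devices.
        # EPERM: the filesystem does not allow hard links.
        if exc.errno not in (errno.EXDEV, errno.EPERM):
            raise
    shutil.copy2(src, dst)  # a copy works across devices
    return "copy"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "metrics.json")
        with open(src, "w") as handle:
            handle.write('{"loss": 0.71}')
        print(link_or_copy(src, os.path.join(tmp, "metrics_linked.json")))
```

A copy costs extra disk space and time compared with a link, which is why it is typically the last option tried.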

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

3 reactions
pmrowla commented, Jun 27, 2022

Right, normally the output paths are validated in commands like dvc add or dvc stage add, and DVC will error out if the output is outside the repo. We probably need to add similar checks in dvclive.

cc @daavoo
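As a rough idea of what such a check could look like, the sketch below tests whether an output path falls inside a repository root. It is an assumption for illustration only, not the validation DVC or dvclive performs; is_inside and the example paths are hypothetical.

```python
from pathlib import Path


def is_inside(root: str, candidate: str) -> bool:
    """Return True if candidate is root itself or located beneath it."""
    root_path = Path(root).resolve()
    candidate_path = Path(candidate).resolve()
    # Comparing whole path components avoids treating /repository
    # as if it were inside /repo.
    return candidate_path == root_path or root_path in candidate_path.parents


# Hypothetical usage: reject a metrics directory outside the repo.
# if not is_inside("/path/to/repo", "/data/metrics"):
#     raise ValueError("metrics output is outside the repository")
```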

0 reactions
skshetry commented, Jun 28, 2022

This will be fixed when we work on https://github.com/iterative/dvc/issues/3920. Closing in favour of that issue.
