
add: deletes files from the cache on NAS servers on Windows when duplicate images are present


Bug Report

Description

With @anasitomtn, we have been working on using DVC on a Windows NAS server with an NTFS file system. One of our data scientists reported a strange issue when he started to use DVC: files that were supposed to be copied to the cache disappeared entirely when he ran the dvc add command.

We managed to narrow the issue down. Initially we could not reproduce it with the same dvc and Python versions on Windows and with different images. However, when we used the same images as him, the issue appeared again. Reducing the images folder to only two duplicate images was enough to trigger the bug.

We are also investigating an issue with links on a Windows NAS which may or may not be related to this.

There appears to be an issue with os.rename (see the “Additional Information” section). Our theory is that when a duplicate is present, dvc creates a cache file named after the hash of the first duplicate image. DVC seems to assume that all hashes are unique when building the cache, so when it tries to create a cache file for the second duplicate image, it fails: it lacks the permissions to replace the existing cache file with the new one, which has the exact same name since the hash is deterministic. This hypothesis still needs to be confirmed. Please note that all files are removed from the original folder in any case, not only the duplicates.
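To make the os.rename angle concrete, here is a minimal standalone sketch (our illustration, not DVC's code; the hash-named file is borrowed from the logs below). When the destination already exists, os.rename silently replaces it on POSIX but raises on Windows, at which point shutil.move falls back to a copy that can itself fail on an SMB share:

import os
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "cache_entry.tmp")  # staging file for the second duplicate
dst = os.path.join(tmp, "d00fece80aab30a39148e4418ce4ca6a")  # hash name already in the cache

for path in (src, dst):
    with open(path, "wb") as f:
        f.write(b"duplicate image bytes")

try:
    os.rename(src, dst)  # dst exists: POSIX replaces it silently...
    print("rename replaced the existing file (POSIX behaviour)")
except OSError as exc:
    # ...while Windows raises; shutil.move() then falls back to copy2() +
    # unlink, and on our NAS that fallback copy fails with "accès refusé".
    print("rename refused (Windows behaviour):", exc)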

This issue from 2019 appears to have a similar configuration, although it was run on Ubuntu rather than Windows (in our case, there are no issues on Linux). I do not think the original issue was fixed; instead, the ticket was closed when another bug related to “dvc version”, mentioned in the ticket thread, was fixed.

Fortunately, all of our tests were run on test data, but we believe this bug can be very dangerous for data scientists who run experiments on production data stored on a NAS, as it can happen at any point in the dvc workflow (before any push to an S3 remote, for instance). Even worse, the bug wipes the entire batch of images from the workspace, not just the duplicates, as long as they are included in the new dvc add: if you run dvc add images_folder after adding 1,000 images containing only two duplicates to a folder already tracked by dvc, all 1,000 images are deleted from the workspace and never make it into the cache. If many images are already present in the workspace, the data scientist may never notice that the new images have disappeared. If a production pipeline runs dvc add commands for ML experiments on a Windows NAS, images could disappear silently.

Reproduce

The two scenarios produce the same result; we include both for easier reproduction.

1 - Initial situation

  1. Start in an empty folder on the NAS server
  2. Put an “images” folder inside containing at least two duplicate images (and other images if you want)
  3. git init
  4. dvc init

The project should look like this:

PROJECT
    |____ images/
    |____ .dvc/
    |____ .git/
  5. dvc add images/ -v
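For convenience, scenario 1 as a script (a sketch only; base.jpg is a hypothetical stand-in for any image you have at hand):

import os
import shutil
import subprocess

os.makedirs("images", exist_ok=True)
shutil.copy("base.jpg", "images/img_1.jpg")
shutil.copy("base.jpg", "images/img_2.jpg")  # byte-identical duplicate

subprocess.run(["git", "init"], check=True)
subprocess.run(["dvc", "init"], check=True)
# On an affected Windows NAS, this last step errors out and the images
# are removed from the workspace without reaching the cache.
subprocess.run(["dvc", "add", "images/", "-v"], check=True)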

2 - Adding a new batch of images

  1. Put a batch of images containing at least two duplicates inside the project folder (it does not matter whether the files are added to the existing “images” folder or to a new “images_2” folder). The project should look like this:
PROJECT
    |____ images/
    |____ .dvc/
    |____ .git/

or

PROJECT
    |____ images/
    |____ images_2/
    |____ .dvc/
    |____ .git/
  2. dvc add images/ -v
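Again as a script (continuing the sketch above; another.jpg is a hypothetical stand-in for any other image, with the new batch placed in its own folder):

import os
import shutil
import subprocess

os.makedirs("images_2", exist_ok=True)
shutil.copy("another.jpg", "images_2/img_1.jpg")
shutil.copy("another.jpg", "images_2/img_2.jpg")  # the new batch's duplicate pair

# Same failure mode: the whole new batch vanishes from the workspace.
subprocess.run(["dvc", "add", "images_2/", "-v"], check=True)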

Expected

The images should be moved to the cache without being removed from the workspace.

At a minimum, DVC should output an error when it fails to copy the files to the cache, and it should not touch any of the original files.

Environment information

Any version of DVC and Python running on Windows, on a NAS server.

2021-07-28 17:40:28,766 DEBUG: Version info for developers:
DVC version: 2.5.4 (pip)

Platform: Python 3.7.10 on Windows-10-10.0.17763-SP0
Supports:
        http (requests = 2.26.0),
        https (requests = 2.26.0)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: None
Workspace directory: ('unknown', 'none')
Repo: dvc, git

Additional Information (if any):

Here are the logs (note: “accès refusé” means “access denied”, and “le processus ne peut pas accéder au fichier car ce fichier est utilisé par un autre processus” means “the process cannot access the file because it is being used by another process”):

2021-07-28 15:33:50,749 DEBUG: Removing 'random_images\test\img_with_labels_batch_0_img_2.jpg'
2021-07-28 15:33:50,801 DEBUG: state save (3148873755809107765, 1624811388452225792, 25167) 86054b29dbbdec89e748fe596058c843
2021-07-28 15:33:50,902 DEBUG: 'random_images\test\img_with_labels_batch_0_img_6.jpg' file already exists, skipping
2021-07-28 15:33:50,903 DEBUG: Removing 'random_images\test\img_with_labels_batch_0_img_6.jpg'
2021-07-28 15:33:50,934 DEBUG: state save (3150879666208347930, 1624811388499105280, 2259) d00fece80aab30a39148e4418ce4ca6a
2021-07-28 15:33:51,182 ERROR: unexpected error - [WinError -2147024891] Accès refusé


Traceback (most recent call last):
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 566, in move
    os.rename(src, real_dst)
PermissionError: [WinError 32] Le processus ne peut pas accéder au fichier car ce fichier est utilisé par un autre processus: 'N:\\Projets01\\STAGE\\test_dvc_isma_5\\.dvc\\cache\\d0\\0fece80aab30a39148e4418ce4ca6a.a3eEysHxDTQD2RF3DvqSdQ' -> 'N:\\Projets01\\STAGE\\test_dvc_isma_5\\.dvc\\cache\\d0\\0fece80aab30a39148e4418ce4ca6a'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\main.py", line 55, in main
    ret = cmd.do_run()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\command\base.py", line 50, in do_run
    return self.run()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\command\add.py", line 32, in run
    jobs=self.args.jobs,
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\__init__.py", line 50, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
    return method(repo, *args, **kw)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\add.py", line 131, in add
    **kwargs,
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\add.py", line 195, in _process_stages
    stage.commit()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\funcy\decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
    return call()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\funcy\decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\stage\__init__.py", line 492, in commit
    out.commit(filter_info=filter_info)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\output.py", line 567, in commit
    objects.save(self.odb, obj)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\objects\__init__.py", line 29, in save
    future.result()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\_base.py", line 428, in result
    return self.__get_result()
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\objects\db\base.py", line 59, in add
    self.fs.move(path_info, cache_info)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\fs\local.py", line 97, in move
    move(from_info, to_info)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\utils\fs.py", line 110, in move
    shutil.move(tmp, dst)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 580, in move
    copy_function(src, real_dst)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 266, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\speedcopy\__init__.py", line 289, in copyfile
    '\\\\?\\' + dest_file, None)
  File "_ctypes/callproc.c", line 922, in GetResult
PermissionError: [WinError -2147024891] Accès refusé

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
cubrink commented, Aug 4, 2021

@efiop Just spoke to IT, as I couldn’t speak authoritatively about the network configuration. I was told that access to the NAS is via the SMB protocol, not NFS.

1 reaction
cubrink commented, Aug 3, 2021

I am also running into this problem. I can reproduce the error as described and I get the same verbose output as @louistransfer.

I’ve done some debugging and I’m thinking it is some sort of a race condition. While debugging I saw this error which wasn’t shown in the verbose output:

[WinError 183] Cannot create a file when that file already exists: '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350.dKvQKmJbSYaXVe6gD9k2j6' -> '<path_to_my_dvc_project>\\.dvc\\cache\\f6\\5bcf2182da5af309d2b30c77f79350'
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\shutil.py", line 791, in move
    os.rename(src, real_dst)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\utils\fs.py", line 114, in move
    shutil.move(tmp, dst)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\fs\local.py", line 97, in move
    move(from_info, to_info)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\site-packages\dvc\objects\db\base.py", line 78, in add
    self.fs.move(path_info, cache_info)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\concurrent\futures\thread.py", line 80, in _worker
    work_item.run()
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 932, in _bootstrap_inner
    self.run()
  File "C:\Users\cbrinker\Anaconda3\envs\dvc-test\Lib\threading.py", line 890, in _bootstrap
    self._bootstrap_inner()

(Note that the file already existed as I had duplicate files that I was adding)

I saw in the verbose stack trace that dvc.objects.__init__.save() was being called, which uses concurrent.futures.ThreadPoolExecutor. I tried to limit it to a single thread using dvc add images/ -v --jobs 1, but apparently --jobs requires --to-remote to be used.

Instead I put a breakpoint inside of dvc.objects.__init__.save() and set jobs = 1 manually. Doing this prevented the files from being removed.
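A toy sketch of the suspected collision (illustration only, not DVC's actual code; the hash name is borrowed from the error above). Two worker threads each move a staging file to the same hash-named cache path, which is what byte-identical inputs would produce:

import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

cache_dir = tempfile.mkdtemp()
# Duplicates hash to the same value, so both workers target a single path.
dst = os.path.join(cache_dir, "5bcf2182da5af309d2b30c77f79350")

def move_to_cache(i):
    src = os.path.join(cache_dir, f"staging_{i}.tmp")
    with open(src, "wb") as f:
        f.write(b"duplicate content")
    shutil.move(src, dst)  # both workers race toward the same destination

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(move_to_cache, i) for i in range(2)]
    for future in futures:
        try:
            future.result()
        except OSError as exc:
            # On Windows, and especially over SMB, the losing worker can hit
            # [WinError 183] or [WinError 32] here.
            print("worker collided:", exc)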

Environment

dvc doctor

DVC version: 2.5.4 (pip)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
        http (requests = 2.25.1),
        https (requests = 2.25.1)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: local, local, local, local, local
Workspace directory: ('unknown', 'none')
Repo: dvc, git

Also, I’ve run pip list and I do not have speedcopy installed.
