add: deletes files from the cache on NAS servers on Windows when duplicate images are present
Bug Report
Description
With @anasitomtn, we have been working on using DVC on a Windows NAS server with an NTFS file system. One of our data scientists reported a strange issue when he started to use DVC: files that were supposed to be copied to the cache completely disappeared when he ran the dvc add command.
We managed to narrow the issue down. Initially we could not reproduce the issue with the same dvc and Python versions on Windows and with different images. However, when we used the same images as him, the issue appeared again. Reducing the images folder to only two duplicate images was enough to trigger the bug.
We are also investigating an issue with links on a Windows NAS which may or may not be related to this.
There appears to be an issue with os.rename (see the “Additional Information” section). Our theory is that when a duplicate is present, DVC creates a cache file named after the hash of the first duplicate image. DVC seems to assume that all hashes are unique when building the cache, so when it tries to create a cache file for the second duplicate image, the operation fails: it lacks the permissions to replace the existing cache file with the new one (which has the exact same name, since the hash is deterministic). This hypothesis still needs to be confirmed. Note that all files are removed from the original folder in any case (not only the duplicates).
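A minimal sketch of this hypothesis in plain Python (the hash-named destination is illustrative, copied from our logs; this is not DVC’s actual code):

```python
import os
import pathlib
import tempfile

# os.rename() silently replaces an existing destination on POSIX but
# raises if the destination already exists on Windows. Two byte-identical
# images hash to the same cache name, so the second move targets a file
# the first move just created.
tmp = pathlib.Path(tempfile.mkdtemp())
dst = tmp / "d00fece80aab30a39148e4418ce4ca6a"  # hash shared by both duplicates
src1, src2 = tmp / "img_1.jpg", tmp / "img_2.jpg"
src1.write_bytes(b"same bytes")
src2.write_bytes(b"same bytes")

os.rename(src1, dst)        # first duplicate: cache entry created
try:
    os.rename(src2, dst)    # second duplicate: FileExistsError on Windows,
except OSError as exc:      # silent replace on Linux
    print("rename failed:", exc)
```

(The actual traceback shows shutil.move, which falls back to a copy when the rename fails; in our case that fallback copy failed too.)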
This issue from 2019 appears to describe a similar configuration; however, it was run on Ubuntu and not on Windows (in our case, there are no issues on Linux). We do not think the original problem was fixed; instead, the ticket was closed when another bug related to “dvc version”, mentioned in the ticket thread, was fixed.
Fortunately, all of our tests were run on test data, but we believe this bug can be very dangerous for data scientists who want to run experiments on production data stored on a NAS, as it can happen at any point in the DVC workflow (before any push to an S3 remote, for instance). Even worse, the bug wipes out every batch of images included in the failing dvc add: if you run “dvc add images_folder” after adding 1000 images containing only two duplicates to a folder already tracked by DVC, the 1000 images will be deleted from the workspace and will not be added to the cache. If many images are already present in the workspace, the data scientist may never notice that the new images have disappeared. If a production pipeline with dvc add commands runs on a Windows NAS for ML experiments, images could disappear silently.
Reproduce
The results of the two scenarios are the same; we include both for easier reproduction.
1 - Initial situation
- Go to an empty folder on the NAS server
- Put an “images” folder inside containing at least two image duplicates (and other images if you want); a sketch for generating such duplicates follows this scenario
git init
dvc init
The project should look like this:
PROJECT
|____ images/
|____ .dvc/
|____ .git/
dvc add images/ -v
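For reference, here is a minimal way to generate the duplicate test data (a sketch; the filenames and bytes are illustrative, any two byte-identical files will do):

```python
import pathlib
import shutil

# DVC hashes file content, so exact byte-for-byte duplicates map to the
# same cache entry regardless of their filenames.
images = pathlib.Path("images")
images.mkdir(exist_ok=True)
(images / "img_a.jpg").write_bytes(b"\xff\xd8\xff\xe0" + b"fake jpeg payload")
shutil.copyfile(images / "img_a.jpg", images / "img_b.jpg")  # exact duplicate
```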
2 - Adding a new batch of images
- Put a batch of images containing at least two duplicates inside the project folder (it doesn’t matter whether the files go into the existing “images” folder or into a new “images_2” folder). The project should look like this:
PROJECT
|____ images/
|____ .dvc/
|____ .git/
or
PROJECT
|____ images/
|____ images_2/
|____ .dvc/
|____ .git/
dvc add images/ -v
Expected
The images should be copied to the cache without being removed from the workspace.
At the very least, DVC should output an error when it fails to copy the files to the cache, and it should not touch any of the original files.
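One possible safer ordering (a sketch of ours, not DVC’s actual code): delete the workspace file only after the cache entry is known to exist, and use os.replace, which overwrites an existing destination on both POSIX and Windows:

```python
import os
import shutil

def cache_then_remove(src: str, dst: str) -> None:
    # If the content is already cached (a duplicate), the workspace copy
    # is redundant and can be dropped without data loss.
    if os.path.exists(dst):
        os.remove(src)
        return
    # Copy first, promote atomically, and only then remove the source:
    # a failure at any step leaves the original file in the workspace.
    tmp = dst + ".tmp"
    shutil.copy2(src, tmp)
    os.replace(tmp, dst)  # overwrites an existing dst on POSIX and Windows
    os.remove(src)
```

With this ordering, the worst outcome of a permission error is a leftover .tmp file in the cache, not a silently deleted image.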
Environment information
Any version of DVC and Python running on Windows, on a NAS server.
2021-07-28 17:40:28,766 DEBUG: Version info for developers:
DVC version: 2.5.4 (pip)
Platform: Python 3.7.10 on Windows-10-10.0.17763-SP0
Supports:
http (requests = 2.26.0),
https (requests = 2.26.0)
Cache types:
Cache directory: ('unknown', 'none')
Caches: local
Remotes: None
Workspace directory: ('unknown', 'none')
Repo: dvc, git
Additional Information (if any):
Here are the logs (note: “accès refusé” means “access denied”, and the WinError 32 message below means “the process cannot access the file because it is being used by another process”):
2021-07-28 15:33:50,749 DEBUG: Removing 'random_images\test\img_with_labels_batch_0_img_2.jpg'
2021-07-28 15:33:50,801 DEBUG: state save (3148873755809107765, 1624811388452225792, 25167) 86054b29dbbdec89e748fe596058c843
2021-07-28 15:33:50,902 DEBUG: 'random_images\test\img_with_labels_batch_0_img_6.jpg' file already exists, skipping
2021-07-28 15:33:50,903 DEBUG: Removing 'random_images\test\img_with_labels_batch_0_img_6.jpg'
2021-07-28 15:33:50,934 DEBUG: state save (3150879666208347930, 1624811388499105280, 2259) d00fece80aab30a39148e4418ce4ca6a
2021-07-28 15:33:51,182 ERROR: unexpected error - [WinError -2147024891] Accès refusé
Traceback (most recent call last):
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 566, in move
os.rename(src, real_dst)
PermissionError: [WinError 32] Le processus ne peut pas accéder au fichier car ce fichier est utilisé par un autre processus: 'N:\\Projets01\\STAGE\\test_dvc_isma_5\\.dvc\\cache\\d0\\0fece80aab30a39148e4418ce4ca6a.a3eEysHxDTQD2RF3DvqSdQ' -> 'N:\\Projets01\\STAGE\\test_dvc_isma_5\\.dvc\\cache\\d0\\0fece80aab30a39148e4418ce4ca6a'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\main.py", line 55, in main
ret = cmd.do_run()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\command\base.py", line 50, in do_run
return self.run()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\command\add.py", line 32, in run
jobs=self.args.jobs,
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\__init__.py", line 50, in wrapper
return f(repo, *args, **kwargs)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
return method(repo, *args, **kw)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\add.py", line 131, in add
**kwargs,
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\repo\add.py", line 195, in _process_stages
stage.commit()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\funcy\decorators.py", line 45, in wrapper
return deco(call, *dargs, **dkwargs)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
return call()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\funcy\decorators.py", line 66, in __call__
return self._func(*self._args, **self._kwargs)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\stage\__init__.py", line 492, in commit
out.commit(filter_info=filter_info)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\output.py", line 567, in commit
objects.save(self.odb, obj)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\objects\__init__.py", line 29, in save
future.result()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\_base.py", line 428, in result
return self.__get_result()
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\_base.py", line 384, in __get_result
raise self._exception
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\concurrent\futures\thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\objects\db\base.py", line 59, in add
self.fs.move(path_info, cache_info)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\fs\local.py", line 97, in move
move(from_info, to_info)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\dvc\utils\fs.py", line 110, in move
shutil.move(tmp, dst)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 580, in move
copy_function(src, real_dst)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\shutil.py", line 266, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "c:\users\elatifi\miniforge3\envs\py37-tf250\lib\site-packages\speedcopy\__init__.py", line 289, in copyfile
'\\\\?\\' + dest_file, None)
File "_ctypes/callproc.c", line 922, in GetResult
PermissionError: [WinError -2147024891] Accès refusé
Comments
@efiop Just spoke to IT as I couldn’t speak authoritatively about the network configuration. I was told the NAS is accessed via the SMB protocol, not NFS.
I am also running into this problem. I can reproduce the error as described and I get the same verbose output as @louistransfer.
I’ve done some debugging and I’m thinking it is some sort of race condition. While debugging I saw an error which wasn’t shown in the verbose output (note that the file already existed, as I had duplicate files that I was adding).
I saw in the verbose stack trace that dvc.objects.__init__.save() was being called, which uses concurrent.futures.ThreadPoolExecutor. I tried to limit it to 1 thread using dvc add images/ -v --jobs 1, but apparently --jobs requires --to-remote to be used. Instead I put a breakpoint inside of dvc.objects.__init__.save() and set jobs = 1 manually. Doing this prevented the files from being removed.
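A sketch of the kind of check-then-move race that would explain why limiting to one job avoids the problem (the names are illustrative, not DVC’s actual code): with a single worker, the second duplicate sees the cache entry and skips cleanly; with several workers, both can pass the existence check before either rename lands, and the loser fails on Windows:

```python
import concurrent.futures
import os
import pathlib
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())
cache_path = tmp / "d00fece80aab30a39148e4418ce4ca6a"

# Two byte-identical files: same content hash, same cache destination.
sources = []
for name in ("img_a.jpg", "img_b.jpg"):
    p = tmp / name
    p.write_bytes(b"same bytes")
    sources.append(p)

def add_to_cache(src: pathlib.Path) -> None:
    if cache_path.exists():
        os.remove(src)  # "file already exists, skipping"
        return
    os.rename(src, cache_path)  # fails on Windows if the check raced

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for future in [pool.submit(add_to_cache, p) for p in sources]:
        try:
            future.result()
        except OSError as exc:
            print("worker failed:", exc)
```

Whether the collision actually fires is timing-dependent, so the sketch may need more files or repeated runs to trigger it.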
Environment
dvc doctor
Also, I’ve run pip list and I do not have speedcopy installed.