Folder deletion in pathTools appears to be accounting for potential parallelism oddly
In the implementation of cleanPath()
there is this fun chunk of code:
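Roughly, the pattern in question is the following (a sketch rather than the exact pathTools source; the retry count, delay, and use of shutil.rmtree here are assumptions):

```python
import os
import shutil
import time

def cleanPath(path):
    # Retry the delete a few times, swallowing any error,
    # and hope one of the attempts sticks.
    for _ in range(5):  # retry count is a guess
        try:
            if os.path.exists(path):
                shutil.rmtree(path)
        except Exception:
            pass
        if not os.path.exists(path):
            return
        time.sleep(0.5)  # delay is a guess
```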
The only reason I can imagine for attempting the deletion in a loop with delays and a broad try/except is for scenarios in which the folder deletion is being attempted in parallel, and/or targeting a directory structure on a shared network drive. If this is the case, then having every processor try to delete the folder and hoping for the best is a pretty sketchy way to go about it. If this is supposed to be possible in parallel, we should actually address the complexities of air-traffic control explicitly in MPI.
This would look something like having only one rank responsible for the deletion, while all others wait until the directory is apparently removed, with a barrier at the end to synchronize. The main concerns here are questions like:
- Do we expect all call sites to be collective? If not, any communication we do may lead to deadlocks when not all processors in a communicator call cleanPath() at the same time.
- Are all processors in a communicator attempting to clear the same path? If not, one rank per desired path deletion will need to be responsible.
These aren’t simple questions to answer from within the function, so it is likely that decisions like this should be made from the call site. Something like this is easy enough to do:
```python
import os
from time import sleep

if armi.MPI_RANK == 0:
    cleanPath(path)  # only the primary rank actually deletes
while os.path.exists(path):
    sleep(0.1)  # everyone else polls until the removal is visible to them
armi.MPI_COMM.barrier()  # re-synchronize before anyone proceeds
```
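If we want the coordination to be explicit in MPI rather than every rank polling the filesystem, a variant could broadcast the outcome from rank 0 (a sketch, assuming armi.MPI_COMM behaves like an mpi4py communicator):

```python
if armi.MPI_RANK == 0:
    cleanPath(path)
# bcast is collective, so every rank blocks here until rank 0 has
# finished its attempt; the broadcast value tells them how it went
succeeded = armi.MPI_COMM.bcast(not os.path.exists(path), root=0)
if not succeeded:
    raise OSError("could not remove {}".format(path))
```

Note that this only reflects rank 0's view of the filesystem; on a laggy network share, the other ranks may still want the os.path.exists() polling loop before they trust that the directory is gone on their end.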
This may be separate from this issue, but I have experienced problems when writing a file to a drive location and then immediately trying to access it, where I get an OSError. I wonder whether, if we were to delete these sleep timers and run some cases, this would uncover other issues.
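For reference, the usual band-aid for that kind of delayed-visibility OSError is a short bounded retry around the access; a minimal sketch (openWithRetry is a hypothetical helper, not anything in ARMI):

```python
import time

def openWithRetry(fname, mode="r", attempts=5, delay=0.1):
    # Hypothetical helper: retry opening a freshly written file that a
    # slow or shared drive has not made visible yet.
    for attempt in range(attempts):
        try:
            return open(fname, mode)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```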
If you are working on it, I will assign it to you. Fair is fair.