Folder deletion in pathTools appears to be accounting for potential parallelism oddly
In the implementation of cleanPath()
there is this fun chunk of code:
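Roughly, the pattern in question is the following (a sketch rather than the exact pathTools source; the retry count, delay, and use of shutil.rmtree here are assumptions):

```python
import os
import shutil
import time

def cleanPath(path):
    # Retry the delete a few times, swallowing any error,
    # and hope one of the attempts sticks.
    for _ in range(5):  # retry count is a guess
        try:
            if os.path.exists(path):
                shutil.rmtree(path)
        except Exception:
            pass
        if not os.path.exists(path):
            return
        time.sleep(0.5)  # delay is a guess
```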
The only reason I can imagine for attempting the deletion in a loop with delays and a broad try/except is for scenarios in which the folder deletion is being attempted in parallel, and/or targeting a directory structure on a shared network drive. If this is the case, then having every processor try to delete the folder and hoping for the best is a pretty sketchy way to go about it. If this is supposed to be possible in parallel, we should actually address the complexities of air-traffic control explicitly in MPI.
This would look something like having only one rank responsible for the deletion, while all others wait until the directory is apparently removed, with a barrier at the end to synchronize. The main concerns here are questions like:
- Do we expect all call sites to be collective? If not, any communication we do may lead to deadlocks when not all processors in a communicator call cleanPath() at the same time.
- Are all processors in a communicator attempting to clear the same path? If not, one rank per desired path deletion will need to be responsible.
These aren’t simple questions to answer from within the function, so it is likely that decisions like this should be made from the call site. Something like this is easy enough to do:
```python
import os
from time import sleep

if armi.MPI_RANK == 0:
    cleanPath(path)  # only the primary rank actually deletes
while os.path.exists(path):
    sleep(0.1)  # everyone else polls until the removal is visible to them
armi.MPI_COMM.barrier()  # re-synchronize before anyone proceeds
```
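If we want the coordination to be explicit in MPI rather than every rank polling the filesystem, a variant could broadcast the outcome from rank 0 (a sketch, assuming armi.MPI_COMM behaves like an mpi4py communicator):

```python
if armi.MPI_RANK == 0:
    cleanPath(path)
# bcast is collective, so every rank blocks here until rank 0 has
# finished its attempt; the broadcast value tells them how it went
succeeded = armi.MPI_COMM.bcast(not os.path.exists(path), root=0)
if not succeeded:
    raise OSError("could not remove {}".format(path))
```

Note that this only reflects rank 0's view of the filesystem; on a laggy network share, the other ranks may still want the os.path.exists() polling loop before they trust that the directory is gone on their end.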
This may be separate from this issue, but I have experienced problems when writing a file to a drive location and then immediately trying to access it, where I get an OSError. I wonder whether, if we were to delete these sleep timers and run some cases, this would uncover other issues.
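For reference, the usual band-aid for that kind of delayed-visibility OSError is a short bounded retry around the access; a minimal sketch (openWithRetry is a hypothetical helper, not anything in ARMI):

```python
import time

def openWithRetry(fname, mode="r", attempts=5, delay=0.1):
    # Hypothetical helper: retry opening a freshly written file that a
    # slow or shared drive has not made visible yet.
    for attempt in range(attempts):
        try:
            return open(fname, mode)
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```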
If you are working on it, I will assign it to you. Fair is fair.