Race condition when multiple processes try to compile a module at once
Hi,
Great package by the way!
I’ve encountered an issue when multiple processes are spawned that all race to compile the same module. This can also occur when multiple processes are spawned on different hosts and share the same network filesystem. Such a situation is common when distributing work between multiple processes or hosts for AI or data analytics.
Here is a demonstration (in the shell):
echo '// cppimport
#include <pybind11/pybind11.h>
namespace py = pybind11;
int square(int x) {
    return x * x;
}
PYBIND11_MODULE(somecode, m) {
    m.def("square", &square);
}
/*
<%
setup_pybind11(cfg)
%>
*/' > somecode.cpp
echo 'import cppimport.import_hook
import somecode
somecode.square(9)' > test.py
rm somecode.cpython-*
for i in {1..100}; do python3 test.py & done
On my system, around 4 out of 100 processes exit with an error. The shell output includes:
error: could not delete '/localdata/joshl/sandbox/somecode.cpython-36m-x86_64-linux-gnu.so': No such file or directory
...
Exit 1 python3 test.py
...
Bus error (core dumped) python3 test.py
These errors don’t appear when the binary already exists.
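To put a number on the failure rate rather than reading it off the shell output, the same experiment can be driven from Python. This is only a sketch: `count_failures` is a hypothetical helper, not part of cppimport, and it assumes `test.py` and `somecode.cpp` from the demonstration above sit in the current directory.

```python
import subprocess
import sys


def count_failures(cmd, n=100):
    """Start n copies of cmd at once and return how many exit non-zero."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(n)
    ]
    # wait() returns the exit code; a negative value means the process was
    # killed by a signal (e.g. a bus error), which also counts as a failure.
    return sum(1 for p in procs if p.wait() != 0)
```

After deleting any existing `somecode.cpython-*` binary, calling `count_failures([sys.executable, "test.py"])` should give a count comparable to the roughly 4 in 100 reported above, though the exact number will depend on the machine and filesystem.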
To mitigate this in our applications, we use a file lock so that only one process compiles the module at a time. A process first checks whether the binary exists; if not, it tries to acquire the file lock. If it can't acquire the lock, it waits until the binary exists, the lock becomes available, or a timeout expires. Here is an example of how it can be done (app code):
import logging
import os
from contextlib import suppress
from time import sleep, time

from filelock import FileLock, Timeout

from cppimport.checksum import is_checksum_valid

# filepath, module_data, timeout and template_and_build are provided by the
# surrounding app code.
binary_path = module_data['ext_path']
lock_path = binary_path + '.lock'

t = time()
while not (os.path.exists(binary_path) and is_checksum_valid(module_data)) and time() - t < timeout:
    try:
        with FileLock(lock_path, timeout=1):
            # Re-check under the lock: another process may have finished the build.
            if os.path.exists(binary_path) and is_checksum_valid(module_data):
                break
            # BUILD BINARY
            template_and_build(filepath, module_data)
    except Timeout:
        logging.debug(f'{os.getpid()}: Could not obtain lock')
        sleep(1)

if not (os.path.exists(binary_path) and is_checksum_valid(module_data)):
    raise Exception(
        f'Could not compile binary as lock already taken and timed out. Lock file will be deleted: {lock_path}')

if os.path.exists(lock_path):
    with suppress(OSError):
        os.remove(lock_path)
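For anyone who wants to try the check, lock, re-check pattern without cppimport or `filelock` installed, here is a rough standard-library sketch. It is not the snippet above and not what cppimport does: `build_once` and `fake_build` are hypothetical names, the lock is a plain file created with `O_EXCL` instead of a `FileLock`, and the validity check is reduced to "the file exists" with no checksum.

```python
import os
import time
from contextlib import suppress


def build_once(target_path, build, timeout=60.0, poll=0.1):
    """Run build() in at most one process at a time until target_path exists.

    The lock is a file created with O_EXCL, a stdlib stand-in for FileLock.
    """
    lock_path = target_path + ".lock"
    deadline = time.monotonic() + timeout
    while not os.path.exists(target_path):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out waiting for {target_path}")
        try:
            # Atomic create: fails if another process already holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(poll)  # someone else is building; poll again
            continue
        try:
            # Re-check under the lock: the build may have finished meanwhile.
            if not os.path.exists(target_path):
                build(target_path)
        finally:
            os.close(fd)
            with suppress(OSError):
                os.remove(lock_path)
    return target_path


def fake_build(path):
    """Hypothetical stand-in for the real compile step."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write("binary")
    # Rename into place so other processes never see a half-written file.
    os.replace(tmp, path)
```

One difference worth noting: a lock file created this way is left behind if the holder crashes, so waiters would only give up at the timeout, and exclusive create may not be reliable on every network filesystem. That is part of why a dedicated locking library seems like the better fit for the real fix.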
It would be great if we could upstream the above to cppimport to prevent the race condition errors. If you are happy with this solution, I could contribute it to the appropriate place in cppimport.
Top GitHub Comments
Sorry it’s taken so long - I had to jump through a bunch of internal hoops. Here is a PR: #71
Awesome!