Race condition when multiple processes try to compile a module at once

Hi,

Great package by the way!

I’ve encountered an issue when multiple processes are spawned that all race to compile the same module. This can also occur when multiple processes are spawned on different hosts and share the same network filesystem. Such a situation is common when distributing work between multiple processes or hosts for AI or data analytics.

Here is a demonstration (in the shell):

echo '// cppimport
#include <pybind11/pybind11.h>

namespace py = pybind11;

int square(int x) {
    return x * x;
}

PYBIND11_MODULE(somecode, m) {
    m.def("square", &square);
}
/*
<%
setup_pybind11(cfg)
%>
*/' > somecode.cpp

echo 'import cppimport.import_hook
import somecode
somecode.square(9)' > test.py

rm -f somecode.cpython-*

for i in {1..100}; do python3 test.py & done

On my system, around 4 out of 100 processes exit with an error. The shell output includes:

error: could not delete '/localdata/joshl/sandbox/somecode.cpython-36m-x86_64-linux-gnu.so': No such file or directory
...
Exit 1                  python3 test.py
...
Bus error               (core dumped) python3 test.py

These errors don’t appear when the binary already exists.
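
To put a number on the failure rate, the shell loop can also be driven from Python. This is a minimal sketch: the helper name `count_failures` is hypothetical, and it only counts non-zero exit codes without inspecting the cause.

```python
import subprocess


def count_failures(cmd, n):
    """Start n copies of cmd concurrently and count non-zero exit codes.

    On POSIX a process killed by a signal (e.g. a bus error) reports a
    negative return code, so it is counted as a failure too.
    """
    # Launch everything first so the processes really do race each other.
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    # wait() blocks until each process exits and returns its exit code.
    return sum(1 for p in procs if p.wait() != 0)
```

With the files above in the current directory and the compiled binary removed, `count_failures(["python3", "test.py"], 100)` would return the number of processes that failed in one run.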


To mitigate this in our applications, we use a file lock so that only one process attempts to compile the module at a time. A process first checks whether the binary exists; if not, it tries to obtain the file lock. If it cannot obtain the lock, it waits until the binary exists, the lock becomes available, or a timeout expires. Here is an example of how it can be done (app code):

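import logging
import os
from contextlib import suppress
from time import sleep, time

from filelock import FileLock, Timeout  # third-party "filelock" package
from cppimport.checksum import is_checksum_valid
# Import location may differ between cppimport versions.
from cppimport.importer import template_and_build

# filepath, module_data and timeout (in seconds) come from the calling code.
binary_path = module_data['ext_path']
lock_path = binary_path + '.lock'

t = time()

# Retry until the binary is present and valid, or the overall timeout expires.
while not (os.path.exists(binary_path) and is_checksum_valid(module_data)) and time() - t < timeout:
    try:
        with FileLock(lock_path, timeout=1):
            # Re-check under the lock: another process may have just finished building.
            if os.path.exists(binary_path) and is_checksum_valid(module_data):
                break
            # BUILD BINARY
            template_and_build(filepath, module_data)
    except Timeout:
        logging.debug(f'{os.getpid()}: Could not obtain lock')
        sleep(1)

if not (os.path.exists(binary_path) and is_checksum_valid(module_data)):
    raise Exception(
        f'Could not compile binary as lock already taken and timed out. Lock file will be deleted: {lock_path}')

# Clean up the lock file; ignore the error if another process removed it first.
if os.path.exists(lock_path):
    with suppress(OSError):
        os.remove(lock_path)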
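
The same check, lock, re-check pattern can be tried out without cppimport or the filelock package. The sketch below is a standard-library stand-in, not the code used above: `LockTimeout`, `exclusive_lock` and `build_once` are hypothetical names, and the lock is a file created atomically with `O_CREAT | O_EXCL` instead of `FileLock`. It assumes the filesystem honours exclusive creation, and a crashed holder would leave a stale lock file behind.

```python
import os
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock file could not be created in time."""


@contextmanager
def exclusive_lock(lock_path, timeout=1.0, poll=0.05):
    """Hold a lock represented by a file created with O_CREAT | O_EXCL."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Fails with FileExistsError if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(lock_path)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def build_once(target_path, build, is_valid=os.path.exists,
               timeout=30.0, lock_timeout=1.0):
    """Call build() only if target_path is not valid yet.

    Returns True if this caller ran the build and False if the target
    was already valid. Raises TimeoutError if neither happens in time.
    """
    lock_path = target_path + ".lock"
    deadline = time.monotonic() + timeout
    while not is_valid(target_path):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"timed out waiting for {target_path}")
        try:
            with exclusive_lock(lock_path, timeout=lock_timeout):
                # Re-check under the lock: someone else may have built it.
                if is_valid(target_path):
                    return False
                build()
                return True
        except LockTimeout:
            continue  # lock is busy; loop back and re-check the target
    return False
```

Here `build` stands for whatever produces the binary, and `is_valid` stands for the existence plus checksum test. Releasing the lock in the `finally` clause removes the lock file as soon as the holder is done, so no separate cleanup step is needed on the success path.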

It would be great to upstream the above to cppimport to prevent these race condition errors. If you are happy with this solution, I can contribute it to the appropriate place in cppimport.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

joshlk commented, Jul 4, 2022 (1 reaction)

Sorry it’s taken so long - I had to jump through a bunch of internal hoops. Here is a PR: #71

tbenthompson commented, Jun 9, 2022 (0 reactions)

Awesome!
