PartialTestResult.join_results(result, pickle.load(input)) EOFError libgomp: Thread creation failed: Resource temporarily unavailable
I’m experimenting with parallel execution of tests in the Fedora build system (building Cython 0.29.3).
Up until now the tests were disabled because of #1982, but I have decided to enable them and only skip the failing tests on Big Endian. The tests are quite slow, so I’ve decided to use the equivalent of -j$(nproc) to speed them up.
The number of CPUs, however, is quite arbitrary and differs with each build. The i686 builder that was picked had 48 CPUs, so it used -j48 and failed.
We run python2 tests before python3 tests, so this is where I got a strange error. Let me know if I shall reverse the order to see if this happens on Python 3 as well.
$ /usr/bin/python2 runtests.py -vv -j48
...
======================================================================
ERROR: runTest (__main__.CythonRunTestCase)
compiling (c) and running parallel
----------------------------------------------------------------------
Traceback (most recent call last):
File "runtests.py", line 1266, in run
self.run_tests(result, ext_so_path)
File "runtests.py", line 1284, in run_tests
self.run_doctests(self.module, result, ext_so_path)
File "runtests.py", line 1296, in run_doctests
run_forked_test(result, run_test, self.shortDescription(), self.fork)
File "runtests.py", line 1362, in run_forked_test
PartialTestResult.join_results(result, pickle.load(input))
EOFError
======================================================================
ERROR: runTest (__main__.CythonRunTestCase)
compiling (cpp) and running parallel
----------------------------------------------------------------------
Traceback (most recent call last):
File "runtests.py", line 1266, in run
self.run_tests(result, ext_so_path)
File "runtests.py", line 1284, in run_tests
self.run_doctests(self.module, result, ext_so_path)
File "runtests.py", line 1296, in run_doctests
run_forked_test(result, run_test, self.shortDescription(), self.fork)
File "runtests.py", line 1362, in run_forked_test
PartialTestResult.join_results(result, pickle.load(input))
EOFError
----------------------------------------------------------------------
Ran 179 tests in 147.144s
FAILED (errors=2)
Full log: build.log
This error did not occur on another run, when the builder had just 6 CPUs and -j6 was used.
I’ve tried limiting the number to 16, but I got the same error with -j16 on an 18-core builder.
Currently I’m experimenting with -j7 (inspired by your Travis CI config) and I will report back.
I’ve only experienced this on i686, but that was the only builder I got with 48 CPUs this time. An x86_64 build with -j16 made it through without the error; however, the error might not be deterministic.
My very wild guess is that with massive parallelism, the IO is not so fast and something reads a pickle jar too soon.
Top GitHub Comments
According to the log (thanks for providing the full output), the test is failing with this error:
BUILDSTDERR: libgomp: Thread creation failed: Resource temporarily unavailable
That suggests that OpenMP fails to start its threads for some reason. I would recommend reducing the number of processes relative to the number of cores, since the test runner will also fork out the test runs and some tests will start threads or further subprocesses.
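To illustrate how a crashed child could surface as the EOFError in the traceback: pickle.load raises EOFError when the stream it reads from is empty, which is what a parent would see if the forked test process died (for example after the libgomp failure) before writing any result. The sketch below is not the code from runtests.py; load_child_result is a hypothetical helper and the in-memory streams stand in for the result file.

```python
import io
import pickle


def load_child_result(stream):
    """Hypothetical helper: unpickle a child's result, or return None
    if the stream is empty because the child never wrote anything."""
    try:
        return pickle.load(stream)
    except EOFError:
        # An empty stream means no pickle was ever written.
        return None


# Stand-ins for the result file: one from a child that died early,
# one from a child that finished and dumped its result.
crashed_child = io.BytesIO(b"")
finished_child = io.BytesIO(pickle.dumps({"errors": 0}))

print(load_child_result(crashed_child))
print(load_child_result(finished_child))
```

Under this reading the EOFError is a symptom of the child failing, not of the parent reading too early.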
This page also suggests that passing OMP_NESTED=FALSE might limit the overall number of threads, but I can’t say whether that breaks any of the tests, as they might depend on starting new ones (I don’t know).
That explains a lot. I thought that “parallel” here was about the -j thing, and that’s why it acted like a red flag for me. Using -x run.parallel gets the job done.
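For anyone scripting this in a build, here is a minimal sketch of putting the two suggestions from this thread together: cap the -j value instead of using the raw CPU count, and exclude the OpenMP tests with -x run.parallel. The helper names and the default cap of 7 are assumptions made for illustration (7 only mirrors the -j7 experiment mentioned above, which has not been confirmed as safe); the -vv, -j and -x options are the ones used in this issue.

```python
import os


def capped_jobs(cpu_count, max_jobs=7):
    """Return a job count between 1 and max_jobs (the cap is an assumption)."""
    return max(1, min(cpu_count or 1, max_jobs))


def build_command(python="/usr/bin/python2", cpu_count=None,
                  max_jobs=7, exclude=("run.parallel",)):
    """Build the runtests.py command line as a list of arguments."""
    if cpu_count is None:
        cpu_count = os.cpu_count()  # may be None on some platforms
    cmd = [python, "runtests.py", "-vv",
           "-j%d" % capped_jobs(cpu_count, max_jobs)]
    for pattern in exclude:
        cmd += ["-x", pattern]
    return cmd


# On the 48-CPU builder from this report:
print(build_command(cpu_count=48))
# ['/usr/bin/python2', 'runtests.py', '-vv', '-j7', '-x', 'run.parallel']
```

The resulting list can be passed to subprocess.call as-is.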