thread.error: can't start new thread on UnivaGridEngine
Toil failed when running in a UnivaGridEngine cluster environment.
Environments
Red Hat Enterprise Linux release 6.9 (Santiago)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Error details
When I ran the quick start from the official documentation with python helloWorld.py bucket, it returned an error.
Traceback (most recent call last):
  File "sort.py", line 242, in <module>
    main()
  File "sort.py", line 236, in main
    memory=sortMemory))
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/site-packages/toil/common.py", line 762, in start
    self._batchSystem = self.createBatchSystem(self.config)
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/site-packages/toil/common.py", line 911, in createBatchSystem
    return batchSystemClass(**kwargs)
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/site-packages/toil/batchSystems/abstractGridEngineBatchSystem.py", line 283, in __init__
    super(AbstractGridEngineBatchSystem, self).__init__(config, maxCores, maxMemory, maxDisk)
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/site-packages/toil/batchSystems/abstractBatchSystem.py", line 310, in __init__
    config, config.maxLocalJobs, maxMemory, maxDisk)
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/site-packages/toil/batchSystems/singleMachine.py", line 134, in __init__
    worker.start()
  File "/home/6br/anaconda3/envs/toil/lib/python2.7/threading.py", line 736, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Shared cluster machines usually limit the maximum memory available to a process. Toil forked worker threads until their number exceeded what the memory limit allows, so it could not start a new thread. I would like to know why so many worker threads are forked and how to change the number of worker threads.
The output of the top command follows. The number of threads exceeds the number of cores on our machine (24).
9585 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:02.07 python sort.py bucket3
9689 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9688 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9687 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9686 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9685 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9684 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9683 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9682 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9681 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9680 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9679 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9678 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9677 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9676 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9675 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9674 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9673 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9672 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9671 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9670 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9669 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9668 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9667 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9666 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9665 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9664 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9663 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9662 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9661 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9660 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9659 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9658 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9657 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
9656 6br 20 0 5426M 42328 4836 S 0.0 0.0 0:00.00 python sort.py bucket3
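One way to check the memory-limit explanation is to print the limits the scheduler imposes on the process before Toil starts. The sketch below uses only Python's standard resource and threading modules and assumes a Linux host; describe_limit() and report() are hypothetical helper names, not part of Toil. Each thread reserves address space for its stack, so a low soft limit on RLIMIT_AS (virtual memory) or RLIMIT_NPROC (processes and threads per user) is a plausible, though unconfirmed, cause of can't start new thread.

```python
import resource
import threading


def describe_limit(name):
    """Return (soft, hard) for a resource limit, mapping RLIM_INFINITY to 'unlimited'."""
    soft, hard = resource.getrlimit(getattr(resource, name))

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else value

    return fmt(soft), fmt(hard)


def report():
    """Collect one line per limit that can block thread creation."""
    lines = []
    for name in ("RLIMIT_AS", "RLIMIT_NPROC", "RLIMIT_STACK"):
        # Not every platform defines every limit, so skip the missing ones.
        if hasattr(resource, name):
            soft, hard = describe_limit(name)
            lines.append("%s soft=%s hard=%s" % (name, soft, hard))
    # Number of Python threads alive in this interpreter right now.
    lines.append("live Python threads: %d" % threading.active_count())
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

Running this inside the same kind of cluster session that launches Toil shows whether the virtual memory limit is small compared with the 5426M VIRT figure reported above.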
For now, I can run Toil with --maxCores 1 (it still forks 12 worker threads, so this is not a fundamental solution).
--batchSystem gridEngine ignores the maxCores option
When I run python helloWorld.py bucket --batchSystem gridEngine --maxCores 1, the same error, thread.error: can't start new thread, occurs. Seemingly, whether or not I use the maxCores option, worker threads are forked until their number reaches the maximum number of threads, which raises thread.error: can't start new thread.
Currently, I use Toil with --debugWorker and can run jobs on UGE successfully, but this is a workaround rather than a natural way to use it.
Issue is synchronized with this Jira Story. Issue Number: TOIL-319
Top GitHub Comments
@6br PR #2347 should fix this issue if you specify --maxLocalJobs as needed instead of --maxCores. Let me know if this does not fix the problem.
This looks similar to #2323 where --maxLocalJobs isn't doing what it ought to. Should have a fix soon.
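As a rough illustration of the suggestion above, the command line from the report can be rebuilt with --maxLocalJobs in place of --maxCores. The helper build_toil_command() is a hypothetical name used only for this sketch, and the value 1 is an arbitrary example; whether the flag takes effect depends on running a Toil version that includes the fix mentioned in the comments.

```python
def build_toil_command(script, job_store, batch_system="gridEngine", max_local_jobs=1):
    """Build the argv list for a Toil script, capping local jobs with --maxLocalJobs."""
    return [
        "python",
        script,
        job_store,
        "--batchSystem", batch_system,
        # Used instead of --maxCores, as suggested in the comment above.
        "--maxLocalJobs", str(max_local_jobs),
    ]


if __name__ == "__main__":
    # Print the command; it could be passed to subprocess.call() to run it.
    print(" ".join(build_toil_command("helloWorld.py", "bucket")))
```

Building the argument list in one place makes it easy to try different caps from a submission script without retyping the whole command.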