[tune] set omp_num_threads = num_cpus per trial
What is the problem?
Having successfully set up a cluster with m5.4xlarge worker nodes, each with 16 CPUs, I noticed that my runs use exactly one CPU on each worker node. Setting resources_per_trial={"cpu": 16} does allocate one trial per worker as expected, but when I SSH into any single worker, only one core is busy.
Running the same calculation on the head node without Ray does use multiple cores, so I’m pretty sure the problem is that Ray is somehow not using all the available compute.
Reproduction
import time
import numpy as np
import ray
from ray import tune

ray.init("172.31.29.113:6379")

def run_me(config):
    s = 3000
    for i in range(100):
        # Computationally expensive step (uses more than one core when run sequentially on the head node)
        np.matmul(np.random.randn(s, s), np.random.randn(s, s))
        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})
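One way to check what a trial process actually sees is to print the thread-related environment variables next to the machine's CPU count. The sketch below is a minimal, hypothetical helper (thread_env_report is an illustrative name, not part of Ray or NumPy). It assumes NumPy is linked against a BLAS or OpenMP runtime that honors one of these variables, which depends on the build.

```python
import os

def thread_env_report(environ=None):
    """Summarize settings that commonly cap BLAS/OpenMP thread counts."""
    # Allow a custom mapping to be passed in so the helper is easy to test.
    environ = os.environ if environ is None else environ
    # Variables commonly read by OpenMP, MKL, and OpenBLAS (assumed relevant here).
    names = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
    report = {name: environ.get(name) for name in names}
    # Number of logical CPUs the operating system reports for this machine.
    report["cpu_count"] = os.cpu_count()
    return report
```

Calling print(thread_env_report()) at the top of run_me, and again in a plain Python session on the head node, should show whether the two environments differ.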
Config
Python 3.6, Ray 0.8.7, m5.4xlarge worker nodes on AWS.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Alright!
I tested it and this works! I did indeed use the cluster launcher.
Am I correct that I now need to set OMP_NUM_THREADS to match the number of cores on my workers? I can live with that, but I think it’s a pretty important issue to resolve, as it’s only a matter of time until the env variable and the number of CPUs go out of sync.
As a temporary workaround, if anyone stumbles on this thread, here’s a solution that should be more stable: it lets each worker node detect how many cores it has.
In the worker node startup script, add this line:
export OMP_NUM_THREADS=$(cat /proc/cpuinfo | grep processor | wc -l)
This is tested and working on Ubuntu.
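For anyone who prefers to compute the value from a Python launcher script, the same introspection can be sketched as below. This is a rough equivalent under stated assumptions, not something from the Ray codebase: count_processors and omp_env_for_node are hypothetical names, the parser counts only lines whose key is exactly "processor" (slightly stricter than the grep above), and it falls back to os.cpu_count() when /proc/cpuinfo is unavailable.

```python
import os

def count_processors(cpuinfo_text):
    """Count entries whose key is 'processor' in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, _, _ = line.partition(":")
        if key.strip() == "processor":
            count += 1
    return count

def omp_env_for_node(cpuinfo_path="/proc/cpuinfo"):
    """Build an OMP_NUM_THREADS setting from this node's core count."""
    try:
        with open(cpuinfo_path) as f:
            n = count_processors(f.read())
    except OSError:
        n = 0  # File missing or unreadable (for example, not on Linux).
    if n == 0:
        # Fallback: ask the standard library; default to 1 if unknown.
        n = os.cpu_count() or 1
    return {"OMP_NUM_THREADS": str(n)}
```

The returned mapping would need to be placed in the environment of the worker process before it starts, since OpenMP runtimes typically read OMP_NUM_THREADS when they initialize.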
No problem!
Got it; as a diagnostic step, can you do the following?
If you’re starting Ray by hand:
If you’re using the cluster launcher, you’ll want to set this in the YAML:
And then run: