[tune] set omp_num_threads = num_cpus per trial

What is the problem?

Having successfully set up a cluster with m5.4xlarge worker nodes, each with 16 CPUs, I noticed that my runs consume exactly 1 CPU on each worker node. Using resources_per_trial={"cpu": 16} does allocate one trial per worker as expected, but when I SSH into any single worker, only one core is utilized.

Running the same calculation on the head node without Ray does use multiple cores, so I’m pretty sure the problem is that Ray is somehow not using all the available compute.

Reproduction

import time
from ray import tune
import numpy as np
import ray

ray.init("172.31.29.113:6379")

def run_me(config):
    s = 3000
    for _ in range(100):
        # Computationally expensive step (uses more than one core when run sequentially on the head node)
        np.matmul(np.random.randn(s, s), np.random.randn(s, s))

        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})

Config

Python 3.6, Ray 0.8.7, m5.4xlarge worker nodes on AWS.

I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.

Top GitHub Comments

urimerhav commented, Aug 16, 2020

Alright!

I tested and this works! I did indeed use the cluster launcher and it works.

Am I correct that I now need to set OMP_NUM_THREADS to match the number of cores on my workers? I can live with that, but I think it’s a pretty important issue to resolve, as it’s only a matter of time until the env variable and the number of CPUs go out of sync.

As a temporary workaround, if anyone stumbles on this thread, here’s a solution that should be more stable, letting each worker node introspect on how many cores it has.

In the worker node startup script, add this line:

export OMP_NUM_THREADS=$(cat /proc/cpuinfo | grep processor | wc -l)

This is tested and working on Ubuntu.
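
For anyone who would rather compute the same value from Python (for example in a provisioning script), the shell one-liner above could be sketched roughly as follows. The helper names count_processors and detect_omp_threads are hypothetical, and the fallback to os.cpu_count() is an addition that is not part of the original workaround. Unlike the grep, this version only counts lines that start with "processor".

```python
import os


def count_processors(cpuinfo_text):
    """Count lines starting with 'processor', similar to `grep processor | wc -l`."""
    return sum(
        1 for line in cpuinfo_text.splitlines() if line.startswith("processor")
    )


def detect_omp_threads(cpuinfo_path="/proc/cpuinfo"):
    """Return a thread count as a string, suitable for OMP_NUM_THREADS."""
    try:
        with open(cpuinfo_path) as f:
            n = count_processors(f.read())
    except OSError:
        # File is missing or unreadable (e.g. not running on Linux)
        n = 0
    # Assumption: fall back to os.cpu_count() when nothing was counted
    return str(n or os.cpu_count() or 1)


if __name__ == "__main__":
    # Prints a line that can be pasted or eval'ed in a startup script
    print("export OMP_NUM_THREADS=" + detect_omp_threads())
```

The value still has to be exported before ray start runs on the node, as in the shell version.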

richardliaw commented, Aug 15, 2020

No problem!

Got it; as a diagnostic step, can you do the following?

If you’re starting Ray by hand:

# when you start ray for each node of your cluster:
export OMP_NUM_THREADS=16
ray start --...

If you’re using the cluster launcher, you’ll want to set this in the yaml:

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --head --num-redis-shards=10 --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --address=$RAY_HEAD_IP:6379

And then run:

import time
from ray import tune
import ray
ray.init("172.31.29.113:6379")
def run_me(config):
    import numpy as np
    s = 3000
    for iter in range(100):

        a = np.matmul(np.random.randn(s,s),np.random.randn(s,s))
        
        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})
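
To confirm that the exported variable actually reached the process running a trial, a small check along these lines could be called at the top of run_me and its result printed. This is only a sketch: thread_settings is a hypothetical helper, and the MKL_NUM_THREADS and OPENBLAS_NUM_THREADS entries are an assumption about which BLAS backend numpy may be linked against; only OMP_NUM_THREADS is discussed in this thread.

```python
import os

# OMP_NUM_THREADS is the variable from this thread; the other two are
# assumptions that only matter for MKL or OpenBLAS builds of numpy.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def thread_settings(environ=None):
    """Report the thread-related environment variables and the CPU count."""
    environ = os.environ if environ is None else environ
    report = {name: environ.get(name, "unset") for name in THREAD_ENV_VARS}
    report["cpu_count"] = os.cpu_count()
    return report


if __name__ == "__main__":
    print(thread_settings())
```

If OMP_NUM_THREADS shows as 1 or unset inside the trial while cpu_count is 16, the export most likely did not reach the Ray worker process.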
