
[tune] set omp_num_threads = num_cpus per trial

See original GitHub issue

What is the problem?

Having successfully set up a cluster with m5.4xlarge worker nodes, each with 16 CPUs, I noticed that my runs actually consume exactly 1 CPU on the worker nodes. Using resources_per_trial={"cpu": 16} does allocate one trial per worker as expected, but SSHing into any single worker shows that only one core gets utilized.

Running the same calculation on the head node without Ray does use multiple cores, so I’m pretty sure the problem is that Ray somehow decided not to use all of the available compute.

Reproduction

import time
from ray import tune
import numpy as np
import ray

ray.init("172.31.29.113:6379")

def run_me(config):
    s = 3000
    for iter in range(100):
        # computationally expensive step (utilizes more than a single core
        # when run sequentially on the head node)
        np.matmul(np.random.randn(s, s), np.random.randn(s, s))
        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})
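
For anyone who wants to confirm what a trial is actually allowed to use, here is a minimal diagnostic sketch (mine, not part of the original report): it prints the threading-related environment from inside the trainable. If Ray has pinned OMP_NUM_THREADS to 1 in the worker processes, NumPy’s BLAS backend will run single-threaded no matter what resources_per_trial requests.

import os
from ray import tune

def inspect_threads(config):
    # OMP_NUM_THREADS controls how many threads OpenMP-backed BLAS libraries use;
    # os.cpu_count() is the node's total core count, independent of what Ray grants.
    print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))
    print("node cpu count  =", os.cpu_count())
    tune.report(dummy_metric=1)

# assumes ray.init("172.31.29.113:6379") has already connected to the cluster,
# as in the reproduction above
tune.run(inspect_threads, num_samples=1, resources_per_trial={"cpu": 16})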

Config

Python 3.6, Ray 0.8.7, m5.4xlarge worker nodes on AWS

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
urimerhav commented, Aug 16, 2020

Alright!

I tested this and it works! I did indeed use the cluster launcher.

Am I correct that I now need to set OMP_NUM_THREADS to match the number of cores on my workers? I can live with that, but I think it’s a pretty important issue to resolve, as it’s only a matter of time until the environment variable and the number of CPUs go out of sync.

As a temporary workaround, if anyone stumbles on this thread, here’s a solution that should be more stable, since it lets each worker node introspect how many cores it has.

On the worker node startup script, add this line: export OMP_NUM_THREADS=$(cat /proc/cpuinfo | grep processor | wc -l)

This is tested and working on Ubuntu.
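
Along the same lines, the thread count could also be derived from inside the trainable itself. This is only a sketch of the idea (not from the thread), and it assumes NumPy has not already been imported in the worker process, since BLAS backends typically read OMP_NUM_THREADS when they are first loaded.

import os
from ray import tune

def run_me(config):
    # Hypothetical per-trial variant (my sketch, not from the thread): export the
    # variable before NumPy and its BLAS backend are imported in this worker process.
    os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())
    import numpy as np

    s = 3000
    for _ in range(100):
        np.matmul(np.random.randn(s, s), np.random.randn(s, s))
        tune.report(hello="world", ray="tune")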

2 reactions
richardliaw commented, Aug 15, 2020

No problem!

Got it; as a diagnostic step, can you do the following? If you’re starting Ray by hand:

# when you start ray for each node of your cluster:
export OMP_NUM_THREADS=16
ray start --...

If you’re using the cluster launcher, you’ll want to set this in the yaml:

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --head --num-redis-shards=10 --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; export OMP_NUM_THREADS=16; ray start --address=$RAY_HEAD_IP:6379

And then run:

import time
from ray import tune
import ray

ray.init("172.31.29.113:6379")

def run_me(config):
    import numpy as np
    s = 3000
    for iter in range(100):
        a = np.matmul(np.random.randn(s, s), np.random.randn(s, s))
        tune.report(hello="world", ray="tune")

analysis = tune.run(run_me, num_samples=40, resources_per_trial={"cpu": 16})
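
To double-check that the variable actually reaches the Ray worker processes, a quick verification (my addition, not part of the original comment) is to have a remote task report back what each worker sees:

import os
import ray

ray.init("172.31.29.113:6379")

@ray.remote
def omp_setting():
    # Reports what a Ray worker process actually sees; should come back as "16"
    # once the start commands above export OMP_NUM_THREADS.
    return os.environ.get("OMP_NUM_THREADS")

print(ray.get([omp_setting.remote() for _ in range(8)]))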