Why does Dynesty multiprocessing with ThreadPoolExecutor not use all cores?
I am trying to run a simple example with nested sampling and dynesty. I installed dynesty from GitHub: https://github.com/joshspeagle/dynesty
Computer Setup
- OS: Mac OSX El Capitan (10.11.6)
- CPU: 8 cores
- RAM: 16.0 GB
- gcc: 4.8.5 via conda install gcc
Problem Setup
I ran the code below (simulate data, set up the prior/likelihood, submit to dynesty). To set up multiprocessing, I used multiprocessing.Pool, concurrent.futures.ProcessPoolExecutor, and concurrent.futures.ThreadPoolExecutor. I tried the code in Jupyter Lab, ipython, and as a script (python run_dynesty_test.py).
Problem: The entire script runs fine, but dynesty/python starts out using all of the cores, then slowly uses fewer and fewer of them. Finally, after about 5 minutes, dynesty/python uses almost exactly 1 core.
Evidence: htop starts reading 780%, then 550%, then 350%, then 100% CPU, and it stays at 100% CPU for the rest of the run, except that once every other minute htop will read ~250-300% CPU.
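One way to quantify this beyond watching htop is to sample the sampler process's CPU usage over time. A minimal sketch, assuming the third-party psutil package (not part of the original report); `sampler_pid` is a hypothetical placeholder for the PID of the python process running dynesty:

```python
# Sketch: record the CPU usage of the running sampler process over time.
# `sampler_pid` is a hypothetical placeholder for the dynesty process's PID.
import psutil

proc = psutil.Process(sampler_pid)
for _ in range(30):
    # Like htop, this can exceed 100% when multiple cores are busy.
    print(proc.cpu_percent(interval=10))
```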
Question
Why is dynesty/ThreadPoolExecutor/python/etc not using all of my cores all of the time?
Code Snippet Involving Dynesty and Multiprocessing
```python
with ThreadPoolExecutor(max_workers=cpu_count() - 1) as executor:
    sampler = dynesty.DynamicNestedSampler(
        loglike,
        prior,
        ndim=ndims,
        npdim=ndims,
        bound='multi',
        sample='unif',
        pool=executor,
        queue_size=cpu_count())
    sampler.run_nested(nlive_init=100, nlive_batch=100)
    res = sampler.results
```
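For comparison, the report also mentions trying process-based pools. Since a pure-Python likelihood holds the GIL, a ThreadPoolExecutor cannot run likelihood evaluations truly in parallel; a process pool avoids that. A minimal sketch of the same setup with multiprocessing.Pool (a variant for comparison, not the code from the original report; it assumes loglike, prior, and ndims are defined as in the full script below):

```python
# Sketch: same sampler, but with a process-based pool so that pure-Python
# likelihood calls are not serialized by the GIL. Assumes loglike, prior,
# and ndims are defined as in the full script below.
from multiprocessing import Pool, cpu_count
import dynesty

if __name__ == '__main__':
    with Pool(cpu_count() - 1) as pool:
        sampler = dynesty.DynamicNestedSampler(
            loglike,
            prior,
            ndim=ndims,
            bound='multi',
            sample='unif',
            pool=pool,
            queue_size=cpu_count() - 1)
        sampler.run_nested(nlive_init=100, nlive_batch=100)
        res = sampler.results
```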
Full Script to Set Up the Test
```python
from __future__ import absolute_import, unicode_literals, print_function

from multiprocessing import set_start_method
set_start_method('forkserver')

import os
import joblib  # needed to save the results at the end
import numpy as np
import dynesty

from multiprocessing import cpu_count
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# plotting
from matplotlib import pyplot as plt

if not os.path.exists("chains"):
    os.mkdir("chains")

def model(cube):
    """1D Gaussian whose center, width, and height are taken from `cube`."""
    center, width, height = cube[0], cube[1], cube[2]
    return lambda y: height * np.exp(-0.5 * ((center - y) / width)**2)

np.random.seed(42)

# true parameters of the two Gaussian components
param0a, param0b = -0.5, 0.5
param1a, param1b = 0.1, 0.1
param2a, param2b = 0.8, 0.8

yunc = 0.1
nPts = int(100)
nThPts = int(1e3)

xmin, xmax = -1, 1
dx = 0.1 * (xmax - xmin)

yuncs = np.random.normal(yunc, 1e-2 * yunc, nPts)
thdata = np.linspace(xmin - dx, xmax + dx, nThPts)
xdata = np.linspace(xmin, xmax, nPts)

ydata = model([param0a, param1a, param2a])(xdata) \
      + model([param0b, param1b, param2b])(xdata)
yerr = np.random.normal(0, yuncs, nPts)
zdata = ydata + yerr

plt.figure(figsize=(10, 10))
plt.plot(thdata, model([param0a, param1a, param2a])(thdata)
               + model([param0b, param1b, param2b])(thdata))
plt.errorbar(xdata, zdata, yunc * np.ones(zdata.size), fmt='o')
plt.show()

def prior(cube):
    """Map the unit cube to the prior ranges."""
    cube[0] = cube[0]*2 - 1  # center in [-1, 1]
    cube[1] = cube[1]*2      # width in [0, 2]
    cube[2] = cube[2]*2      # height in [0, 2]
    return cube

def loglike(cube):
    """Gaussian (chi-squared) log-likelihood of the simulated data."""
    modelNow = model(cube)(xdata)
    return -0.5 * ((modelNow - ydata)**2. / yuncs**2.).sum()

if __name__ == '__main__':
    ndims = 3

    with ThreadPoolExecutor(max_workers=cpu_count() - 1) as executor:
        sampler = dynesty.DynamicNestedSampler(
            loglike,
            prior,
            ndim=ndims,
            npdim=ndims,
            bound='multi',
            sample='unif',
            pool=executor,
            queue_size=cpu_count())
        sampler.run_nested(nlive_init=100, nlive_batch=100)
        res = sampler.results

    from dynesty import plotting as dyplot

    # evidence check
    fig, axes = dyplot.runplot(res, color='red',
                               truth_color='black',
                               logplot=True)
    fig.tight_layout()

    joblib.dump(res, 'dynesty_double_gaussian_test_results.joblib.save')
```
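To pick the saved results back up in a later session, a small usage sketch (assuming the file written by the script above):

```python
# Sketch: reload the results saved by the script above.
import joblib

res = joblib.load('dynesty_double_gaussian_test_results.joblib.save')
res.summary()  # print a short summary of the run (iterations, calls, logz)
```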
Comments
Yep, can confirm that dynesty is using all the cores on my machine. The variation in behavior is due to overhead in between allocating batches (which requires operations done in serial), but during a single batch I can get up to full usage of the cores. My guess is you're seeing ~50% usage overall just because evaluating the likelihood is essentially instantaneous, so the overhead takes up a significant portion of the runtime.

The first thing to say is that I don't know what virtual cores are. But I'm using:
- OSX 10.14 (Mojave), MacBook Pro (Mid 2014), 2.5 GHz Intel Core i7, gcc 4.8.5

multiprocessing.cpu_count() registers 8 cores; I think that is 4 cores with 2 threads each, but this is where I get confused. I also tried it on 2 Linux servers that I have access to:

- RHEL6 x86_64, kernel 2.6.32-696.23.1.el6.x86_64, 12 and 24 CPUs, gcc 4.4.7
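For what it's worth, the logical/physical distinction can be checked directly. A minimal sketch, assuming the third-party psutil package (not something used in this thread):

```python
# Sketch: distinguish logical CPUs (hyperthreads) from physical cores.
# Uses the third-party `psutil` package, which is an assumption here.
import multiprocessing
import psutil

print(multiprocessing.cpu_count())      # logical CPUs, e.g. 8
print(psutil.cpu_count(logical=False))  # physical cores, e.g. 4
```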
TL;DR In all 3 cases (Mac + 2 Linux), dynesty seems to use almost exactly half of my CPUs (50% ± 5%). It oscillates so consistently around 50% that it feels as though it were programmed to use exactly 50% of full CPU usage, with room for statistical uncertainty. Sometimes it grows to 70% for < 1 minute (~30 seconds), but falls back down.

With other software (i.e. tensorflow), Stack Overflow-level information informed me that "some problems are not worth all of your CPUs and tensorflow decides that on the fly". Is it possible that the multiprocessing library is not requesting all of the CPUs? At the same time, you said that you tried my code above and it used all of your CPUs.
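If the maintainer's diagnosis is right (the likelihood is essentially instantaneous, so serial overhead dominates), then making each likelihood call more expensive should push CPU usage back up. A minimal sketch of that experiment, an assumption about how to test the diagnosis rather than something from the thread:

```python
# Sketch: add CPU-bound cost to each likelihood call so that parallel work
# dominates the serial between-batch overhead. NumPy's linear-algebra
# routines generally release the GIL, so this can scale even under threads.
# `loglike` is the likelihood defined in the script above.
import numpy as np

_busywork = np.random.rand(200, 200)  # hypothetical extra work per call

def slow_loglike(cube):
    np.linalg.svd(_busywork)  # burn CPU; result is deliberately discarded
    return loglike(cube)
```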