Make HistGradientBoostingRegressor/Classifier use _openmp_effective_n_threads to set the default maximum number of threads to use
It has been reported several times that HistGradientBoostingClassifier can suffer from severe over-subscription slowdowns when running on a machine with many cores in a Docker container with a CFS CPU quota (e.g. the typical CI server). The latest case is apparently #15824.
I think the easy fix would be to use _openmp_effective_n_threads to set the default limit. We could actually bound this number to 8 or 16 because of diminishing returns beyond 8 threads for this algorithm, based on various benchmarks run for instance in #14306.
This might be a reason to expose a user-settable n_jobs (or n_threads?) parameter for this estimator. By default, if n_jobs=None, the number of threads would be set to the number of CPUs as detected by _openmp_effective_n_threads (which also takes the OMP_NUM_THREADS environment variable into account), with a default limit of 8 or 16 to avoid wasting CPUs that could be better used elsewhere.
If the user passes an explicit value for n_jobs or sets the OMP_NUM_THREADS env var, then this setting would be respected (to make it easy to run benchmarks).
Most users would leave the default value and should benefit from reasonable parallelism and never suffer from catastrophic oversubscription. Advanced users should have control over the exact number of threads they want, either via the n_jobs parameter or an environment variable.
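A minimal sketch of the proposed resolution logic (the effective_n_threads helper below only stands in for scikit-learn's private _openmp_effective_n_threads, and the cap of 8 is the tentative value discussed above, not an existing default):

```python
from joblib import cpu_count  # rough stand-in for CPU detection

MAX_DEFAULT_THREADS = 8  # tentative cap (8 or 16) discussed above


def effective_n_threads():
    # Stand-in for sklearn's private _openmp_effective_n_threads helper,
    # which also honours OMP_NUM_THREADS and cgroup CPU quotas.
    return cpu_count()


def resolve_n_threads(n_jobs=None):
    # Explicit user request: respect it. Default (None): bounded automatic value.
    if n_jobs is not None:
        return n_jobs
    return min(effective_n_threads(), MAX_DEFAULT_THREADS)
```

With something like this, n_jobs=None never exceeds the cap, while an explicit n_jobs or OMP_NUM_THREADS keeps full control.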
WDYT?
Could be wrong, but IIRC os.environ doesn't work as expected: it has no effect on packages that have already been imported, or something like that. Maybe try with the %env magic of the notebook, or switch to a script and run OMP_NUM_THREADS=1 python the_hist_gbdt_code.py and you should not see more than 1 CPU being used.
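For completeness, a minimal sketch of the same idea from a Python script, assuming the variable is set before anything OpenMP-related gets imported:

```python
import os

# Must happen before the first import of the compiled extensions that read
# OMP_NUM_THREADS; setting it later in the same process has no effect.
os.environ["OMP_NUM_THREADS"] = "1"

# On scikit-learn versions where the estimator is still experimental,
# `from sklearn.experimental import enable_hist_gradient_boosting` is needed first.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()  # should now train on a single thread
```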
@ogrisel A generalizable default value for the maximum number of threads is difficult to put in practice, for the following (detailed) reasons:
I am sometimes using 128-thread, energy-efficient ARM servers and they scale tasks surprisingly well because the cores are so slow compared to x86_64 (those CPUs are basically the equivalent of an Intel Atom in performance).
This holds as long as the CPU is as “smart” as the newer CPUs… which is not always the case. Sometimes on slower CPUs, the performance degradation is bigger than it should be.
This is the case, for instance, of xgboost exact vs xgboost hist.
xgboost exact is “slow” and well parallelized (in the code), hence scales very well.
xgboost hist is “fast” and well parallelized (in the code), but each thread gets very little work, hence it scales very poorly.
LightGBM is “fast” and parallelized in another way (in the code), hence scales okay-ish. For instance, I can see a 10000%+ speedup vs 1 thread (roughly 100x) on a 288-thread server on 1-billion-observation data. The resulting 34.7% CPU efficiency (about 100 / 288) seems poor though (but each 1% may be worth minutes / hours).
Most of machine learning is still done on “small” datasets, but training finishes so quickly that it is difficult to perceive poor efficiency unless we notice something obviously wrong (like extremely poor efficiency).
Usually the NUMA penalty + remote RAM access penalty causes poor efficiency, but it also makes it difficult to choose which threads to run a program on: should you use 16 threads of NUMA node 1 on a 32-hyperthread server, or 8 physical cores from each NUMA node?
There are also server CPUs with only 4 physical cores, while having 4 of them on the same motherboard (4 NUMA nodes): those edge cases are difficult to predict.
This is why people always use hyperthreads while rendering on CPU: not only is it a heavy compute task, but the tasks are so long that the efficiency curve is nearly perfectly linear, hence hyperthreads will nearly always provide a performance boost (example: Cinebench R15 Extreme).
But for GBDT it depends, as the tasks are usually way too small per thread.
Note that virtual machines may also be misconfigured by users (such as setting the number of sockets (number of NUMA nodes) equal to the number of hyperthreaded cores on the host). Such misconfiguration happens more often in practice than one would imagine (I have myself seen someone configure 448 sockets on a VM from a 448-thread host, and it did not end well for the VM on compute tasks).
And then, we also have the notion of “virtual core” which could mean anything nowadays.
The marginal gain becomes negligible very quickly, because speedup divides the single-threaded time by the parallel time of the entire training. For instance, LightGBM scales on my 72-thread server up to 69 threads on the Bosch dataset, but is going from 5.30s (36 threads) to 4.51s (69 threads) a good improvement vs 117.09s at 1 thread?
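To make the numbers concrete, here is a small worked calculation of the speedup and parallel efficiency implied by the timings above (parallel efficiency = speedup / number of threads):

```python
# Bosch dataset timings quoted above: 117.09 s at 1 thread,
# 5.30 s at 36 threads, 4.51 s at 69 threads.
t1 = 117.09
for n_threads, t_parallel in [(36, 5.30), (69, 4.51)]:
    speedup = t1 / t_parallel
    efficiency = speedup / n_threads
    print(f"{n_threads} threads: {speedup:.1f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
# 36 threads: 22.1x speedup, 61% parallel efficiency
# 69 threads: 26.0x speedup, 38% parallel efficiency
```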
You may also be limited by RAM bandwidth: if the program wants to use more RAM bandwidth than is available, it will slow down because CPU threads will wait longer than expected past a certain number of threads (Intel VTune is recommended to investigate).
CPU frequency scaling is another factor. We might blame Intel/AMD for this (Turbo Boost / Turbo Core), but it at least allows squeezing more performance out of a low number of threads. The architecture and the instructions used also matter, as they may trigger different turbo frequency tables.
Example of difference on Intel Xeon Gold 5120:
1 thread: 3.2 GHz (turbo base), 3.1 GHz (turbo AVX-256), 2.9 GHz (turbo AVX-512)
14 threads: 2.6 GHz (turbo base), 2.2 GHz (turbo AVX-256), 1.6 GHz (turbo AVX-512)
Therefore, when running 14 threads vs 1 thread, a single busy thread is +23% / +41% / +81% faster (turbo base / AVX-256 / AVX-512) than each of the 14 busy threads.
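The percentages above follow directly from the frequency table; a quick check:

```python
# Turbo frequencies (GHz) from the Xeon Gold 5120 table above.
one_thread = {"base": 3.2, "AVX-256": 3.1, "AVX-512": 2.9}
fourteen_threads = {"base": 2.6, "AVX-256": 2.2, "AVX-512": 1.6}

for kind in one_thread:
    gain = one_thread[kind] / fourteen_threads[kind] - 1
    print(f"{kind}: a single busy thread is {gain:+.0%} faster "
          f"than each of 14 busy threads")
# base: +23%, AVX-256: +41%, AVX-512: +81%
```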
There are usually 3 reasons for this: power limits, thermal limits, and voltage regulator module (VRM) limits (and three other rare cases: not feeding enough current (amps) to the CPU, not feeding enough voltage to the CPU, or some mysterious behaviour of the motherboard, which is common in laptops).
On consumer hardware, the typical limitation is power (desktops) and thermal (laptops). Power limits are usually impossible to override without motherboard support + CPU overclocking support. Because power/thermal limits are difficult to hit with few threads, they may exacerbate poor efficiency on consumer hardware.
Hence usually consumer hardware scaling != server hardware scaling.
And then, we have different CPUs with different limitations (and sometimes those limitations are not even related to the CPUs), which usually leads to consumer hardware 1 != consumer hardware 2 (even if they are identical CPUs).
Example: repeated Cinebench R15 benchmarks on laptops seem to provide great CPU results at the beginning, but repeat it 30 times and the results are usually way poorer than expected (unless you are not thermal/power/VRM limited).
Another example: two (2) computers with an Intel Core i7-10510U may seem identical. Even assuming no thermal/power/VRM limitation, it is possible one is not hitting its 25W target (TDP-up, 2.3 GHz base) and the other not hitting its 10W target (TDP-down, 800 MHz base). This is actually the reality in practice. And yet, they are both supposed to turbo to 4.9 GHz!
This changes the number of busy threads, hence impacts the Turbo Boost / Turbo Core thread count, and therefore may downclock the CPU further, lowering the expected efficiency.
Example: https://github.com/Laurae2/ml-perf/issues/6
The compiler also matters: for instance, I have seen differences ranging from zero difference to 400% faster just by changing the compiler (and also using different compilation flags).
Example: +400% performance boost by switching from LLVM to the Intel compiler for AVX-512 specific structure function kernels.
Ambient conditions matter too: I have noticed in very hot and humid areas (as sometimes happens in the USA) that performance degrades significantly for no apparent reason when using multiple threads (whether it is a server or a desktop does not matter, they are all affected), hurting efficiency badly.
This may also be due to the thermal paste requiring a change (hitting thermal limits way earlier than expected, hence poor efficiency).
Usually, putting a sensible “default” would depend on the hardware, the software, the dataset, and the environment. It is difficult to provide a value which fits every use case, as the possible combinations are virtually unlimited (each case has its own “best” value).
It is important to tell users there is no “best” value out of the box.
However, if a default is really mandatory, a reasonable default thread count is usually the lowest number of cores an “end user” may have per true physical CPU socket, which is currently 4 (8 if allowing hyperthreads). You may have a 16-thread server which is actually a 4-NUMA-node server and would suffer heavily from using 16 threads without NUMA optimizations (example: Intel Xeon Platinum 8156).
For handling CPU quotas, it is usually good practice to expose an environment variable that software can read in order to adjust its behaviour (to optimize CPU quota usage); a small sketch of this is shown below. However, I am not sure how this would be done for the “general public” (in business, it is common practice to agree on the name and usage of such an environment variable, e.g. to indicate burst usage only).
Or use OMP_THREAD_LIMIT if a temporary limit is required.
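A minimal sketch of that practice, assuming a hypothetical site-specific variable named GBDT_MAX_THREADS (the variable name and helper below are illustrative, not an existing API):

```python
import os


def quota_aware_n_threads(default_n_threads):
    """Cap the thread count via a hypothetical GBDT_MAX_THREADS variable.

    Falls back to the provided default when the variable is unset, so a
    container running under a CFS CPU quota can simply export
    GBDT_MAX_THREADS=2 without touching the code.
    """
    limit = os.environ.get("GBDT_MAX_THREADS")
    return min(default_n_threads, int(limit)) if limit else default_n_threads
```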