Make HistGradientBoostingRegressor/Classifier use _openmp_effective_n_threads to set the default maximum number of threads to use
It has been reported several times that HistGradientBoostingClassifier can suffer from severe over-subscription slowdowns when running on a machine with many cores in a Docker container with a CFS CPU quota (e.g. the typical CI server). The latest case is apparently #15824.
I think the easy fix would be to use _openmp_effective_n_threads to set the default limit. We could actually bound this number to 8 or 16 because of diminishing returns beyond 8 threads for this algorithm, based on various benchmarks run for instance in #14306.
This might be a reason to expose a user-settable n_jobs (or n_threads?) parameter for this estimator. By default, if n_jobs=None, the number of threads would be set to the number of CPUs as detected by _openmp_effective_n_threads (which also takes the OMP_NUM_THREADS environment variable into account), with a default limit of 8 or 16 to avoid wasting CPUs that could be better used elsewhere.
If the user passes an explicit value for n_jobs or sets the OMP_NUM_THREADS env var, then this setting would be respected (to make it easy to run benchmarks).
Most users would leave the default value and should benefit from reasonable parallelism and never suffer from catastrophic oversubscription. Advanced users should have control over the exact number of threads they want, either via the n_jobs parameter or an environment variable.
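A minimal sketch of the proposed resolution logic (the effective_n_threads helper below only stands in for scikit-learn's private _openmp_effective_n_threads, and the cap of 8 is the tentative value discussed above, not an existing default):

```python
from joblib import cpu_count  # rough stand-in for CPU detection

MAX_DEFAULT_THREADS = 8  # tentative cap (8 or 16) discussed above


def effective_n_threads():
    # Stand-in for sklearn's private _openmp_effective_n_threads helper,
    # which also honours OMP_NUM_THREADS and cgroup CPU quotas.
    return cpu_count()


def resolve_n_threads(n_jobs=None):
    # Explicit user request: respect it. Default (None): bounded automatic value.
    if n_jobs is not None:
        return n_jobs
    return min(effective_n_threads(), MAX_DEFAULT_THREADS)
```

With something like this, n_jobs=None never exceeds the cap, while an explicit n_jobs or OMP_NUM_THREADS keeps full control.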
WDYT?
Could be wrong, but IIRC os.environ doesn't work as expected: it has no effect on packages that have already been imported, or something like that. Maybe try with the %env magic of the notebook, or switch to a script and run OMP_NUM_THREADS=1 python the_hist_gbdt_code.py and you should not see more than 1 CPU being used.
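For completeness, a minimal sketch of the same idea from a Python script, assuming the variable is set before anything OpenMP-related gets imported:

```python
import os

# Must happen before the first import of the compiled extensions that read
# OMP_NUM_THREADS; setting it later in the same process has no effect.
os.environ["OMP_NUM_THREADS"] = "1"

# On scikit-learn versions where the estimator is still experimental,
# `from sklearn.experimental import enable_hist_gradient_boosting` is needed first.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()  # should now train on a single thread
```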
@ogrisel A generalizable default value for the maximum number of threads is difficult to put in practice, for the following (detailed) reasons:
I am sometimes using 128-thread, energy-efficient ARM servers and they scale tasks surprisingly well because the cores are so slow compared to x86_64 (those CPUs are basically the equivalent of an Intel Atom in performance).
This holds as long as the CPU is as “smart” as the newer CPUs… which is not always the case. Sometimes on slower CPUs, the performance degradation is bigger than it should be.
This is the case, for instance, of xgboost exact vs xgboost hist.
xgboost exact is “slow” and well parallelized (in the code), hence scales very well.
xgboost hist is “fast” and well parallelized (in the code), but each thread gets very little work, hence it scales very poorly.
LightGBM is “fast” and parallelized in another way (in the code), hence scales okay-ish. For instance, I can see a 10000%+ speedup vs 1 thread (roughly 100x) on a 288-thread server on 1-billion-observation data. The resulting 34.7% CPU efficiency (about 100 / 288) seems poor though (but each 1% may be worth minutes / hours).
Most of machine learning is still done on “small” datasets, but training finishes so quickly that it is difficult to perceive poor efficiency unless we notice something obviously wrong (like extremely poor efficiency).
Usually the NUMA penalty + remote RAM access penalty causes poor efficiency, but it also makes it difficult to choose which threads to run a program on: should you use 16 threads of NUMA node 1 on a 32-hyperthread server, or 8 physical cores from each NUMA node?
There are also server CPUs with only 4 physical cores, while having 4 of them on the same motherboard (4 NUMA nodes): those edge cases are difficult to predict.
This is why people always use hyperthreads while rendering on CPU: not only is it a heavy compute task, but the tasks are so long that the efficiency curve is nearly perfectly linear, hence hyperthreads will nearly always provide a performance boost (example: Cinebench R15 Extreme).
But for GBDT it depends, as the tasks are usually way too small per thread.
Note that virtual machines may also be misconfigured by users (such as setting the number of sockets (number of NUMA nodes) equal to the number of hyperthreaded cores on the host). Such misconfiguration happens more often in practice than one would imagine (I have myself seen someone configure 448 sockets on a VM from a 448-thread host, and it did not end well for the VM on compute tasks).
And then, we also have the notion of “virtual core” which could mean anything nowadays.
The marginal gain becomes negligible very quickly, because speedup divides the single-threaded time by the parallel time of the entire training. For instance, LightGBM scales on my 72-thread server up to 69 threads on the Bosch dataset, but is going from 5.30s (36 threads) to 4.51s (69 threads) a good improvement vs 117.09s at 1 thread?
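To make the numbers concrete, here is a small worked calculation of the speedup and parallel efficiency implied by the timings above (parallel efficiency = speedup / number of threads):

```python
# Bosch dataset timings quoted above: 117.09 s at 1 thread,
# 5.30 s at 36 threads, 4.51 s at 69 threads.
t1 = 117.09
for n_threads, t_parallel in [(36, 5.30), (69, 4.51)]:
    speedup = t1 / t_parallel
    efficiency = speedup / n_threads
    print(f"{n_threads} threads: {speedup:.1f}x speedup, "
          f"{efficiency:.0%} parallel efficiency")
# 36 threads: 22.1x speedup, 61% parallel efficiency
# 69 threads: 26.0x speedup, 38% parallel efficiency
```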
You may also be limited by RAM bandwidth: if the program wants to use more RAM bandwidth than is available, it will slow down because CPU threads will wait longer than expected past a certain number of threads (Intel VTune is recommended to investigate).
CPU frequency scaling is another factor. We might blame Intel/AMD for this (Turbo Boost / Turbo Core), but it at least allows squeezing more performance out of a low number of threads. The architecture and the instructions used also matter, as they may trigger different turbo frequency tables.
Example of difference on Intel Xeon Gold 5120:
1 thread: 3.2 GHz (turbo base), 3.1 GHz (turbo AVX-256), 2.9 GHz (turbo AVX-512)
14 threads: 2.6 GHz (turbo base), 2.2 GHz (turbo AVX-256), 1.6 GHz (turbo AVX-512)
Therefore, when running 14 threads vs 1 thread, a single busy thread is +23% / +41% / +81% faster (turbo base / AVX-256 / AVX-512) than each of the 14 busy threads.
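The percentages above follow directly from the frequency table; a quick check:

```python
# Turbo frequencies (GHz) from the Xeon Gold 5120 table above.
one_thread = {"base": 3.2, "AVX-256": 3.1, "AVX-512": 2.9}
fourteen_threads = {"base": 2.6, "AVX-256": 2.2, "AVX-512": 1.6}

for kind in one_thread:
    gain = one_thread[kind] / fourteen_threads[kind] - 1
    print(f"{kind}: a single busy thread is {gain:+.0%} faster "
          f"than each of 14 busy threads")
# base: +23%, AVX-256: +41%, AVX-512: +81%
```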
There are usually 3 reasons for this: power limits, thermal limits, and voltage regulator module (VRM) limits (and three other rare cases: not feeding enough current (amps) to the CPU, not feeding enough voltage to the CPU, or some mysterious behaviour of the motherboard, which is common in laptops).
On consumer hardware, the typical limitation is power (desktops) and thermal (laptops). Power limits are usually impossible to override without motherboard support + CPU overclocking support. Because power/thermal limits are difficult to hit with few threads, they may exacerbate poor efficiency on consumer hardware.
Hence usually consumer hardware scaling != server hardware scaling.
And then, we have different CPUs with different limitations (and sometimes those limitations are not even related to the CPUs), which usually leads to consumer hardware 1 != consumer hardware 2 (even if they are identical CPUs).
Example: repeated Cinebench R15 benchmarks on laptops seem to provide great CPU results at the beginning, but repeat it 30 times and the results are usually way poorer than expected (unless you are not thermal/power/VRM limited).
Another example: two (2) computers with an Intel Core i7-10510U may seem identical. Even assuming no thermal/power/VRM limitation, it is possible one is not hitting its 25W target (TDP-up, 2.3 GHz base) and the other not hitting its 10W target (TDP-down, 800 MHz base). This is actually the reality in practice. And yet, they are both supposed to turbo to 4.9 GHz!
This changes the number of busy threads, hence impacts the Turbo Boost / Turbo Core thread count, and therefore may downclock the CPU further, lowering the expected efficiency.
Example: https://github.com/Laurae2/ml-perf/issues/6
The compiler also matters: for instance, I have seen differences ranging from zero difference to 400% faster just by changing the compiler (and also using different compilation flags).
Example: +400% performance boost by switching from LLVM to the Intel compiler for AVX-512 specific structure function kernels.
Ambient conditions matter too: I have noticed in very hot and humid areas (as sometimes happens in the USA) that performance degrades significantly for no apparent reason when using multiple threads (whether it is a server or a desktop does not matter, they are all affected), hurting efficiency badly.
This may also be due to the thermal paste requiring a change (hitting thermal limits way earlier than expected, hence poor efficiency).
Usually, putting a sensible “default” would depend on the hardware, the software, the dataset, and the environment. It is difficult to provide a value which fits every use case, as the possible combinations are virtually unlimited (each case has its own “best” value).
It is important to tell users there is no “best” value out of the box.
However, if a default is really mandatory, a reasonable default thread count is usually the lowest number of cores an “end user” may have per true physical CPU socket, which is currently 4 (8 if allowing hyperthreads). You may have a 16-thread server which is actually a 4-NUMA-node server and would suffer heavily from using 16 threads without NUMA optimizations (example: Intel Xeon Platinum 8156).
For handling CPU quotas, it is usually good practice to expose an environment variable that software can read in order to adjust its behaviour (to optimize CPU quota usage); a small sketch of this is shown below. However, I am not sure how this would be done for the “general public” (in business, it is common practice to agree on the name and usage of such an environment variable, e.g. to indicate burst usage only).
Or use OMP_THREAD_LIMIT if a temporary limit is required.
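A minimal sketch of that practice, assuming a hypothetical site-specific variable named GBDT_MAX_THREADS (the variable name and helper below are illustrative, not an existing API):

```python
import os


def quota_aware_n_threads(default_n_threads):
    """Cap the thread count via a hypothetical GBDT_MAX_THREADS variable.

    Falls back to the provided default when the variable is unset, so a
    container running under a CFS CPU quota can simply export
    GBDT_MAX_THREADS=2 without touching the code.
    """
    limit = os.environ.get("GBDT_MAX_THREADS")
    return min(default_n_threads, int(limit)) if limit else default_n_threads
```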