many functions slow for bigger-than-cache arrays on linux from numpy>=1.16.5
Performance is badly degraded for many operations, often when array sizes similar to the L2/L3 cache size are involved. The effect is present for numpy>=1.16.5 on an Intel i9-9960X Ubuntu 18.04 system, and is not observed for any numpy version on an Intel i9-8950HK macOS Mojave system.
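For context, a quick back-of-the-envelope check (my own illustration, not part of the original report) comparing the benchmark sizes used below against the 22528 kB L3 cache reported by dmidecode further down; float64 arrays of 2**22 elements and up no longer fit in L3:

# Compare benchmark array sizes to the reported 22528 kB L3 cache.
l3_bytes = 22528 * 1024
for n in (2**20, 2**21, 2**22, 2**23):
    arr_bytes = n * 8  # float64 elements
    print(f'{n:>8} elements = {arr_bytes / 2**20:5.1f} MiB '
          f'({arr_bytes / l3_bytes:.2f}x L3)')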
Reproducing code example:
conda create -n py37np164 python=3.7 numpy=1.16.4
conda create -n py37np165 python=3.7 numpy=1.16.5
conda create -n py37np181 python=3.7 numpy=1.18.1
import sys
import numpy as np
from timeit import Timer

print(np.__version__, sys.version)

# results sensitive to hardware cache size
n = 5
sizes = [2**20, 2**21, 2**22, 2**23]
stmts = [
    'np.zeros(({},))',
    'np.random.rand({})',
    'np.linspace(0,1,{})',
    'np.exp(np.zeros(({},)))',
]

for stmt in stmts:
    for size in sizes:
        s = stmt.format(size)
        print(s + ':')
        t = Timer(s, globals=globals()).timeit(n)
        print(f'\t{size/n/t} elements/second')
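As a side note, a variant of the script above (my own sketch, not part of the original report) that reports best-of-five per-call times can be less sensitive to the swap/disk-cache noise discussed at the end of the thread; the outputs below come from the original script:

import numpy as np
from timeit import repeat

for size in (2**20, 2**21, 2**22, 2**23):
    stmt = f'np.exp(np.zeros(({size},)))'
    # number=1 times single calls; take the fastest of 5 repeats
    best = min(repeat(stmt, globals=globals(), number=1, repeat=5))
    print(f'{stmt}: {best * 1e3:.3f} ms/call, '
          f'{size / best:.3e} elements/second')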
output with 1.16.4:
1.16.4 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
61421511.05126992 elements/second
np.zeros((2097152,)):
88332565.33730087 elements/second
np.zeros((4194304,)):
27011333457.509125 elements/second
np.zeros((8388608,)):
55727273740.89583 elements/second
np.random.rand(1048576):
5537466.835408629 elements/second
np.random.rand(2097152):
4565838.5126173515 elements/second
np.random.rand(4194304):
5056596.831366348 elements/second
np.random.rand(8388608):
4877062.023493181 elements/second
np.linspace(0,1,1048576):
4726193.511268751 elements/second
np.linspace(0,1,2097152):
24889336.518190637 elements/second
np.linspace(0,1,4194304):
10163866.009809423 elements/second
np.linspace(0,1,8388608):
11765754.105458323 elements/second
np.exp(np.zeros((1048576,))):
54532313.29674471 elements/second
np.exp(np.zeros((2097152,))):
33393087.31574934 elements/second
np.exp(np.zeros((4194304,))):
37155269.74395583 elements/second
np.exp(np.zeros((8388608,))):
34553576.9698848 elements/second
output with 1.16.5:
1.16.5 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
60143502.91423442 elements/second
np.zeros((2097152,)):
87341513.42070064 elements/second
np.zeros((4194304,)):
19292941759.91131 elements/second
np.zeros((8388608,)):
54397048327.81841 elements/second
np.random.rand(1048576):
5473600.047150185 elements/second
np.random.rand(2097152):
18752.041155728602 elements/second
np.random.rand(4194304):
19324.038630861298 elements/second
np.random.rand(8388608):
14506.20855024289 elements/second
np.linspace(0,1,1048576):
3856487.073921443 elements/second
np.linspace(0,1,2097152):
16523085.001294544 elements/second
np.linspace(0,1,4194304):
11812.122234984105 elements/second
np.linspace(0,1,8388608):
13185.01070840559 elements/second
np.exp(np.zeros((1048576,))):
47433268.94274437 elements/second
np.exp(np.zeros((2097152,))):
107295.34448171785 elements/second
np.exp(np.zeros((4194304,))):
76042.71390730397 elements/second
np.exp(np.zeros((8388608,))):
92983.00220850644 elements/second
output with 1.18.1:
1.18.1 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
50735117.70738101 elements/second
np.zeros((2097152,)):
83684116.17091243 elements/second
np.zeros((4194304,)):
21767035415.033813 elements/second
np.zeros((8388608,)):
54589913512.271355 elements/second
np.random.rand(1048576):
8093427.641236661 elements/second
np.random.rand(2097152):
8188684.253048504 elements/second
np.random.rand(4194304):
13750.96922032222 elements/second
np.random.rand(8388608):
12809.948623963204 elements/second
np.linspace(0,1,1048576):
28700369.02352331 elements/second
np.linspace(0,1,2097152):
31411688.42840547 elements/second
np.linspace(0,1,4194304):
12422.915392248138 elements/second
np.linspace(0,1,8388608):
15747.353944479057 elements/second
np.exp(np.zeros((1048576,))):
6914276.304259895 elements/second
np.exp(np.zeros((2097152,))):
9039.82013534727 elements/second
np.exp(np.zeros((4194304,))):
15507.785512087936 elements/second
np.exp(np.zeros((8388608,))):
13965.641080531177 elements/second
Python/NumPy version
Results of conda env export (the envs differ only in the numpy and numpy-base versions):
name: py37np165
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- blas=1.0=mkl
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py37_0
- intel-openmp=2020.0=166
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- mkl=2020.0=166
- mkl-service=2.3.0=py37he904b0f_0
- mkl_fft=1.0.15=py37ha843d7b_0
- mkl_random=1.1.0=py37hd6b4f25_0
- ncurses=6.1=he6710b0_1
- numpy=1.16.5=py37h7e9f1db_0
- numpy-base=1.16.5=py37hde5b4d6_0
- openssl=1.1.1d=h7b6447c_3
- pip=20.0.2=py37_1
- python=3.7.6=h0371630_2
- readline=7.0=h7b6447c_5
- setuptools=45.1.0=py37_0
- six=1.14.0=py37_0
- sqlite=3.31.1=h7b6447c_0
- tk=8.6.8=hbc83047_0
- wheel=0.34.2=py37_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
OS/Hardware:
Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-37-generic x86_64)
dmidecode output for the processor and caches:
Handle 0x0061, DMI type 4, 48 bytes
Processor Information
Socket Designation: LGA 2066 R4
Type: Central Processor
Family: Xeon
Manufacturer: Intel(R) Corporation
ID: 54 06 05 00 FF FB EB BF
Signature: Type 0, Family 6, Model 85, Stepping 4
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
SS (Self-snoop)
HTT (Multi-threading)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz
Voltage: 1.6 V
External Clock: 100 MHz
Max Speed: 4000 MHz
Current Speed: 3100 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: 0x005E
L2 Cache Handle: 0x005F
L3 Cache Handle: 0x0060
Serial Number: Not Specified
Asset Tag: UNKNOWN
Part Number: Not Specified
Core Count: 16
Core Enabled: 16
Thread Count: 32
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control
Handle 0x005E, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1-Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 1024 kB
Maximum Size: 1024 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Instruction
Associativity: 8-way Set-associative
Handle 0x005F, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2-Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Varies With Memory Address
Location: Internal
Installed Size: 16384 kB
Maximum Size: 16384 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 16-way Set-associative
Handle 0x0060, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3-Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Varies With Memory Address
Location: Internal
Installed Size: 22528 kB
Maximum Size: 22528 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: Fully Associative
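As an aside, the same cache sizes can be read without root from sysfs on Linux; a minimal sketch (mine, not part of the report):

from pathlib import Path

# Each index* directory under cpu0/cache describes one cache
# (L1 instruction, L1 data, L2, L3) with its size in a 'size' file.
base = Path('/sys/devices/system/cpu/cpu0/cache')
for idx in sorted(base.glob('index*')):
    level = (idx / 'level').read_text().strip()
    ctype = (idx / 'type').read_text().strip()
    size = (idx / 'size').read_text().strip()
    print(f'L{level} {ctype}: {size}')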
@charris @pentschev I did not have time to try building from source or setting transparent_hugepage/enabled today, but discovered something else. The issue appears related to swap/disk cache. We had been fitting models with stochastic gradient descent to large datasets. Data were being streamed continuously from disk, which was keeping disk cache memory and swap full. After killing these jobs, the issue disappeared. We think it was cache/swap specifically, because there was still plenty of available memory, disk bandwidth and idle CPU cores.

To get this rolling: I have a PR gh-15769 to add an environment variable, and to guess to disable it on kernels before 4.6. There seemed to be a slight preference for not guessing (I am personally a bit in favor of guessing, because it seems to me that we lose practically nothing if we just guess correctly most of the time).
Just to get a few opinions.
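For anyone wanting to check their own system before weighing in, here is a rough sketch (my own, not the PR's actual logic) of the "guess" being discussed: look at the running kernel version and the transparent hugepage mode the kernel reports:

import platform
from pathlib import Path

def kernel_older_than(major, minor):
    # platform.release() gives e.g. '5.0.0-37-generic'
    parts = platform.release().split('-')[0].split('.')
    try:
        return (int(parts[0]), int(parts[1])) < (major, minor)
    except (IndexError, ValueError):
        return False  # cannot tell; do not guess

def thp_mode():
    # The active mode is the bracketed entry, e.g. 'always [madvise] never'
    p = Path('/sys/kernel/mm/transparent_hugepage/enabled')
    if not p.exists():
        return None
    text = p.read_text()
    return text[text.index('[') + 1:text.index(']')]

print('kernel older than 4.6:', kernel_older_than(4, 6))
print('transparent hugepage mode:', thp_mode())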