many functions slow for bigger-than-cache arrays on linux from numpy>=1.16.5
Performance is badly degraded for many operations, often when array sizes similar to the L2/L3 cache size are involved. The effect is present for numpy>=1.16.5 on an Intel i9-9960X Ubuntu 18.04 system, and is not observed for any numpy version on an Intel i9-8950HK macOS Mojave system.
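For context, a quick back-of-the-envelope check (my own illustration, not part of the original report) comparing the benchmark sizes used below against the 22528 kB L3 cache reported by dmidecode further down; float64 arrays of 2**22 elements and up no longer fit in L3:

# Compare benchmark array sizes to the reported 22528 kB L3 cache.
l3_bytes = 22528 * 1024
for n in (2**20, 2**21, 2**22, 2**23):
    arr_bytes = n * 8  # float64 elements
    print(f'{n:>8} elements = {arr_bytes / 2**20:5.1f} MiB '
          f'({arr_bytes / l3_bytes:.2f}x L3)')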
Reproducing code example:
conda create -n py37np164 python=3.7 numpy=1.16.4
conda create -n py37np165 python=3.7 numpy=1.16.5
conda create -n py37np181 python=3.7 numpy=1.18.1
import sys
import numpy as np
from timeit import Timer

print(np.__version__, sys.version)

# results sensitive to hardware cache size
n = 5
sizes = [2**20, 2**21, 2**22, 2**23]
stmts = [
    'np.zeros(({},))',
    'np.random.rand({})',
    'np.linspace(0,1,{})',
    'np.exp(np.zeros(({},)))',
]

for stmt in stmts:
    for size in sizes:
        s = stmt.format(size)
        print(s + ':')
        t = Timer(s, globals=globals()).timeit(n)
        print(f'\t{size/n/t} elements/second')
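As a side note, a variant of the script above (my own sketch, not part of the original report) that reports best-of-five per-call times can be less sensitive to the swap/disk-cache noise discussed at the end of the thread; the outputs below come from the original script:

import numpy as np
from timeit import repeat

for size in (2**20, 2**21, 2**22, 2**23):
    stmt = f'np.exp(np.zeros(({size},)))'
    # number=1 times single calls; take the fastest of 5 repeats
    best = min(repeat(stmt, globals=globals(), number=1, repeat=5))
    print(f'{stmt}: {best * 1e3:.3f} ms/call, '
          f'{size / best:.3e} elements/second')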
output with 1.16.4:
1.16.4 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
61421511.05126992 elements/second
np.zeros((2097152,)):
88332565.33730087 elements/second
np.zeros((4194304,)):
27011333457.509125 elements/second
np.zeros((8388608,)):
55727273740.89583 elements/second
np.random.rand(1048576):
5537466.835408629 elements/second
np.random.rand(2097152):
4565838.5126173515 elements/second
np.random.rand(4194304):
5056596.831366348 elements/second
np.random.rand(8388608):
4877062.023493181 elements/second
np.linspace(0,1,1048576):
4726193.511268751 elements/second
np.linspace(0,1,2097152):
24889336.518190637 elements/second
np.linspace(0,1,4194304):
10163866.009809423 elements/second
np.linspace(0,1,8388608):
11765754.105458323 elements/second
np.exp(np.zeros((1048576,))):
54532313.29674471 elements/second
np.exp(np.zeros((2097152,))):
33393087.31574934 elements/second
np.exp(np.zeros((4194304,))):
37155269.74395583 elements/second
np.exp(np.zeros((8388608,))):
34553576.9698848 elements/second
output with 1.16.5:
1.16.5 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
60143502.91423442 elements/second
np.zeros((2097152,)):
87341513.42070064 elements/second
np.zeros((4194304,)):
19292941759.91131 elements/second
np.zeros((8388608,)):
54397048327.81841 elements/second
np.random.rand(1048576):
5473600.047150185 elements/second
np.random.rand(2097152):
18752.041155728602 elements/second
np.random.rand(4194304):
19324.038630861298 elements/second
np.random.rand(8388608):
14506.20855024289 elements/second
np.linspace(0,1,1048576):
3856487.073921443 elements/second
np.linspace(0,1,2097152):
16523085.001294544 elements/second
np.linspace(0,1,4194304):
11812.122234984105 elements/second
np.linspace(0,1,8388608):
13185.01070840559 elements/second
np.exp(np.zeros((1048576,))):
47433268.94274437 elements/second
np.exp(np.zeros((2097152,))):
107295.34448171785 elements/second
np.exp(np.zeros((4194304,))):
76042.71390730397 elements/second
np.exp(np.zeros((8388608,))):
92983.00220850644 elements/second
output with 1.18.1:
1.18.1 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
50735117.70738101 elements/second
np.zeros((2097152,)):
83684116.17091243 elements/second
np.zeros((4194304,)):
21767035415.033813 elements/second
np.zeros((8388608,)):
54589913512.271355 elements/second
np.random.rand(1048576):
8093427.641236661 elements/second
np.random.rand(2097152):
8188684.253048504 elements/second
np.random.rand(4194304):
13750.96922032222 elements/second
np.random.rand(8388608):
12809.948623963204 elements/second
np.linspace(0,1,1048576):
28700369.02352331 elements/second
np.linspace(0,1,2097152):
31411688.42840547 elements/second
np.linspace(0,1,4194304):
12422.915392248138 elements/second
np.linspace(0,1,8388608):
15747.353944479057 elements/second
np.exp(np.zeros((1048576,))):
6914276.304259895 elements/second
np.exp(np.zeros((2097152,))):
9039.82013534727 elements/second
np.exp(np.zeros((4194304,))):
15507.785512087936 elements/second
np.exp(np.zeros((8388608,))):
13965.641080531177 elements/second
Python/NumPy version
Results of conda env export (the envs differ only in the numpy and numpy-base versions):
name: py37np165
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- blas=1.0=mkl
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py37_0
- intel-openmp=2020.0=166
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran-ng=7.3.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- mkl=2020.0=166
- mkl-service=2.3.0=py37he904b0f_0
- mkl_fft=1.0.15=py37ha843d7b_0
- mkl_random=1.1.0=py37hd6b4f25_0
- ncurses=6.1=he6710b0_1
- numpy=1.16.5=py37h7e9f1db_0
- numpy-base=1.16.5=py37hde5b4d6_0
- openssl=1.1.1d=h7b6447c_3
- pip=20.0.2=py37_1
- python=3.7.6=h0371630_2
- readline=7.0=h7b6447c_5
- setuptools=45.1.0=py37_0
- six=1.14.0=py37_0
- sqlite=3.31.1=h7b6447c_0
- tk=8.6.8=hbc83047_0
- wheel=0.34.2=py37_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
OS/Hardware:
Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-37-generic x86_64)
dmidecode output for the processor and caches:
Handle 0x0061, DMI type 4, 48 bytes
Processor Information
Socket Designation: LGA 2066 R4
Type: Central Processor
Family: Xeon
Manufacturer: Intel(R) Corporation
ID: 54 06 05 00 FF FB EB BF
Signature: Type 0, Family 6, Model 85, Stepping 4
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
SS (Self-snoop)
HTT (Multi-threading)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz
Voltage: 1.6 V
External Clock: 100 MHz
Max Speed: 4000 MHz
Current Speed: 3100 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: 0x005E
L2 Cache Handle: 0x005F
L3 Cache Handle: 0x0060
Serial Number: Not Specified
Asset Tag: UNKNOWN
Part Number: Not Specified
Core Count: 16
Core Enabled: 16
Thread Count: 32
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control
Handle 0x005E, DMI type 7, 19 bytes
Cache Information
Socket Designation: L1-Cache
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 1024 kB
Maximum Size: 1024 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Instruction
Associativity: 8-way Set-associative
Handle 0x005F, DMI type 7, 19 bytes
Cache Information
Socket Designation: L2-Cache
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Varies With Memory Address
Location: Internal
Installed Size: 16384 kB
Maximum Size: 16384 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 16-way Set-associative
Handle 0x0060, DMI type 7, 19 bytes
Cache Information
Socket Designation: L3-Cache
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Varies With Memory Address
Location: Internal
Installed Size: 22528 kB
Maximum Size: 22528 kB
Supported SRAM Types:
Synchronous
Installed SRAM Type: Synchronous
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: Fully Associative
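As an aside, the same cache sizes can be read without root from sysfs on Linux; a minimal sketch (mine, not part of the report):

from pathlib import Path

# Each index* directory under cpu0/cache describes one cache
# (L1 instruction, L1 data, L2, L3) with its size in a 'size' file.
base = Path('/sys/devices/system/cpu/cpu0/cache')
for idx in sorted(base.glob('index*')):
    level = (idx / 'level').read_text().strip()
    ctype = (idx / 'type').read_text().strip()
    size = (idx / 'size').read_text().strip()
    print(f'L{level} {ctype}: {size}')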
@charris @pentschev I did not have time to try building from source or setting transparent_hugepage/enabled today, but discovered something else. The issue appears related to swap/disk cache. We had been fitting models with stochastic gradient descent to large datasets. Data were being streamed continuously from disk, which was keeping disk cache memory and swap full. After killing these jobs, the issue disappeared. We think it was cache/swap specifically, because there was still plenty of available memory, disk bandwidth and idle CPU cores.

To get this rolling: I have a PR gh-15769 to add an environment variable, and to guess to disable it on kernels before 4.6. There seemed to be a slight preference for not guessing (I am personally a bit in favor of guessing, because it seems to me that we lose practically nothing if we just guess correctly most of the time).
Just to get a few opinions.
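For anyone wanting to check their own system before weighing in, here is a rough sketch (my own, not the PR's actual logic) of the "guess" being discussed: look at the running kernel version and the transparent hugepage mode the kernel reports:

import platform
from pathlib import Path

def kernel_older_than(major, minor):
    # platform.release() gives e.g. '5.0.0-37-generic'
    parts = platform.release().split('-')[0].split('.')
    try:
        return (int(parts[0]), int(parts[1])) < (major, minor)
    except (IndexError, ValueError):
        return False  # cannot tell; do not guess

def thp_mode():
    # The active mode is the bracketed entry, e.g. 'always [madvise] never'
    p = Path('/sys/kernel/mm/transparent_hugepage/enabled')
    if not p.exists():
        return None
    text = p.read_text()
    return text[text.index('[') + 1:text.index(']')]

print('kernel older than 4.6:', kernel_older_than(4, 6))
print('transparent hugepage mode:', thp_mode())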