typed.List 2x slower than reflected list (reproduced on Intel and AMD CPUs)
Reporting a bug
- I have tried using the latest released version of Numba (the most recent version is listed in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
- I have included a self-contained code sample to reproduce the problem, i.e. it's possible to run it as `python bug.py`.
You need numpy, scipy and numba to run the code.
The only difference between the `process` variants is the way individual stencils are accessed: through iterators in `process`, `getitem_unchecked` in `process2` (typed List only), and normal `getitem` in `process3` (reflected list only).
Code (`bug.py`):
```python
import numba as nb
import numpy as np
from numba.extending import intrinsic
from llvmlite import ir  # llvmlite IR: ir.types.IntType / ir.Constant below are llvmlite API
from time import perf_counter_ns
from scipy.ndimage import gaussian_filter


@intrinsic
def addmod_tuples(tyctx, tpl1, tpl2, shape):
    """Returns tuple((t1+t2)%s for t1, t2, s in zip(tpl1, tpl2, shape))"""
    ret = tpl1
    count = tpl1.count
    dtype = ret.dtype
    typ = ir.types.IntType(dtype.bitwidth)
    zero = ir.Constant(typ, 0)

    def codegen(cgctx, builder, sig, args):
        val_tpl1, val_tpl2, val_shape = args
        tup = cgctx.get_constant_undef(ret)
        for i in range(count):
            # Extract scalars
            t1 = builder.extract_value(val_tpl1, i)
            t2 = builder.extract_value(val_tpl2, i)
            s = builder.extract_value(val_shape, i)
            # Modulo
            main_val = builder.add(t1, t2)
            overflow = builder.icmp_signed(">=", main_val, s, name='')
            underflow = builder.icmp_signed("<", main_val, zero, name='')
            bb_main = builder.block
            with builder.if_then(overflow, likely=False):
                over_val = builder.sub(main_val, s)
                bb_overflow = builder.block
            val_ = builder.phi(typ)
            val_.add_incoming(main_val, bb_main)
            val_.add_incoming(over_val, bb_overflow)
            bb_main2 = builder.block
            with builder.if_then(underflow, likely=False):
                under_val = builder.add(main_val, s)
                bb_underflow = builder.block
            # Phi node: val = (t1+t2)%s
            val = builder.phi(typ)
            val.add_incoming(val_, bb_main2)
            val.add_incoming(under_val, bb_underflow)
            # Assign to tuple
            tup = builder.insert_value(tup, val, i)
        return tup

    sig = ret(tpl1, tpl2, shape)
    return sig, codegen


@nb.jit
def process(arr, stencils):
    shape = arr.shape
    direction = np.empty(arr.shape, dtype=np.int8)
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1
        g_c = arr[index]
        ## THIS IS THE MAIN CHANGING PART:
        for i, stencil in enumerate(stencils):
            ## END OF CHANGING PART
            nindex = addmod_tuples(index, stencil, shape)
            g_n = arr[nindex]
            new_g = (g_n - g_c)
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
        if (new_g < 0.0) and (cdir == -1):
            # Local maximum
            cdir = -2
        direction[index] = cdir
    return direction


@nb.jit
def process2(arr, stencils):
    shape = arr.shape
    direction = np.empty(arr.shape, dtype=np.int8)
    N = len(stencils)
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  # No gradient, default
        g_c = arr[index]
        ## THIS IS THE MAIN CHANGING PART:
        for i in range(N):
            stencil = stencils.getitem_unchecked(i)
            ## END OF CHANGING PART
            nindex = addmod_tuples(index, stencil, shape)
            g_n = arr[nindex]
            new_g = (g_n - g_c)
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
        if (new_g < 0.0) and (cdir == -1):
            # Local maximum
            cdir = -2
        direction[index] = cdir
    return direction


@nb.jit
def process3(arr, stencils):
    shape = arr.shape
    direction = np.empty(arr.shape, dtype=np.int8)
    N = len(stencils)
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  # No gradient, default
        g_c = arr[index]
        ## THIS IS THE MAIN CHANGING PART:
        for i in range(N):
            stencil = stencils[i]
            ## END OF CHANGING PART
            nindex = addmod_tuples(index, stencil, shape)
            g_n = arr[nindex]
            new_g = (g_n - g_c)
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
        if (new_g < 0.0) and (cdir == -1):
            # Local maximum
            cdir = -2
        direction[index] = cdir
    return direction


# Generate stencils
r = (-1, 0, 1)
stencils_refl = [(i, j, k) for i in r for j in r for k in r if (i, j, k) != (0, 0, 0)]
stencils_typed = nb.typed.List()
for s in stencils_refl:
    stencils_typed.append(s)

# Generate dummy data
shape = (259, 259, 259)
arr = np.zeros(shape, dtype=np.float64)
for _ in range(3):
    arr += (np.random.rand(*shape) > 0.95).astype(np.float64)
arr = gaussian_filter(arr, 10.0, mode="wrap")

# JIT warm-up
d1 = process(arr, stencils_refl)
d2 = process(arr, stencils_typed)
d3 = process2(arr, stencils_typed)
d4 = process3(arr, stencils_refl)
assert np.all(d1 == d2)
assert np.all(d2 == d3)
assert np.all(d3 == d4)


# Helper function for benchmarking
def bench(n, /, f, *args, **kwargs):
    tot = 0.0
    for _ in range(n):
        tic = perf_counter_ns()
        f(*args, **kwargs)
        toc = perf_counter_ns()
        tot += (toc - tic)
    return tot / n


# Benchmark
print("============================================")
print(" List type implementation time (ms)")
print("--------------------------------------------")
t = bench(3, process, arr, stencils_refl)
print(f" reflected iterators    {t/1e6:8.5f}")
t = bench(3, process, arr, stencils_typed)
print(f" typed iterators        {t/1e6:8.5f}")
t = bench(3, process2, arr, stencils_typed)
print(f" typed getitem          {t/1e6:8.5f}")
t = bench(3, process3, arr, stencils_refl)
print(f" reflected getitem      {t/1e6:8.5f}")
print("--------------------------------------------")
```
With real-life data and a slightly different implementation (which adds complexity for no benefit here), I get:
```
============================================
 List type implementation time (ms)
--------------------------------------------
 reflected iterators    1236.13449
 typed iterators        2368.16592
 typed getitem          2299.75980
 reflected getitem      2030.11379
--------------------------------------------
```
With the exact code provided above, I get:
```
============================================
 List type implementation time (ms)
--------------------------------------------
 reflected iterators    1898.99720
 typed iterators        2673.21427
 typed getitem          2933.83761
 reflected getitem      2405.43470
--------------------------------------------
```
Thanks for the update. I can reproduce similar, but not the same, results.
Just for future reference: in the above comment (https://github.com/numba/numba/issues/7925#issuecomment-1073969571) the intermediate representation being diff'd is Numba IR, not LLVM IR. The way Numba represents the bytecode for these functions as Numba IR, prior to any transformation, should be invariant of the type(s). There are transforms that can take place (both with and without type information), but I don't think anything significant would be applicable here.
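To see that Numba IR for yourself, one option is the documented `NUMBA_DUMP_IR` environment variable, which prints the IR at compile time; a minimal sketch (the variable must be set before Numba is imported):

```python
import os
os.environ["NUMBA_DUMP_IR"] = "1"  # must be set before importing Numba

import numba as nb

@nb.njit
def f(x):
    return x + 1

f(1)  # the first call triggers compilation and dumps the Numba IR to stdout
```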
I think any issues identified in the above are likely to be quite involved and will require looking into the generated LLVM IR more deeply. From a quick look, it seems like the “faster” loops are potentially running more quickly because the most often taken route through the loop body has no reference counting operations present. It could also be that LLVM has managed to “prove” something about the loop bounds in the enumerate/iter cases and has optimised further.
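One way to test the refcounting hypothesis is to diff the optimised LLVM IR of two variants directly. A minimal sketch using `Dispatcher.inspect_llvm` on the dispatchers from `bug.py` above (both must already have been called once so a compiled signature exists; the output filenames are arbitrary):

```python
# Fetch the optimised LLVM IR for the first compiled signature of each
# variant and write it out for diffing; reference counting shows up as
# calls to NRT_incref / NRT_decref in the IR.
ir_iter = process.inspect_llvm(process.signatures[0])
ir_getitem = process2.inspect_llvm(process2.signatures[0])
with open("process_iter.ll", "w") as fh:
    fh.write(ir_iter)
with open("process2_getitem.ll", "w") as fh:
    fh.write(ir_getitem)
```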
Numba has tools to help look at this sort of thing if you are interested. Adding the `debug=True` flag to the `@jit` decorator and then calling `inspect_cfg`, e.g. as sketched below, will show the LLVM CFG for the first signature compiled for the function `process_enumerate` locally as a PDF document (`view=True` does this); docs: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#Dispatcher.inspect_cfg
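A sketch of such a call (assuming `process_enumerate` is the dispatcher from the earlier comment and has already been compiled once):

```python
# Render the control-flow graph of the first compiled signature;
# view=True writes a PDF and opens it in the system viewer.
cfg = process_enumerate.inspect_cfg(process_enumerate.signatures[0])
cfg.display(view=True)
```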