
typed.List 2x slower than reflected list (reproduced on Intel and AMD CPUs)


Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • I have included a self-contained code sample to reproduce the problem, i.e. it is possible to run it as 'python bug.py'.

You need numpy, scipy and numba to run the code.

The only difference between the three process functions is the way individual stencils are accessed: through iterators in process, getitem_unchecked in process2 (typed List only), and normal getitem in process3 (reflected list only).
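
For reference, here is a minimal sketch of the three access idioms in isolation (illustrative only, not the benchmark itself; it assumes getitem_unchecked is available on numba.typed.List in the installed version, as it is in the code below):

import numba as nb

# Build a small typed list of tuples, same pattern as in bug.py below.
lst = nb.typed.List()
for t in [(0, 1), (1, 0)]:
    lst.append(t)

@nb.njit
def by_iterator(lst):
    acc = 0
    for t in lst:  # iterator protocol, as in process
        acc += t[0]
    return acc

@nb.njit
def by_getitem_unchecked(lst):
    acc = 0
    for i in range(len(lst)):
        acc += lst.getitem_unchecked(i)[0]  # skips the bounds check, as in process2
    return acc

@nb.njit
def by_getitem(lst):
    acc = 0
    for i in range(len(lst)):
        acc += lst[i][0]  # bounds-checked getitem, as in process3
    return acc

assert by_iterator(lst) == by_getitem_unchecked(lst) == by_getitem(lst)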

Code (bug.py)
import numba as nb
import numpy as np
from numba.extending import intrinsic
from llvmlite import ir  # LLVM IR constructors (IntType, Constant) used in codegen
from time import perf_counter_ns
from scipy.ndimage import gaussian_filter

@intrinsic
def addmod_tuples(tyctx, tpl1, tpl2, shape):
    """Returns tuple((t1+t2)%s for t1, t2, s in zip(tpl1, tpl2, shape))"""
    ret = tpl1
    count = tpl1.count
    dtype = ret.dtype
    typ = ir.IntType(dtype.bitwidth)
    zero = ir.Constant(typ, 0)
    
    def codegen(cgctx, builder, sig, args):
        val_tpl1, val_tpl2, val_shape = args

        tup = cgctx.get_constant_undef(ret)
        
        for i in range(count):
            # Extract scalars
            t1 = builder.extract_value(val_tpl1, i)
            t2 = builder.extract_value(val_tpl2, i)
            s = builder.extract_value(val_shape, i)
            
            # Modulo: val = (t1 + t2) % s, via conditional corrections
            main_val = builder.add(t1, t2)
            
            overflow = builder.icmp_signed(">=", main_val, s, name='')
            underflow = builder.icmp_signed("<", main_val, zero, name='')
            
            bb_main = builder.block
            with builder.if_then(overflow, likely=False):
                over_val = builder.sub(main_val, s)
                bb_overflow = builder.block
                
            val_ = builder.phi(typ)
            val_.add_incoming(main_val, bb_main)
            val_.add_incoming(over_val, bb_overflow)
                
            bb_main2 = builder.block
            with builder.if_then(underflow, likely=False):
                under_val = builder.add(main_val, s)
                bb_underflow = builder.block
            
            # Phi node:  val = (t1+t2)%s
            val = builder.phi(typ)
            val.add_incoming(val_, bb_main2)
            val.add_incoming(under_val, bb_underflow)
            
            
            # Assign to tuple
            tup = builder.insert_value(tup, val, i)
        return tup
    sig = ret(tpl1, tpl2, shape)
    return sig, codegen

@nb.jit
def process(arr, stencils):
    
    shape = arr.shape     
    direction = np.empty(arr.shape, dtype=np.int8)
    
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  
        g_c = arr[index]
        
        
        ## THIS IS THE MAIN CHANGING PART:
        for i, stencil in enumerate(stencils):
            ## END OF CHANGING PART
            
            nindex = addmod_tuples(index, stencil, shape)
            
            g_n = arr[nindex]
            
            new_g = (g_n - g_c)
            
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
            
            if (new_g < 0.0) and (cdir == -1):
                # Local maximum
                cdir = -2
        direction[index] = cdir
    return direction

@nb.jit
def process2(arr, stencils):
    
    shape = arr.shape     
    direction = np.empty(arr.shape, dtype=np.int8)
    
    N = len(stencils)
    
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  # No gradient, default
        g_c = arr[index]

        ## THIS IS THE MAIN CHANGING PART:
        for i in range(N):
            stencil = stencils.getitem_unchecked(i)
            ## END OF CHANGING PART
            
            nindex = addmod_tuples(index, stencil, shape)
            
            g_n = arr[nindex]
            
            new_g = (g_n - g_c)
            
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
            
            if (new_g < 0.0) and (cdir == -1):
                # Local maximum
                cdir = -2
        direction[index] = cdir
    return direction

@nb.jit
def process3(arr, stencils):
    
    shape = arr.shape     
    direction = np.empty(arr.shape, dtype=np.int8)
    
    N = len(stencils)
    
    for index in nb.pndindex(shape):
        old_g = -np.inf
        cdir = -1  # No gradient, default
        g_c = arr[index]

        ## THIS IS THE MAIN CHANGING PART:
        for i in range(N):
            stencil = stencils[i]
            ## END OF CHANGING PART
            
            nindex = addmod_tuples(index, stencil, shape)
            
            g_n = arr[nindex]
            
            new_g = (g_n - g_c)
            
            if old_g < new_g:
                old_g = new_g
                if new_g > 0.0:
                    # Saves i (as in 'stencils[i]') as main direction
                    cdir = i
            
            if (new_g < 0.0) and (cdir == -1):
                # Local maximum
                cdir = -2
        direction[index] = cdir
    return direction


# Generate stencils
r = (-1, 0, 1)
stencils_refl = [(i, j, k) for i in r for j in r for k in r if (i, j, k) != (0, 0, 0)]
stencils_typed = nb.typed.List()
for s in stencils_refl:
    stencils_typed.append(s)
    
# Generate dummy data
shape = (259, 259, 259)
arr = np.zeros(shape, dtype=np.float64)

for _ in range(3):
    arr += (np.random.rand(*shape) > 0.95).astype(np.float64)
    arr = gaussian_filter(arr, 10.0, mode="wrap")


# JIT warm-up
d1 = process(arr, stencils_refl)
d2 = process(arr, stencils_typed)
d3 = process2(arr, stencils_typed)
d4 = process3(arr, stencils_refl)

assert np.all(d1 == d2)
assert np.all(d2 == d3)
assert np.all(d3 == d4)


# Helper function for benchmarking
def bench(n, /, f, *args, **kwargs):
    tot = 0.0
    for _ in range(n):
        tic = perf_counter_ns()
        f(*args, **kwargs)
        toc = perf_counter_ns()
        tot += (toc - tic)
    return tot / n


# Benchmark
print( "============================================")
print( " List type     implementation     time (ms)")
print( "--------------------------------------------")

t = bench(3, process, arr, stencils_refl)
print(f" reflected       iterators        {t/1e6:8.5f}")

t = bench(3, process, arr, stencils_typed)
print(f"   typed         iterators        {t/1e6:8.5f}")

t = bench(3, process2, arr, stencils_typed)
print(f"   typed          getitem         {t/1e6:8.5f}")

t = bench(3, process3, arr, stencils_refl)
print(f" reflected        getitem         {t/1e6:8.5f}")
print( "--------------------------------------------")

With real-life data and a slightly different implementation (extra complexity that buys nothing), I get:

============================================
 List type     implementation     time (ms)
--------------------------------------------
 reflected       iterators        1236.13449
   typed         iterators        2368.16592
   typed          getitem         2299.75980
 reflected        getitem         2030.11379
--------------------------------------------

With the exact code I provided, I get:

============================================
 List type     implementation     time (ms)
--------------------------------------------
 reflected       iterators        1898.99720
   typed         iterators        2673.21427
   typed          getitem         2933.83761
 reflected        getitem         2405.43470
--------------------------------------------

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14 (3 by maintainers)

Top GitHub Comments

2 reactions
stuartarchibald commented, Mar 22, 2022

Thanks for the update. I can reproduce similar, but not the same, results.

============================================
 List type     implementation     time (ms)
--------------------------------------------
 reflected       enumerate        409.48466
 reflected       iterators        421.33267

 reflected        getitem         425.20516
   typed         iterators        402.85790
   typed     getitem_unchecked    454.65983

   typed         enumerate        481.11610

   typed          getitem         1340.85965
============================================

Just for future reference: in the above comment (https://github.com/numba/numba/issues/7925#issuecomment-1073969571), the intermediate representation that's being diff'd is Numba IR, not LLVM IR. The way Numba represents the bytecode of these functions as Numba IR prior to any transformation should be invariant of the type(s) involved. There are transforms that can take place (both with and without type information), but I don't think anything significant would be applicable here.
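
For anyone who wants to look at the Numba IR side of this themselves, here is a minimal sketch using the public Dispatcher API (not from the thread; the function must be compiled before inspecting):

import numba as nb

@nb.njit
def add_one(x):
    return x + 1

add_one(1)               # trigger compilation for one signature
add_one.inspect_types()  # print the source annotated with typed Numba IR
# Alternatively, setting NUMBA_DUMP_IR=1 in the environment dumps the
# untyped Numba IR for every function as it is compiled.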

I think any issues identified in the above are likely to be quite involved and will require looking into the generated LLVM IR more deeply. From a quick look, it seems like the “faster” loops are potentially running more quickly because the most often taken route through the loop body has no reference counting operations present. It could also be that LLVM has managed to “prove” something about the loop bounds in the enumerate/iter cases and has optimised further.
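
One rough way to probe the reference-counting hypothesis (a sketch, not something from the thread; NRT_incref/NRT_decref are the names Numba's runtime refcount helpers carry in the generated IR) is to count those calls in the optimised LLVM IR of each compiled signature:

# Assumes process, process2 and process3 have already been warmed up.
for func in (process, process2, process3):
    for sig, llvm_ir in func.inspect_llvm().items():
        print(func.py_func.__name__, sig,
              "incref:", llvm_ir.count("NRT_incref"),
              "decref:", llvm_ir.count("NRT_decref"))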

Numba has tools to help look at this sort of thing if you are interested. Adding the debug=True flag to the @jit decorator and then doing e.g.:

process_enumerate.inspect_cfg(process_enumerate.signatures[0], view=True, interleave=True)

will show the LLVM CFG for the first signature compiled for the function process_enumerate, rendered locally as a PDF document (view=True does this); docs: https://numba.readthedocs.io/en/stable/reference/jit-compilation.html#Dispatcher.inspect_cfg.

0 reactions
github-actions[bot] commented, May 17, 2022

This issue is marked as stale as it has had no activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with any updates and confirm that this issue still needs to be addressed.
