Make NRT stats counters optional, off by default
Reporting a bug
- I have tried using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
- I have included a self contained code sample to reproduce the problem. i.e. it’s possible to run as ‘python bug.py’.
As discussed at last week's public Numba meeting (https://github.com/numba/numba/wiki/Minutes_2022_06_07), code where:
- there are potentially quite a few allocations/deallocations in a function, often caused by array temporaries
and
- the function is called by many native threads
results in a lot of pressure on the atomic stats counters defined here: https://github.com/numba/numba/blob/080633aa333e69781888eee6521f5e2fa6751c75/numba/core/runtime/nrt.c#L53. Their use is commented out in the patch below:
```diff
diff --git a/numba/core/runtime/nrt.c b/numba/core/runtime/nrt.c
index 3a65c9b..213251a 100644
--- a/numba/core/runtime/nrt.c
+++ b/numba/core/runtime/nrt.c
@@ -182,7 +182,7 @@ void NRT_MemInfo_init(NRT_MemInfo *mi,void *data, size_t size,
     mi->external_allocator = external_allocator;
     NRT_Debug(nrt_debug_print("NRT_MemInfo_init mi=%p external_allocator=%p\n", mi, external_allocator));
     /* Update stats */
-    TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
+    // TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
 }

 NRT_MemInfo *NRT_MemInfo_new(void *data, size_t size,
@@ -321,7 +321,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
     NRT_Debug(nrt_debug_print("NRT_dealloc meminfo: %p external_allocator: %p\n", mi, mi->external_allocator));
     if (mi->external_allocator) {
         mi->external_allocator->free(mi, mi->external_allocator->opaque_data);
-        TheMSys.atomic_inc(&TheMSys.stats_free);
+        // TheMSys.atomic_inc(&TheMSys.stats_free);
     } else {
         NRT_Free(mi);
     }
@@ -329,7 +329,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
 void NRT_MemInfo_destroy(NRT_MemInfo *mi) {
     NRT_dealloc(mi);
-    TheMSys.atomic_inc(&TheMSys.stats_mi_free);
+    // TheMSys.atomic_inc(&TheMSys.stats_mi_free);
 }

 void NRT_MemInfo_acquire(NRT_MemInfo *mi) {
@@ -472,7 +472,7 @@ void* NRT_Allocate_External(size_t size, NRT_ExternalAllocator *allocator) {
         ptr = TheMSys.allocator.malloc(size);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External bytes=%zu ptr=%p\n", size, ptr));
     }
-    TheMSys.atomic_inc(&TheMSys.stats_alloc);
+    // TheMSys.atomic_inc(&TheMSys.stats_alloc);
     return ptr;
 }

@@ -486,7 +486,7 @@ void *NRT_Reallocate(void *ptr, size_t size) {
 void NRT_Free(void *ptr) {
     NRT_Debug(nrt_debug_print("NRT_Free %p\n", ptr));
     TheMSys.allocator.free(ptr);
-    TheMSys.atomic_inc(&TheMSys.stats_free);
+    // TheMSys.atomic_inc(&TheMSys.stats_free);
 }
```
I think the Python code below reproduces the situation described above (and at the public meeting), applying pressure on the atomic counters; it uses the count_muons example as given at the meeting:
```python
import ctypes
import numba
from numba import cfunc, carray, types, njit, prange
from numba.extending import intrinsic
import numpy as np
from timeit import default_timer


@cfunc(types.intp(types.CPointer(types.float32)))
def count_muons(ptr):
    arr = carray(ptr, 10)
    return np.count_nonzero((arr > 1.) & (np.abs(arr) < 7.) & (arr > 0.))


x = np.arange(10, dtype=np.float32)
# call the cfunc from the interpreter via its ctypes callback
count = count_muons.ctypes(x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)))
print("cfunc call", count)


@intrinsic
def addr_as_float32ptr(tyctx, addr):
    sig = types.CPointer(types.float32)(types.intp)

    def codegen(cgctx, builder, sig, llargs):
        ret = builder.inttoptr(llargs[0], cgctx.get_value_type(sig.return_type))
        return ret
    return sig, codegen


@njit
def liveness_sink(*args):
    # keeps things that are only referenced by pointers alive
    pass


@njit(parallel=True)
def foo(n):
    acc0 = acc1 = acc2 = acc3 = 0
    for i in prange(n):
        input_arr = np.arange(10).astype(np.float32)
        addr = input_arr.ctypes.data
        ptr = addr_as_float32ptr(addr)
        acc0 += count_muons(ptr)
        acc1 += count_muons(ptr)
        acc2 += count_muons(ptr)
        acc3 += count_muons(ptr)
        liveness_sink(input_arr)
    return acc0 + acc1 + acc2 + acc3


# make this bigger to run more arrays
n = 4000000
tstart = default_timer()
out = foo(n)
tend = default_timer()
print("jit call", out)
assert out == n * 5 * 4  # n loops with 5 things counted in 4 accumulators
print(f"Elapsed time: {tend-tstart}")
```
Without the patch above, the result is:
Elapsed time: 19.056538556000305
but with the patch above (i.e. with the atomic counters disabled):
Elapsed time: 6.221216844001901
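For context on where that roughly 3x gap goes: every allocation and deallocation in every thread performs an atomic read-modify-write on the same handful of global counters, so all threads serialise on the cache line holding them. The following standalone C program is a sketch of just that access pattern, not Numba code; the `stats_alloc` name and thread/iteration counts are only stand-ins chosen to mimic `TheMSys.stats_alloc` under heavy multi-threaded allocation (assumed build: `cc -O2 -pthread shared_counter.c`).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   10000000UL

/* One shared counter, mimicking TheMSys.stats_alloc in nrt.c. */
static atomic_size_t stats_alloc;

static void *worker(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < NITERS; i++) {
        /* every simulated allocation is an atomic RMW on the same
         * location, which is where the cross-thread pressure comes from */
        atomic_fetch_add_explicit(&stats_alloc, 1, memory_order_relaxed);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("stats_alloc = %zu\n", (size_t)atomic_load(&stats_alloc));
    return 0;
}
```

Comparing its runtime against a copy with the increment removed gives a rough idea of how much of the gap above is attributable to the shared counters alone.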
It will probably be easier to fix this after https://github.com/numba/numba/pull/8106 is merged.
CC @jpivarski
Comments (20 in total, 16 by maintainers):
Am hoping once #8235 is merged, this will be fixed. It is the final PR in a series of patches that have:
- ensured that `memset`-ing known bytes into allocations/deallocations is only on in a specific NRT debug mode; this patch also made it so that "slow" alignment checks are only used if the alignment requirement is not a power of 2 (#8200)
- moved the NRT to C++ and made it use `<atomic>` so as to simplify the code used to perform atomic operations; this also removed the injection of LLVM-compiled function pointers at runtime (#8106)
All of these patches combined lead to the original MWR (https://github.com/numba/numba/issues/8156#issue-1269644106) running at a much improved 6.1s locally (it was originally 19s). There should now be no pressure caused by waiting on atomic locks in the NRT, or superfluous work being performed in the same, but debug capability is still present through environment variables. Hope this helps!
Thanks for the input @aseyboldt.
When Numba first compiles something needing the NRT, the NRT is initialised and the NRT dynamic library symbols are wired in. The code path starts here: https://github.com/numba/numba/blob/96224dd2480b9ad6d115be84decc8041e31ed55b/numba/core/runtime/nrt.py#L15
This is somewhere along the lines of what I'm considering, but long term I'd like to go much further, as Numba has the LLVM JIT compiler at its disposal. If what is currently the "NRT dynamic shared library" were not a compiled library but LLVM bitcode, it could take part in optimisation as part of the code Numba generates for a given function. The NRT initialisation routine could then just take an "enable statistics counters" flag, and LLVM would hopefully constant-propagate it and optimise out the branches containing the counters, depending on the value of that flag.
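A minimal C sketch of that flag idea, with hypothetical names (`stats_enabled`, `count_alloc`, `the_msys_sketch`), not the actual NRT implementation: if the NRT were linked in as bitcode and the flag's value were known at compile time, the guard and the atomic increment behind it become dead code that LLVM can remove after inlining.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool stats_enabled;         /* hypothetical flag, off by default */
    atomic_size_t stats_alloc;  /* stand-in for TheMSys.stats_alloc  */
} memsys_sketch_t;

static memsys_sketch_t the_msys_sketch;

static inline void count_alloc(void)
{
    /* when stats_enabled is known to be false at compile time, this
     * branch and the contended atomic add behind it can be removed */
    if (the_msys_sketch.stats_enabled)
        atomic_fetch_add(&the_msys_sketch.stats_alloc, 1);
}

int main(void)
{
    count_alloc();                          /* stats off: no atomic op */
    the_msys_sketch.stats_enabled = true;   /* opt in, e.g. for debugging */
    count_alloc();
    printf("allocs counted: %zu\n",
           (size_t)atomic_load(&the_msys_sketch.stats_alloc));
    return 0;
}
```

Even if the NRT stays a shared library and the branch cannot be folded away, checking a plain boolean should still be much cheaper than an unconditional atomic increment, just not free.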
I think converting the NRT to bitcode lets Numba do this. The allocation functions like `malloc`/`free` will then be "seen" by LLVM as precisely those calls, which can permit optimisations like converting small, constant-sized `malloc`/`free` pairs in tight loops into stack allocations.
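A small sketch of the pattern being described, under the assumption that the optimiser can see these calls as the standard `malloc`/`free`; whether LLVM actually performs the heap-to-stack promotion depends on the optimisation level and target, so this only illustrates the candidate shape, not a guaranteed transform.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long sum_rows(const long *src, size_t nrows)
{
    long total = 0;
    for (size_t i = 0; i < nrows; i++) {
        /* small, constant-sized allocation that lives only inside the
         * loop body: a candidate for promotion to a stack slot once the
         * optimiser knows these really are malloc/free */
        long *tmp = malloc(10 * sizeof *tmp);
        if (!tmp)
            return -1;
        memcpy(tmp, src + i * 10, 10 * sizeof *tmp);
        for (int j = 0; j < 10; j++)
            total += tmp[j];
        free(tmp);
    }
    return total;
}

int main(void)
{
    long data[30];
    for (int i = 0; i < 30; i++)
        data[i] = i;
    printf("%ld\n", sum_rows(data, 3));  /* 0 + 1 + ... + 29 = 435 */
    return 0;
}
```

While the allocator lives inside a separately compiled NRT library, the same allocation is an opaque external call that LLVM must conservatively treat as having side effects, so no such promotion can happen.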