Make NRT stats counters optional, off by default
Reporting a bug
- I have tried using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
- I have included a self contained code sample to reproduce the problem. i.e. it’s possible to run as ‘python bug.py’.
As discussed at last week's public Numba meeting (https://github.com/numba/numba/wiki/Minutes_2022_06_07), code where:
- there are potentially quite a few allocations/deallocations in a function, often caused by array temporaries
and
- the function is called by many native threads
results in a lot of pressure on the atomic stats counters defined here: https://github.com/numba/numba/blob/080633aa333e69781888eee6521f5e2fa6751c75/numba/core/runtime/nrt.c#L53. Their use is commented out in the patch below:
```diff
diff --git a/numba/core/runtime/nrt.c b/numba/core/runtime/nrt.c
index 3a65c9b..213251a 100644
--- a/numba/core/runtime/nrt.c
+++ b/numba/core/runtime/nrt.c
@@ -182,7 +182,7 @@ void NRT_MemInfo_init(NRT_MemInfo *mi,void *data, size_t size,
     mi->external_allocator = external_allocator;
     NRT_Debug(nrt_debug_print("NRT_MemInfo_init mi=%p external_allocator=%p\n", mi, external_allocator));
     /* Update stats */
-    TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
+    // TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
 }

 NRT_MemInfo *NRT_MemInfo_new(void *data, size_t size,
@@ -321,7 +321,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
     NRT_Debug(nrt_debug_print("NRT_dealloc meminfo: %p external_allocator: %p\n", mi, mi->external_allocator));
     if (mi->external_allocator) {
         mi->external_allocator->free(mi, mi->external_allocator->opaque_data);
-        TheMSys.atomic_inc(&TheMSys.stats_free);
+        // TheMSys.atomic_inc(&TheMSys.stats_free);
     } else {
         NRT_Free(mi);
     }
@@ -329,7 +329,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
 void NRT_MemInfo_destroy(NRT_MemInfo *mi) {
     NRT_dealloc(mi);
-    TheMSys.atomic_inc(&TheMSys.stats_mi_free);
+    // TheMSys.atomic_inc(&TheMSys.stats_mi_free);
 }

 void NRT_MemInfo_acquire(NRT_MemInfo *mi) {
@@ -472,7 +472,7 @@ void* NRT_Allocate_External(size_t size, NRT_ExternalAllocator *allocator) {
         ptr = TheMSys.allocator.malloc(size);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External bytes=%zu ptr=%p\n", size, ptr));
     }
-    TheMSys.atomic_inc(&TheMSys.stats_alloc);
+    // TheMSys.atomic_inc(&TheMSys.stats_alloc);
     return ptr;
 }

@@ -486,7 +486,7 @@ void *NRT_Reallocate(void *ptr, size_t size) {
 void NRT_Free(void *ptr) {
     NRT_Debug(nrt_debug_print("NRT_Free %p\n", ptr));
     TheMSys.allocator.free(ptr);
-    TheMSys.atomic_inc(&TheMSys.stats_free);
+    // TheMSys.atomic_inc(&TheMSys.stats_free);
 }
```
I think the Python code below reproduces the situation described above (and at the public meeting), applying pressure on the atomic counters; it uses the count_muons example as given at the meeting:
```python
import ctypes
import numba
from numba import cfunc, carray, types, njit, prange
from numba.extending import intrinsic
import numpy as np
from timeit import default_timer


@cfunc(types.intp(types.CPointer(types.float32)))
def count_muons(ptr):
    arr = carray(ptr, 10)
    return np.count_nonzero((arr > 1.) & (np.abs(arr) < 7.) & (arr > 0.))


x = np.arange(10, dtype=np.float32)
# call the cfunc from the interpreter via its ctypes callback
count = count_muons.ctypes(x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)))
print("cfunc call", count)


@intrinsic
def addr_as_float32ptr(tyctx, addr):
    sig = types.CPointer(types.float32)(types.intp)

    def codegen(cgctx, builder, sig, llargs):
        ret = builder.inttoptr(llargs[0], cgctx.get_value_type(sig.return_type))
        return ret
    return sig, codegen


@njit
def liveness_sink(*args):
    # keeps things that are only referenced by pointers alive
    pass


@njit(parallel=True)
def foo(n):
    acc0 = acc1 = acc2 = acc3 = 0
    for i in prange(n):
        input_arr = np.arange(10).astype(np.float32)
        addr = input_arr.ctypes.data
        ptr = addr_as_float32ptr(addr)
        acc0 += count_muons(ptr)
        acc1 += count_muons(ptr)
        acc2 += count_muons(ptr)
        acc3 += count_muons(ptr)
        liveness_sink(input_arr)
    return acc0 + acc1 + acc2 + acc3


# make this bigger to run more arrays
n = 4000000
tstart = default_timer()
out = foo(n)
tend = default_timer()
print("jit call", out)
assert out == n * 5 * 4  # n loops with 5 things counted in 4 accumulators
print(f"Elapsed time: {tend-tstart}")
```
Without the patch above, the result is:
Elapsed time: 19.056538556000305
but with the patch above (i.e. with the atomic counters disabled):
Elapsed time: 6.221216844001901
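For context on where that roughly 3x gap goes: every allocation and deallocation in every thread performs an atomic read-modify-write on the same handful of global counters, so all threads serialise on the cache line holding them. The following standalone C program is a sketch of just that access pattern, not Numba code; the `stats_alloc` name and thread/iteration counts are only stand-ins chosen to mimic `TheMSys.stats_alloc` under heavy multi-threaded allocation (assumed build: `cc -O2 -pthread shared_counter.c`).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   10000000UL

/* One shared counter, mimicking TheMSys.stats_alloc in nrt.c. */
static atomic_size_t stats_alloc;

static void *worker(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < NITERS; i++) {
        /* every simulated allocation is an atomic RMW on the same
         * location, which is where the cross-thread pressure comes from */
        atomic_fetch_add_explicit(&stats_alloc, 1, memory_order_relaxed);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("stats_alloc = %zu\n", (size_t)atomic_load(&stats_alloc));
    return 0;
}
```

Comparing its runtime against a copy with the increment removed gives a rough idea of how much of the gap above is attributable to the shared counters alone.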
It will probably be easier to fix this after https://github.com/numba/numba/pull/8106 is merged.
CC @jpivarski
Comments (20 in total, 16 by maintainers):
Am hoping once #8235 is merged, this will be fixed. It is the final PR in a series of patches that have:
- ensured that `memset`-ing known bytes into allocations/deallocations is only on in a specific NRT debug mode; this patch also made it so that "slow" alignment checks are only used if the alignment requirement is not a power of 2 (#8200)
- moved the NRT to C++ and made it use `<atomic>` so as to simplify the code used to perform atomic operations; this also removed the injection of LLVM-compiled function pointers at runtime (#8106)
All of these patches combined lead to the original MWR (https://github.com/numba/numba/issues/8156#issue-1269644106) running at a much improved 6.1s locally (it was originally 19s). There should now be no pressure caused by waiting on atomic locks in the NRT, or superfluous work being performed in the same, but debug capability is still present through environment variables. Hope this helps!
Thanks for the input @aseyboldt.
When Numba first compiles something needing the NRT, the NRT is initialised and the NRT dynamic library symbols are wired in. The code path starts here: https://github.com/numba/numba/blob/96224dd2480b9ad6d115be84decc8041e31ed55b/numba/core/runtime/nrt.py#L15
This is somewhere along the lines of what I'm considering, but long term I'd like to go much further, as Numba has the LLVM JIT compiler at its disposal. If what is currently the "NRT dynamic shared library" were not a compiled library but LLVM bitcode, it could take part in optimisation as part of the code Numba generates for a given function. The NRT initialisation routine could then just take an "enable statistics counters" flag, and LLVM would hopefully constant-propagate it and optimise out the branches containing the counters, depending on the value of that flag.
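A minimal C sketch of that flag idea, with hypothetical names (`stats_enabled`, `count_alloc`, `the_msys_sketch`), not the actual NRT implementation: if the NRT were linked in as bitcode and the flag's value were known at compile time, the guard and the atomic increment behind it become dead code that LLVM can remove after inlining.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool stats_enabled;         /* hypothetical flag, off by default */
    atomic_size_t stats_alloc;  /* stand-in for TheMSys.stats_alloc  */
} memsys_sketch_t;

static memsys_sketch_t the_msys_sketch;

static inline void count_alloc(void)
{
    /* when stats_enabled is known to be false at compile time, this
     * branch and the contended atomic add behind it can be removed */
    if (the_msys_sketch.stats_enabled)
        atomic_fetch_add(&the_msys_sketch.stats_alloc, 1);
}

int main(void)
{
    count_alloc();                          /* stats off: no atomic op */
    the_msys_sketch.stats_enabled = true;   /* opt in, e.g. for debugging */
    count_alloc();
    printf("allocs counted: %zu\n",
           (size_t)atomic_load(&the_msys_sketch.stats_alloc));
    return 0;
}
```

Even if the NRT stays a shared library and the branch cannot be folded away, checking a plain boolean should still be much cheaper than an unconditional atomic increment, just not free.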
I think converting the NRT to bitcode lets Numba do this. The allocation functions like `malloc`/`free` will then be "seen" by LLVM as precisely those calls, which can permit optimisations like converting small, constant-sized `malloc`/`free` pairs in tight loops into stack allocations.
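A small sketch of the pattern being described, under the assumption that the optimiser can see these calls as the standard `malloc`/`free`; whether LLVM actually performs the heap-to-stack promotion depends on the optimisation level and target, so this only illustrates the candidate shape, not a guaranteed transform.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long sum_rows(const long *src, size_t nrows)
{
    long total = 0;
    for (size_t i = 0; i < nrows; i++) {
        /* small, constant-sized allocation that lives only inside the
         * loop body: a candidate for promotion to a stack slot once the
         * optimiser knows these really are malloc/free */
        long *tmp = malloc(10 * sizeof *tmp);
        if (!tmp)
            return -1;
        memcpy(tmp, src + i * 10, 10 * sizeof *tmp);
        for (int j = 0; j < 10; j++)
            total += tmp[j];
        free(tmp);
    }
    return total;
}

int main(void)
{
    long data[30];
    for (int i = 0; i < 30; i++)
        data[i] = i;
    printf("%ld\n", sum_rows(data, 3));  /* 0 + 1 + ... + 29 = 435 */
    return 0;
}
```

While the allocator lives inside a separately compiled NRT library, the same allocation is an opaque external call that LLVM must conservatively treat as having side effects, so no such promotion can happen.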