
Make NRT stats counters optional, off by default

See original GitHub issue

Reporting a bug

  • I have tried using the latest released version of Numba (the most recent is visible in the change log: https://github.com/numba/numba/blob/main/CHANGE_LOG).
  • I have included a self-contained code sample to reproduce the problem, i.e. it can be run as `python bug.py`.

As discussed at last week’s public Numba meeting (https://github.com/numba/numba/wiki/Minutes_2022_06_07), code where:

  • there are potentially quite a few allocations/deallocations in a function, often caused by array temporaries

and

  • the function is called by many native threads

results in a lot of pressure on the atomic stats counters defined here: https://github.com/numba/numba/blob/080633aa333e69781888eee6521f5e2fa6751c75/numba/core/runtime/nrt.c#L53. Their use sites are commented out in the following patch:

diff --git a/numba/core/runtime/nrt.c b/numba/core/runtime/nrt.c
index 3a65c9b..213251a 100644
--- a/numba/core/runtime/nrt.c
+++ b/numba/core/runtime/nrt.c
@@ -182,7 +182,7 @@ void NRT_MemInfo_init(NRT_MemInfo *mi,void *data, size_t size,
     mi->external_allocator = external_allocator;
     NRT_Debug(nrt_debug_print("NRT_MemInfo_init mi=%p external_allocator=%p\n", mi, external_allocator));
     /* Update stats */
-    TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
+//     TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
 }
 
 NRT_MemInfo *NRT_MemInfo_new(void *data, size_t size,
@@ -321,7 +321,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
     NRT_Debug(nrt_debug_print("NRT_dealloc meminfo: %p external_allocator: %p\n", mi, mi->external_allocator));
     if (mi->external_allocator) {
         mi->external_allocator->free(mi, mi->external_allocator->opaque_data);
-        TheMSys.atomic_inc(&TheMSys.stats_free);
+//         TheMSys.atomic_inc(&TheMSys.stats_free);
     } else {
         NRT_Free(mi);
     }
@@ -329,7 +329,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
 
 void NRT_MemInfo_destroy(NRT_MemInfo *mi) {
     NRT_dealloc(mi);
-    TheMSys.atomic_inc(&TheMSys.stats_mi_free);
+//     TheMSys.atomic_inc(&TheMSys.stats_mi_free);
 }
 
 void NRT_MemInfo_acquire(NRT_MemInfo *mi) {
@@ -472,7 +472,7 @@ void* NRT_Allocate_External(size_t size, NRT_ExternalAllocator *allocator) {
         ptr = TheMSys.allocator.malloc(size);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External bytes=%zu ptr=%p\n", size, ptr));
     }
-    TheMSys.atomic_inc(&TheMSys.stats_alloc);
+//     TheMSys.atomic_inc(&TheMSys.stats_alloc);
     return ptr;
 }
 
@@ -486,7 +486,7 @@ void *NRT_Reallocate(void *ptr, size_t size) {
 void NRT_Free(void *ptr) {
     NRT_Debug(nrt_debug_print("NRT_Free %p\n", ptr));
     TheMSys.allocator.free(ptr);
-    TheMSys.atomic_inc(&TheMSys.stats_free);
+//     TheMSys.atomic_inc(&TheMSys.stats_free);
 }

I think the following code reproduces the situation described above (and at the public meeting), applying pressure on the atomic counters (using the count_muons example as given at the public meeting):

import ctypes
import numba
from numba import cfunc, carray, types, njit, prange
from numba.extending import intrinsic
import numpy as np
from timeit import default_timer


@cfunc(types.intp(types.CPointer(types.float32)))
def count_muons(ptr):
    arr = carray(ptr, 10)
    return np.count_nonzero((arr > 1.) & (np.abs(arr) < 7.) & (arr > 0.))


x = np.arange(10, dtype=np.float32)

count = count_muons(x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)))
print("cfunc call", count)


@intrinsic
def addr_as_float32ptr(tyctx, addr):
    sig = types.CPointer(types.float32)(types.intp)

    def codegen(cgctx, builder, sig, llargs):
        ret = builder.inttoptr(llargs[0], cgctx.get_value_type(sig.return_type))
        return ret
    return sig, codegen


@njit
def liveness_sink(*args):
    # keeps things that are only referenced by pointers alive
    pass


@njit(parallel=True)
def foo(n):
    acc0 = acc1 = acc2 = acc3 = 0
    for i in prange(n):
        input_arr = np.arange(10).astype(np.float32)
        addr = input_arr.ctypes.data
        ptr = addr_as_float32ptr(addr)
        acc0 += count_muons(ptr)
        acc1 += count_muons(ptr)
        acc2 += count_muons(ptr)
        acc3 += count_muons(ptr)
        liveness_sink(input_arr)
    return acc0 + acc1 + acc2 + acc3


# make this bigger to run more arrays
n = 4000000

tstart = default_timer()
out = foo(n)
tend = default_timer()
print("jit call", out)
assert out == n * 5 * 4 # n loops with 5 things counted in 4 accumulators

print(f"Elapsed time: {tend-tstart}")

Without the patch above the result is:

Elapsed time: 19.056538556000305

but with the patch above (i.e. with the atomic counters disabled):

Elapsed time: 6.221216844001901

It will probably be easier to fix this after https://github.com/numba/numba/pull/8106 is merged.

CC @jpivarski

Issue Analytics

  • State:closed
  • Created a year ago
  • Comments:20 (16 by maintainers)

Top GitHub Comments

2 reactions
stuartarchibald commented, Jul 8, 2022

Am hoping once #8235 is merged, this will be fixed. It is the final PR in a series of patches that have:

  • Made it so that the use of “debug” markers (memset-ing known byte patterns into allocations/deallocations) is only enabled in a specific NRT debug mode. This patch also made the “slow” alignment checks apply only when the alignment requirement is not a power of 2. (#8200)
  • Moved the NRT to C++ and made it use <atomic> so as to simplify the code used to perform atomic operations; this also removed the injection of LLVM-compiled function pointers at runtime (#8106)!
  • Switched the atomic NRT allocation statistics counters off by default; they can be switched back on with an environment variable (as used in Numba’s testing) (#8235).

All of these patches combined lead to the original MWR (https://github.com/numba/numba/issues/8156#issue-1269644106) running at a much-improved 6.1 s locally (originally 19 s). There should now be no pressure caused by waiting on atomic locks in the NRT, and no superfluous work being performed there, but debug capability is still available through environment variables. Hope this helps!

1 reaction
stuartarchibald commented, Jun 29, 2022

Sounds great!

Thanks for the input @aseyboldt.

> I’m not entirely sure I understand how you do it yet. What do you mean by “the point of NRT initialization”? When that library is loaded by the dynamic linker the first time?

When Numba first compiles something needing the NRT, the NRT is initialised and the NRT dynamic library symbols are wired in. The code path starts here: https://github.com/numba/numba/blob/96224dd2480b9ad6d115be84decc8041e31ed55b/numba/core/runtime/nrt.py#L15


> My understanding of how this could work would be something like this:
>
> * When the package is installed (or the wheel is created, in setup.py) we create a shared library with symbols for allocating new objects. There could be two versions of those functions, one debugging version that increments counters and one that doesn't.
>
> * At runtime of the numba library, when a numba function is compiled, we check if currently the `NUMBA_NRT_DEBUG` flag is set, and based on that we choose if we want to use the `_debug` version of the allocation functions when generating code.
>
> * At runtime of the jited function it doesn't matter anymore if the `NUMBA_NRT_DEBUG` flag is set, it just uses the code that was generated in the previous step.

This is somewhere along the lines of what I’m considering, but long term I’m thinking of going much further, as Numba has the LLVM JIT compiler at its disposal. If what is currently the “NRT dynamic shared library” were not a compiled library but actually just LLVM bitcode, it could take part in optimisation as part of the code generated by Numba for a given function. The NRT initialisation routine could then just take a flag for “enable statistics counters”, and LLVM would hopefully constant-propagate this and optimise out the branches that contain the counters, depending on the value of that flag.


> I guess you also thought about getting rid of the shared nrt library altogether, so that the allocation functions are statically linked and can be seen by the optimizer? (But maybe only when NUMBA_NRT_DEBUG isn’t set?)

I think converting the NRT to bitcode lets Numba do this. The allocation functions like malloc/free will then be “seen” by LLVM as precisely those calls, which can permit optimisations like converting small constant-sized malloc/free pairs in tight loops into stack allocations.
