ENH: Reduce overhead of configurable data allocation strategy (NEP49)
Proposed new feature or change:
In NEP 49 a configurable data allocator was introduced in numpy (implemented in https://github.com/numpy/numpy/pull/17582). This mechanism adds some overhead to operations on small arrays and scalars. A benchmark with np.sqrt
shows that the overhead can be in the 5-10% range.
Benchmark details
We compare fast_handler_test_compare (numpy main with two performance-related PRs included) against fast_handler_test (the same, but with a hard-coded allocator).
Benchmark
import numpy as np
import math
import time
from numpy import sqrt

print(np.__version__)

w = np.float64(1.1)
wf = 1.1
array = np.random.rand(2)
niter = 1_200_000

for kk in range(3):
    # time sqrt on a numpy scalar
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(w)
    dt = time.perf_counter() - t0
    # time sqrt on a Python float
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(wf)
    dt2 = time.perf_counter() - t0
    # time sqrt on a small array
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(array)
    dt3 = time.perf_counter() - t0
    print(f'loop {kk}: {dt} {dt2} {dt3}')
Results of fast_handler_test_compare
1.23.0.dev0+1185.gf16125e86
loop 0: 0.7580233269982273 0.7543466200004332 0.5045701469971391
loop 1: 0.7591422369987413 0.7547550320014125 0.5020621660005418
loop 2: 0.7476994270000432 0.7537849910004297 0.5018936799970106
Results of fast_handler_test (allocator overhead removed)
1.23.0.dev0+1186.gbb76538a1
loop 0: 0.6839246829986223 0.6962255100006587 0.4676538419989811
loop 1: 0.6820040509992396 0.6967140100023244 0.468011064996972
loop 2: 0.6811004699993646 0.6971791299984034 0.4678809920005733
The allocator is retrieved for every numpy array or scalar constructed, which matters most for small arrays and scalars. The overhead comes from two places (a simplified sketch of both code paths follows below):
- In methods like PyDataMem_UserNEW the allocator is retrieved via a PyCapsule, which performs some run-time checks.
- In PyDataMem_GetHandler there is a call to PyContextVar_Get, which is expensive.
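For reference, here is a simplified sketch of what these two paths look like, based on the NEP 49 API. The real implementation lives in numpy/core/src/multiarray/alloc.c and additionally handles calloc/realloc, tracing and error reporting; the function names (sketch_*), the current_handler variable and the capsule name used here are shorthand for the actual module-level state, not the exact numpy code.

#include <Python.h>
#include <numpy/arrayobject.h>          /* PyDataMem_Handler */

static PyObject *current_handler;       /* a PyContextVar holding the active handler capsule */

/* Per-call cost 1: unwrapping the capsule re-checks its type and name. */
void *
sketch_UserNEW(size_t size, PyObject *mem_handler)
{
    PyDataMem_Handler *handler =
        (PyDataMem_Handler *)PyCapsule_GetPointer(mem_handler, "mem_handler");
    if (handler == NULL) {
        return NULL;
    }
    return handler->allocator.malloc(handler->allocator.ctx, size);
}

/* Per-call cost 2: PyContextVar_Get has to walk the thread state / context. */
PyObject *
sketch_GetHandler(void)
{
    PyObject *handler;
    if (PyContextVar_Get(current_handler, NULL, &handler) < 0) {
        return NULL;
    }
    return handler;                     /* new reference to the capsule */
}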
The first item could be addressed by replacing the attribute PyObject *mem_handler in PyArrayObject_fields (which is currently a PyCapsule) with a PyDataMem_Handler * (unless this field is exposed to the public API), as sketched below.
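A hypothetical sketch of that change; the field layout is abbreviated, only mem_handler corresponds to the real PyArrayObject_fields member, and sketch_UserNEW_direct is an invented name:

#include <Python.h>
#include <numpy/arrayobject.h>

/* Hypothetical: store the handler struct pointer directly in the array
 * object instead of a PyCapsule that wraps it. */
typedef struct {
    PyObject_HEAD
    char *data;
    /* ... remaining PyArrayObject_fields members elided ... */
    PyDataMem_Handler *mem_handler;     /* was: PyObject *mem_handler (a PyCapsule) */
} sketch_PyArrayObject_fields;

/* Allocation then needs no PyCapsule_GetPointer() call at all. */
static void *
sketch_UserNEW_direct(size_t size, PyDataMem_Handler *handler)
{
    return handler->allocator.malloc(handler->allocator.ctx, size);
}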
About the second item: PyContextVar_Get calls _PyThreadState_GET internally, so perhaps the allocator can depend on the thread? Maybe we can introduce a mechanism that skips this lookup when there is only a single allocator (e.g. when PyDataMem_SetHandler has never been called), as sketched below.
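One possible shape for such a fast path, assuming a hypothetical module-level flag that PyDataMem_SetHandler would set the first time a custom handler is installed (thread-safety of the flag is glossed over here, and sketch_get_handler_fast is an invented name):

#include <Python.h>
#include <numpy/arrayobject.h>

static int custom_handler_installed = 0;    /* would be set inside PyDataMem_SetHandler */
static PyObject *default_handler_capsule;   /* created once at module initialisation */

/* Hypothetical wrapper: only pay for the context lookup when a custom
 * handler might actually be active in the current context. */
static PyObject *
sketch_get_handler_fast(void)
{
    if (!custom_handler_installed) {
        Py_INCREF(default_handler_capsule);
        return default_handler_capsule;
    }
    return PyDataMem_GetHandler();          /* slow path: PyContextVar_Get */
}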
@mattip As the author of NEP49, can you comment on this?
The capsule is exposed via PyDataMem_GetHandler and PyDataMem_SetHandler. We could contrive a way to reduce the overhead, at the expense of making the code more complicated.

The need for PyContextVar_Get was discussed on the mailing list and summarized in this comment to the PR.

Going to close this issue for now; it seems we have settled on not worrying about this for now. If anyone ever comes back here even though it's closed, maybe that will be a reason to reconsider 😉.