ENH: Reduce overhead of configurable data allocation strategy (NEP49)
Proposed new feature or change:
In NEP 49 a configurable data allocator was introduced in numpy (implemented in https://github.com/numpy/numpy/pull/17582). This mechanism adds some overhead to operations on small arrays and scalars. A benchmark with np.sqrt
shows that the overhead can be in the 5-10% range.
Benchmark details
We compare fast_handler_test_compare (numpy main with two performance-related PRs included) against fast_handler_test (the same, but with a hard-coded allocator).
Benchmark
import numpy as np
import math
import time
from numpy import sqrt

print(np.__version__)

w = np.float64(1.1)
wf = 1.1
array = np.random.rand(2)
niter = 1_200_000

for kk in range(3):
    # time sqrt on a numpy scalar
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(w)
    dt = time.perf_counter() - t0
    # time sqrt on a Python float
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(wf)
    dt2 = time.perf_counter() - t0
    # time sqrt on a small array
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(array)
    dt3 = time.perf_counter() - t0
    print(f'loop {kk}: {dt} {dt2} {dt3}')
Results of fast_handler_test_compare
1.23.0.dev0+1185.gf16125e86
loop 0: 0.7580233269982273 0.7543466200004332 0.5045701469971391
loop 1: 0.7591422369987413 0.7547550320014125 0.5020621660005418
loop 2: 0.7476994270000432 0.7537849910004297 0.5018936799970106
Results of fast_handler_test (allocator overhead removed)
1.23.0.dev0+1186.gbb76538a1
loop 0: 0.6839246829986223 0.6962255100006587 0.4676538419989811
loop 1: 0.6820040509992396 0.6967140100023244 0.468011064996972
loop 2: 0.6811004699993646 0.6971791299984034 0.4678809920005733
The allocator is retrieved for every numpy array or scalar constructed, which matters most for small arrays and scalars. The overhead comes from two places (a simplified sketch of both code paths follows below):
- In methods like PyDataMem_UserNEW the allocator is retrieved via a PyCapsule, which performs some run-time checks.
- In PyDataMem_GetHandler there is a call to PyContextVar_Get, which is expensive.
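For reference, here is a simplified sketch of what these two paths look like, based on the NEP 49 API. The real implementation lives in numpy/core/src/multiarray/alloc.c and additionally handles calloc/realloc, tracing and error reporting; the function names (sketch_*), the current_handler variable and the capsule name used here are shorthand for the actual module-level state, not the exact numpy code.

#include <Python.h>
#include <numpy/arrayobject.h>          /* PyDataMem_Handler */

static PyObject *current_handler;       /* a PyContextVar holding the active handler capsule */

/* Per-call cost 1: unwrapping the capsule re-checks its type and name. */
void *
sketch_UserNEW(size_t size, PyObject *mem_handler)
{
    PyDataMem_Handler *handler =
        (PyDataMem_Handler *)PyCapsule_GetPointer(mem_handler, "mem_handler");
    if (handler == NULL) {
        return NULL;
    }
    return handler->allocator.malloc(handler->allocator.ctx, size);
}

/* Per-call cost 2: PyContextVar_Get has to walk the thread state / context. */
PyObject *
sketch_GetHandler(void)
{
    PyObject *handler;
    if (PyContextVar_Get(current_handler, NULL, &handler) < 0) {
        return NULL;
    }
    return handler;                     /* new reference to the capsule */
}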
The first item could be addressed by replacing the attribute PyObject *mem_handler in PyArrayObject_fields (which is currently a PyCapsule) with a PyDataMem_Handler * (unless this field is exposed to the public API), as sketched below.
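A hypothetical sketch of that change; the field layout is abbreviated, only mem_handler corresponds to the real PyArrayObject_fields member, and sketch_UserNEW_direct is an invented name:

#include <Python.h>
#include <numpy/arrayobject.h>

/* Hypothetical: store the handler struct pointer directly in the array
 * object instead of a PyCapsule that wraps it. */
typedef struct {
    PyObject_HEAD
    char *data;
    /* ... remaining PyArrayObject_fields members elided ... */
    PyDataMem_Handler *mem_handler;     /* was: PyObject *mem_handler (a PyCapsule) */
} sketch_PyArrayObject_fields;

/* Allocation then needs no PyCapsule_GetPointer() call at all. */
static void *
sketch_UserNEW_direct(size_t size, PyDataMem_Handler *handler)
{
    return handler->allocator.malloc(handler->allocator.ctx, size);
}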
About the second item: PyContextVar_Get calls _PyThreadState_GET internally, so perhaps the allocator can depend on the thread? Maybe we can introduce a mechanism that skips this lookup when there is only a single allocator (e.g. when PyDataMem_SetHandler has never been called), as sketched below.
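One possible shape for such a fast path, assuming a hypothetical module-level flag that PyDataMem_SetHandler would set the first time a custom handler is installed (thread-safety of the flag is glossed over here, and sketch_get_handler_fast is an invented name):

#include <Python.h>
#include <numpy/arrayobject.h>

static int custom_handler_installed = 0;    /* would be set inside PyDataMem_SetHandler */
static PyObject *default_handler_capsule;   /* created once at module initialisation */

/* Hypothetical wrapper: only pay for the context lookup when a custom
 * handler might actually be active in the current context. */
static PyObject *
sketch_get_handler_fast(void)
{
    if (!custom_handler_installed) {
        Py_INCREF(default_handler_capsule);
        return default_handler_capsule;
    }
    return PyDataMem_GetHandler();          /* slow path: PyContextVar_Get */
}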
@mattip As the author of NEP49, can you comment on this?
The capsule is exposed via PyDataMem_GetHandler and PyDataMem_SetHandler. We could contrive a way to reduce the overhead, at the expense of making the code more complicated.

The need for PyContextVar_Get was discussed on the mailing list and summarized in this comment to the PR.

Going to close this issue for now; it seems we have settled on not worrying about this for now. If anyone ever comes back here even though it's closed, maybe that will be a reason to reconsider 😉.