Feature request: optimize permutable nested prange
I have a function with three nested `range` loops that I optimize using `@njit(parallel=True)`. The sizes of the ranges vary from call to call, and some can be trivial (`range(1)`).
Doing some testing, I observed that the only way to parallelize all three is to use `prange` on each. But I then noticed that permuting the three loops changes the performance (my loop sizes are very unequal: one range is ~2000 and another ~20), which is understandable (I think only the outermost `prange` is actually parallelized).
What do you recommend? The best solution would be a way for Numba to understand that it can permute the loops depending on their sizes / for cache-hit optimization (think of a generalized `prange`). Numba could maybe even “test” the orderings while running and select the best one (not sure whether that makes any sense).
Nested loops have this structure:
```python
for i1 in prange(n1):
    # code before
    for i2 in prange(n2):
        # recursive structure
    # code after
```
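For concreteness, a minimal runnable instance of this pattern might look as follows (a sketch only: the summation body and array sizes are assumptions, not the actual function from the report):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def nested_sum(a):
    total = 0.0
    for i1 in prange(a.shape[0]):      # only this outermost loop is parallelized
        for i2 in prange(a.shape[1]):  # inner pranges effectively run as range
            for i3 in prange(a.shape[2]):
                total += a[i1, i2, i3]
    return total

nested_sum(np.ones((2000, 20, 20)))    # very unequal loop sizes
```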
If there is no code before or after, or if that code is fast, supporting a permutable version of `itertools.product` (call it `pproduct`) would be sufficient:
```python
for i1, i2, i3 in pproduct(range(n1), range(n2), range(n3)):
    # code before 1
    # code before 2
    # code before 3
    # code inside
    # code after 3
    # code after 2
    # code after 1
```
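In plain Python, the proposed semantics might be sketched like this (everything here is hypothetical: `pproduct` does not exist, and the largest-range-outermost heuristic is just one possible policy):

```python
import itertools

def pproduct(*ranges):
    # Hypothetical sketch: yield the full cross product, but let the
    # implementation choose the loop order. Here the largest range is
    # placed outermost, which is what a parallelizing compiler would
    # want for good work distribution across threads.
    order = sorted(range(len(ranges)), key=lambda k: -len(ranges[k]))
    for permuted in itertools.product(*(ranges[k] for k in order)):
        # Undo the permutation so callers still receive (i1, i2, i3, ...).
        original = [0] * len(ranges)
        for pos, k in enumerate(order):
            original[k] = permuted[pos]
        yield tuple(original)
```

A compiler-level version would additionally have to verify that the loop body is insensitive to iteration order, which is the condition discussed next.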
Otherwise, one would have to ensure that the “code before” inside the `prange(n2)` loop doesn’t use `i1`, either by computing a dependency graph or by providing a syntax for the user to assert it. For example:
```python
l1, l2, l3 = permutable(range(n1), range(n2), range(n3))
for i1 in l1:
    # code before
    for i2 in l2:
        # recursive structure
    # code after
```
Thanks for the request. The discussion below refers to the code from https://stackoverflow.com/questions/50255126/numba-doesnt-parallelize-range.
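The original snippet is not reproduced in this archive; the following is a hedged reconstruction of the four variants that the analysis below refers to (the summation body is an assumption, only the decorator and loop structure follow the description):

```python
import numpy as np
from numba import njit, prange

@njit
def f1(a):                       # standard CPU-compiled triple nested loop
    s = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                s += a[i, j, k]
    return s

@njit
def f2(a):                       # prange, but no parallel=True
    s = 0.0
    for i in prange(a.shape[0]):
        for j in prange(a.shape[1]):
            for k in prange(a.shape[2]):
                s += a[i, j, k]
    return s

@njit(parallel=True)
def f3(a):                       # parallel=True, but only range loops
    s = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                s += a[i, j, k]
    return s

@njit(parallel=True)
def f4(a):                       # parallel=True and prange loops
    s = 0.0
    for i in prange(a.shape[0]):
        for j in prange(a.shape[1]):
            for k in prange(a.shape[2]):
                s += a[i, j, k]
    return s
```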
As Numba uses JIT compilation, the timing part of that code includes the compilation time as well as the execution time in the reported figures. Editing the timing so that it measures execution only makes little difference in this case, as the compute part of the code is quite heavy for the given `n`, but it is something that should always be considered.
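A common pattern for timing execution only (a sketch, reusing the hypothetical `f4` and the `numpy` import from the reconstruction above) is to call the function once to trigger compilation before starting the clock:

```python
import time

a = np.ones((100, 100, 100))

f4(a)                            # first call: triggers JIT compilation

start = time.perf_counter()
f4(a)                            # second call: runs the already-compiled code
print(time.perf_counter() - start)
```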
As to the four variants of the loop and what is observed in the timings:

- Function `f1` is a standard CPU-compiled triple nested loop.
- Function `f2` has loops declared with `prange` but no `parallel=True` option set in `njit`; as a result the compiler sees `prange` as an alias of `range`.
- Function `f3` has `parallel=True`, so analysis for transforming the code to execute in parallel does happen; however, the analysis correctly decides that there is nothing to parallelize (no loop was declared as parallel with `prange`). The example in the parallel documentation that contains `range` is parallelizing the loop body, which contains many computations on arrays; it is these that are fused and transformed into a parallel region. In that example it is also worth noting that `w` is a loop-carried dependency (iteration `i+1` needs the result from iteration `i`), so an embarrassingly parallel loop execution is not possible.
- Function `f4` has `parallel=True` and all the loops declared with `prange`. This allows analysis for transforming the code to execute in parallel, and as there are explicitly declared parallel loops suitable for the transformation, the transform is done and the code runs more quickly.

Declaring `prange` on inner loops when there is an outer `prange` loop translates to the inner ones being run as `range` loops. This prevents nested parallelism and also makes larger work blocks available per thread.
The information about what `parallel=True` is doing can be found by setting the environment variable `NUMBA_DEBUG_ARRAY_OPT_STATS`. With this set, the terminal output shows that `f3` has no parfor transform, while `f4` has 3 parallel loops identified from `prange`, which are fused into a single loop (loop 2, the outer one).

In answer to your feature request: at present Numba compiles code based on the types of the arguments and not their values; it also compiles everything to machine code upfront and dispatches to compiled code based purely on type. The behaviour described in the feature request requires analysis based on run-time values, and so is more amenable to a tracing JIT, which could feasibly analyse a loop-nest instance at run time and perform dynamic loop-nest optimisations. However, this is out of the scope of what Numba can do at present; I would think https://github.com/numba/numba/issues/2949 will help towards being able to achieve this, though.
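As an aside, one way to enable those diagnostics (a sketch; `NUMBA_DEBUG_ARRAY_OPT_STATS` is the variable named above, the surrounding script is hypothetical) is to set the variable before Numba is imported:

```python
import os
os.environ["NUMBA_DEBUG_ARRAY_OPT_STATS"] = "1"  # set before importing numba so its config picks it up

from numba import njit, prange  # parfor transform stats now print during compilation
```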
This all said, I think you will get loop-nest optimisation from the LLVM backend via e.g. loop interchange if you use loop bounds that are fixed at compile time, e.g. if you declare your loops with a fixed size like `range(20)`.

Closing this question as it seems to be resolved. Numba now has a discourse forum, https://numba.discourse.group/, which is great for questions like this, so please do consider posting there in future 😃 Thanks!