
Support inlining C and C++ (or even LLVM IR) code into nopython-jitted function/class

See original GitHub issue

Feature request

It would be great if it were possible to inline regular C or C++ code into a nopython-jitted function, something like the following:

@numba.njit
def f(a):
    c_funcs = numba.c_func("""
        inline int add(int a, int b) { return a + b; }
        inline int mul(int a, int b) { return a * b; }
    """)
    b = 3
    for i in range(5):
        a = c_funcs.mul(c_funcs.add(a, a), b)
    return a

The main idea here is that the code of the C functions (add, mul) should be inlined into f() and optimized by LLVM as a whole.

Of course, there is CFFI support, which allows compiling arbitrary C functions into a .pyd module and then using them inside an njitted function. The drawback is that these C functions are not inlined into the njitted code (they are called by address), and hence are not optimized by LLVM together with it.
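
For reference, a minimal sketch of that CFFI route (module and function names such as _cmod are illustrative; a working C compiler is assumed, and registration goes through numba.core.typing.cffi_utils.register_module as in current Numba):

from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("int add(int a, int b);")
ffibuilder.set_source("_cmod", "int add(int a, int b) { return a + b; }")
ffibuilder.compile()  # builds the _cmod extension module

import _cmod
from numba.core.typing import cffi_utils
cffi_utils.register_module(_cmod)  # make the module usable from nopython code

from numba import njit

c_add = _cmod.lib.add  # must be bound to a global to be used inside njit

@njit
def f(a):
    return c_add(a, a)  # works, but reached via a function pointer: not inlined

print(f(3))  # -> 6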

I think there should be some way to mix Numba's Python code and C/C++ code directly, because not everything can be done in pure Python.

For example, if I want to do a u64 x u64 -> u128 multiplication, there is no such single-instruction operation in Python or Numba, while in C/C++ it can be done with unsigned __int128 c = (unsigned __int128)uint64_t(a) * uint64_t(b); in Clang, or uint64_t hi; uint64_t lo = _umul128(a, b, &hi); in MSVC. Either way it compiles down to a single mul assembler instruction taking a few CPU cycles. In Python you can't express this as one CPU instruction.
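
For contrast, a sketch of what emulating that full-width multiply costs in pure Numba Python, via the classic 32-bit limb decomposition (Hacker's Delight style; umul128 is an illustrative name):

import numpy as np
from numba import njit

@njit
def umul128(a, b):
    # a, b: np.uint64; split each operand into 32-bit halves
    mask = np.uint64(0xFFFFFFFF)
    s32 = np.uint64(32)
    a_lo, a_hi = a & mask, a >> s32
    b_lo, b_hi = b & mask, b >> s32
    # four partial products with carry propagation instead of one mul
    t = a_lo * b_lo
    k = t >> s32
    t = a_hi * b_lo + k
    w2, w1 = t & mask, t >> s32
    k = (a_lo * b_hi + w2) >> s32
    hi = a_hi * b_hi + w1 + k
    lo = a * b  # low 64 bits wrap around modulo 2**64
    return hi, lo

print(umul128(np.uint64(1) << np.uint64(32), np.uint64(1) << np.uint64(32)))  # -> (1, 0)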

Of course, one can write an array-wide u128 multiplication as a C function via CFFI, and then the non-inlined call overhead is amortized and small. But it is not always possible to act on a whole array: for example, I want to implement a jitclass that emulates u128 and use this u128 class everywhere for single-value variables in njitted mathematical code where there is no array work at all.

Another use case is a jitclass that emulates BigInteger, so that BigInteger (similar to Python's int) becomes available in nopython functions. Of course, an efficient single-value (non-array) BigInteger is impossible to implement without inlineable C/C++ functions.

Why is C/C++ inlining crucial? Because it often happens that Numba's Python lacks some operation that C/C++ (or even assembler) can express in 1-3 CPU instructions. A non-inlined function call wrapping 1-3 instructions has far too much overhead.

Also, since Numba is LLVM-based, it would be great to be able to inline LLVM IR (LLVM Intermediate Representation), or some other assembler-like language, as well. When Python code is jitted it is of course converted to LLVM IR at some point, so inlining one piece of LLVM IR into another looks like a natural thing.

Inlining LLVM IR would let anybody inline code from any language. For example, you don't support Rust, but a Rust developer can compile Rust to LLVM IR (rustc officially targets LLVM) and then inline that IR into your nopython-jitted code. Hence LLVM IR inlining would effectively support every language that is based on LLVM.
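
As a hint that much of the plumbing already exists, llvmlite (which Numba builds on) can parse textual LLVM IR, e.g. as produced by clang -S -emit-llvm or rustc --emit=llvm-ir. A minimal sketch that only parses and verifies a module (linking it into Numba-generated code is what the maintainer's example below demonstrates):

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

ir_text = """
define i64 @add(i64 %a, i64 %b) {
entry:
  %c = add i64 %a, %b
  ret i64 %c
}
"""
mod = llvm.parse_assembly(ir_text)  # raises on malformed IR
mod.verify()
print(mod)  # round-trips the module back to IR text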

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
stuartarchibald commented, Sep 28, 2021

@polkovnikov here's an example of how to do what's in the OP; the other cases you have mentioned are simplifications of it. I hope at some point to extract some useful parts into Numba's public extension API (the part about linking in some bitcode). The thing I've not sorted out yet in this example is the forcible inlining of the functions defined in the C source.

from numba import njit, types, literally
from numba.extending import overload, intrinsic
from numba.core import cgutils
import numpy as np
import subprocess
import tempfile
import llvmlite.binding as llvm
from llvmlite import ir
from collections import namedtuple, OrderedDict

def compile_cfunc(string, sigs):
    # pure-Python stub; the real implementation is provided by the @overload below
    pass

@overload(compile_cfunc)
def ol_compile_cfunc(string, sigs):
    # invoke clang
    if not isinstance(string, types.Literal):
        def impl(string, sigs):
            literally(string)
        return impl
    c_src = string.literal_value
    sig_map = sigs.initial_value

    c_module = None
    # compile the C source
    with tempfile.TemporaryDirectory() as tmpdir:
        with tempfile.NamedTemporaryFile(mode='wt',
                                         encoding='ascii',
                                         dir=tmpdir,
                                         suffix='.c') as c_src_file:
            c_src_file.write(c_src)
            c_src_file.flush()
            cmd = 'clang -emit-llvm -c'.split(' ')
            bc_file = c_src_file.name.replace('.c','.bc')
            subprocess.run(cmd + [c_src_file.name, '-o', bc_file])
            with open(bc_file, 'rb') as bc:
                bc_bytes = bc.read()
            c_module = llvm.parse_bitcode(bc_bytes)

    assert c_module is not None, "Failed to compile C code"
    c_module.verify()

    # create an ordered map of C function name to signature based on the sig map
    # this is important as the struct member name generation order needs to
    # match up with what's been generated
    funcs = [f for f in c_module.functions]
    sigs = OrderedDict()
    for func in funcs:
        assert func.name in sig_map
        sigs[func.name] = sig_map[func.name]

    # add the C module to the code library
    @intrinsic
    def add_to_ee(tyctx,):
        sig = types.none()
        def codegen(cgctx, builder, sig, llargs):
            cgctx.active_code_library.add_llvm_module(c_module)
        return sig, codegen

    # this is a dynamically created namedtuple pretending to be a struct
    c_struct = namedtuple('c_struct', [*sigs.keys()])

    # generate dispatcher stubs
    dispatchers = []
    for fname in sigs.keys():
        def gen(fname=fname):
            tysig = sigs[fname]
            sigty = eval(tysig, {}, types.__dict__)
            @intrinsic
            def gen_call(tyctx, arg):
                sig = sigty.return_type(arg)
                # make sure the incoming args match the declared signature
                declared_args = sigty.args
                presented_arg = arg
                if isinstance(presented_arg, types.containers._StarArgTupleMixin):
                    assert presented_arg.types == declared_args
                else:
                    assert 0, 'unreachable'

                def codegen(cgctx, builder, sig, llargs):
                    stararg = llargs[0]
                    tupl = cgutils.unpack_tuple(builder, stararg)
                    mod = builder.module
                    ll_arg_tys = [cgctx.get_value_type(x) for x in sigty.args]
                    ll_retty = cgctx.get_value_type(sigty.return_type)
                    ll_sig_ty = ir.FunctionType(ll_retty, ll_arg_tys)
                    fn = cgutils.get_or_insert_function(mod, ll_sig_ty, fname)
                    return builder.call(fn, tupl)
                return sig, codegen

            @njit(inline='always')
            def fncall(*args):
                return gen_call(args)

            return fncall

        dispatchers.append(gen(fname=fname))

    # create the struct instance
    c_struct_inst = c_struct(*dispatchers)

    # return this trivial function; it forces the C code module into the EE
    # and returns the c_struct containing the dispatchers from globals
    def impl(string, sigs):
        add_to_ee()
        return c_struct_inst
    return impl


@njit
def f(a):
    c_funcs = compile_cfunc("""
    extern int add(int a, int b) { return a + b; }
    extern int mul(int a, int b) { return a * b; }
    extern double fmadd(double a, double b, double c) { return a + (b * c); }
    extern double mixed_fmadd(int a, int b, double c) { return a + (b * c); }
    """,
    {'add': 'intp(intp, intp)',
     'mul': 'intp(intp, intp)',
     'fmadd': 'double(double, double, double)',
     'mixed_fmadd': 'double(intp, intp, double)'})
    b = 7
    for i in range(5):
        a = c_funcs.mul(c_funcs.add(a, a), b)
    x = c_funcs.fmadd(np.float64(a), np.float64(b), 11.)
    y = c_funcs.mixed_fmadd(a, b, 11.)
    return a, x, y

got = f(3)


def g(a):
    b = 7
    def mul(x, y):
        return x * y
    def add(x, y):
        return x + y
    def fmadd(p, q, r):
        return p + (q * r)
    for i in range(5):
        a = mul(add(a, a), b)
    return a, fmadd(a, b, 11.), fmadd(a, b, 11.)

expected = g(3)

print(f"got: {got}, expected: {expected}. OK={got==expected}")
assert got == expected

1 reaction
stuartarchibald commented, Sep 14, 2021

> @stuartarchibald Looking at your example here. If I understand correctly, this example compiles IR and wraps it into a ctypes function cfunc(a, b).
>
> If I use such a cfunc() from Numba's njitted function, then I think this function is NOT INLINED into the njitted function, right?

Correct, this example does do that. But in your case, you'd not access the function via ctypes; you'd just generate a call to it using an @intrinsic: https://numba.readthedocs.io/en/stable/extending/high-level.html#implementing-intrinsics.
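
For readers unfamiliar with @intrinsic, here is a minimal sketch of what "generate a call using an @intrinsic" means; ll_add is an illustrative name, and instead of calling an external symbol this one simply emits a single LLVM add instruction inline:

from numba import njit, types
from numba.extending import intrinsic

@intrinsic
def ll_add(typingctx, a, b):
    # typing phase: declare the signature intp(intp, intp)
    sig = types.intp(types.intp, types.intp)

    def codegen(context, builder, signature, args):
        # lowering phase: emit one LLVM `add` instruction right here,
        # with no call instruction at all
        return builder.add(args[0], args[1])

    return sig, codegen

@njit
def f(x, y):
    return ll_add(x, y)

print(f(2, 3))  # -> 5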

> By inlining I mean not just placing a call instruction into the njitted function, but actually compiling the whole njitted function together with the LLVM IR of cfunc() as a whole, the same as what the inline specifier does in C/C++. The whole njitted function's IR should be mixed with the inlined cfunc()'s IR and then optimized together by LLVM; that is what I mean by inlining.

This is understood, and I think possible.

> Your example link just does a + b. If you use such a function in some heavy computational loop, then executing an extra call instruction on each a + b is a huge overhead. And the call instruction itself is not the only cost: C++ compilers do a lot of work when inlining an inline function, such as propagating values through registers and applying bit tricks, so inlined C++ code is sometimes even 5-10 times faster than non-inlined code.

Yes, this is why you need to compile the external source to bitcode/LLVM IR and add that module to the library Numba is generating code into, so that it can all be linked together and inlining and many other related optimisations take place. (In the example above, that is what the add_llvm_module call inside add_to_ee does.)

> Same with my suggestion about C++ above: of course I can already create and call non-inlined C/C++ code through ctypes/cffi (maybe even Cython). The only reason I made the C++ proposal above is that I wanted not merely the convenience of compiling C++ from a Python string, but the ability to use the great LLVM optimizer to inline tiny functions like a + b and avoid the call instruction overhead.

> Same for LLVM IR: I want not just the ability to somehow compile and call IR bitcode, but to actually get all the inlining optimizations, the same way inline functions are optimized in C++.

> Basically, my proposal above is only about speed optimization. Without considering run speed, I can find various ways to compile C/C++/Asm/LLVM IR into a .pyd and call that module's functions from an njitted function. But I wanted my code to be fast.

I’ve got an example of how to do all this but have one more thing to work out prior to sharing it.

The conclusion from the Numba meeting was that this is probably not something Numba can support directly, due to the complexity of ensuring valid compilers, LLVM IR versions, type system behaviours, etc. However, some of the parts needed to actually implement it could well be abstracted into something Numba could support, for example, linking in an external bitcode source.
