Row-wise scalar UDFs
Hello, first off thanks for dask-sql!
I noticed (and please correct me if I am wrong) that the current process for running UDFs basically works by forwarding the relevant columns of a table directly to that Python function as arguments. That is, if I have
```python
def f(x, y):
    return x + y
```
and I write

```sql
SELECT f(col0, col1) FROM my_table
```
then the implementation will pass `col0` and `col1` as Dask Series directly as arguments to `f`, at which point `f` simply runs a binop between those two actual Series objects and returns a new Series as a result. This implementation works great but has some limitations in the space of functions a user can define compared to a "row-wise" UDF. For instance, one cannot write functions of this form:
```python
def f(x, y):
    if x > 3:  # here, type(x) is <class 'dask.dataframe.core.Series'>
        return y
    else:
        return x + 4
# ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all().
```
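For concreteness, here is a minimal sketch of both behaviors, using plain pandas Series as a stand-in for the Dask Series that dask-sql actually forwards (the data and function names are illustrative, and Dask Series reject the branching case in the same spirit):

```python
import pandas as pd

def add(x, y):
    # Vectorized binop: works fine when x and y are whole Series
    return x + y

def branch(x, y):
    # Needs scalar x: `if x > 3` must reduce to a single bool
    return y if x > 3 else x + 4

col0 = pd.Series([1, 5, 2])
col1 = pd.Series([10, 20, 30])

print(add(col0, col1).tolist())  # [11, 25, 32]

try:
    branch(col0, col1)
except ValueError:
    # pandas refuses to coerce a multi-element Series to a bool
    print("truth value of a Series is ambiguous")
```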
I was wondering if there are any thoughts/plans around supporting true row-wise ops, where within each partition every row is individually sent through the UDF as a set of scalars. Perhaps such an implementation could leverage `pd.DataFrame.apply` or similar.
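A minimal pandas-level sketch of what such row-wise evaluation could look like, using a hypothetical two-column frame (on a Dask backend the same per-partition apply would presumably be wrapped in `map_partitions`):

```python
import pandas as pd

def f(x, y):
    # x and y arrive as plain scalars here, so Python control flow works
    if x > 3:
        return y
    return x + 4

df = pd.DataFrame({"col0": [1, 5, 2], "col1": [10, 20, 30]})

# axis=1 sends each row through the UDF individually
result = df.apply(lambda row: f(row["col0"], row["col1"]), axis=1)
print(result.tolist())  # [5, 20, 6]
```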
Thanks!
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Oh! I wasn’t aware that cuDF already optimizes the `apply`s. Well, that makes things less in-performant then 😃

I am with you that this will probably increase usability, especially for new users. Probably that is something we should optimize for. I am just still a bit scared that we make it too easy to do it wrong (because I still assume that many users will have a pandas backend).
How about the following: I am happy if you would like to go ahead and find some convenient way to create those functions. When we add this to the documentation, we make sure that we have some big warning sign 😃 In this case, I am also happy.
Picking up work on this shortly and hopefully a PR to come 😃