Row-wise scalar UDFs
Hello, first off thanks for dask-sql!
I noticed (and please correct me if I am wrong) that the current process for running UDFs basically works by forwarding the relevant columns of a table directly to that Python function as arguments. That is, if I have
```python
def f(x, y):
    return x + y
```
and I write

```sql
SELECT f(col0, col1) FROM my_table
```
then the implementation will pass `col0` and `col1` as Dask Series directly as arguments to `f`, at which point `f` simply runs a binop between those two actual Series objects and returns a new Series as a result. This implementation works great but has some limitations in the space of functions a user can define compared to a "row-wise" UDF. For instance, one cannot write functions of this form:
```python
def f(x, y):
    if x > 3:  # here, type(x) is <class 'dask.dataframe.core.Series'>
        return y
    else:
        return x + 4
# ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all().
```
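For concreteness, here is a minimal sketch of both behaviors, using plain pandas Series as a stand-in for the Dask Series that dask-sql actually forwards (the data and function names are illustrative, and Dask Series reject the branching case in the same spirit):

```python
import pandas as pd

def add(x, y):
    # Vectorized binop: works fine when x and y are whole Series
    return x + y

def branch(x, y):
    # Needs scalar x: `if x > 3` must reduce to a single bool
    return y if x > 3 else x + 4

col0 = pd.Series([1, 5, 2])
col1 = pd.Series([10, 20, 30])

print(add(col0, col1).tolist())  # [11, 25, 32]

try:
    branch(col0, col1)
except ValueError:
    # pandas refuses to coerce a multi-element Series to a bool
    print("truth value of a Series is ambiguous")
```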
I was wondering if there are any thoughts/plans around supporting true row-wise ops, where within each partition every row is individually sent through the UDF as a set of scalars. Perhaps such an implementation could leverage `pd.DataFrame.apply` or similar.
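A minimal pandas-level sketch of what such row-wise evaluation could look like, using a hypothetical two-column frame (on a Dask backend the same per-partition apply would presumably be wrapped in `map_partitions`):

```python
import pandas as pd

def f(x, y):
    # x and y arrive as plain scalars here, so Python control flow works
    if x > 3:
        return y
    return x + 4

df = pd.DataFrame({"col0": [1, 5, 2], "col1": [10, 20, 30]})

# axis=1 sends each row through the UDF individually
result = df.apply(lambda row: f(row["col0"], row["col1"]), axis=1)
print(result.tolist())  # [5, 20, 6]
```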
Thanks!
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Oh! I wasn’t aware that cuDF already optimizes the `apply`s. Well, that makes things less in-performant then 😃

I am with you that this will probably increase usability, especially for new users. Probably that is something we should optimize for. I am just still a bit scared that we make it too easy to do it wrong (because I still assume that many users will have a pandas backend).
How about the following: I am happy if you would like to go ahead and find some convenient way to create those functions. When we add this to the documentation, we make sure that we have some big warning sign 😃 In this case, I am also happy.
Picking up work on this shortly and hopefully a PR to come 😃