Row-wise scalar UDFs

Hello, first off thanks for dask-sql!

I noticed (and please correct me if I am wrong) that the current process for running UDFs basically works by forwarding the relevant columns of a table directly to the Python function as arguments. That is, if I have

def f(x, y):
    return x + y

and I write

"SELECT f(col0, col1) FROM my_table"

The implementation will pass col0 and col1 as dask series directly as arguments to f, at which point f simply runs a binary operation between those two actual series objects and returns a new series as a result. This implementation works great, but it limits the space of functions a user can define compared to a “row-wise” UDF. For instance, one cannot write functions of this form:

def f(x, y):
    if x > 3: # here, type(x) is <class 'dask.dataframe.core.Series'>
        return y
    else:
        return x + 4

# ValueError: The truth value of a Series is ambiguous. Use a.any() or a.all().
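
To make the failure concrete, here is a minimal sketch that uses plain pandas Series as a stand-in for the dask series that get passed in (that substitution is an assumption made to keep the example small). It shows the branching function raising, and one possible column-wise rewrite of the same logic with Series.where. The name f_columnwise is hypothetical and not part of dask-sql.

```python
import pandas as pd


def f(x, y):
    # Branching on a whole column: "x > 3" is a boolean Series, not a bool.
    if x > 3:
        return y
    else:
        return x + 4


def f_columnwise(x, y):
    # Hypothetical column-wise rewrite: keep y where x > 3, else use x + 4.
    return y.where(x > 3, x + 4)


df = pd.DataFrame({"col0": [1, 5, 3, 7], "col1": [10, 20, 30, 40]})

# The branching version fails when it receives whole columns.
try:
    f(df["col0"], df["col1"])
    raised = False
except ValueError:
    raised = True

# The column-wise version evaluates the condition element by element.
result = f_columnwise(df["col0"], df["col1"])
```

This kind of rewrite only works when the branch can be expressed as a mask, which is exactly the limitation described above: arbitrary Python control flow has no direct column-wise equivalent.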

I was wondering if there were any thoughts/plans around supporting true row-wise ops, where within the partitions each row is individually sent through the UDF as a set of scalars. Perhaps such an implementation could leverage pd.DataFrame.apply or similar.
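
As a rough sketch of what such a row-wise path might look like on a single pandas partition, the snippet below sends each row through the UDF as scalars using DataFrame.apply with axis=1. The helper name apply_row_wise and its signature are hypothetical and are not how dask-sql is implemented.

```python
import pandas as pd


def f(x, y):
    # Here x and y are scalars, so ordinary Python branching works.
    if x > 3:
        return y
    return x + 4


def apply_row_wise(func, partition, columns):
    # Hypothetical helper: select the argument columns, then call func once
    # per row, unpacking the row's values as positional scalar arguments.
    return partition[columns].apply(lambda row: func(*row), axis=1)


partition = pd.DataFrame({"col0": [1, 5, 3, 7], "col1": [10, 20, 30, 40]})
out = apply_row_wise(f, partition, ["col0", "col1"])
```

With dask, a helper like this could presumably be run on every partition via map_partitions. Note that the UDF is called once per row in Python, so on a pandas backend this is likely to be much slower than a column-wise function.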

Thanks!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
nils-braun commented, Aug 30, 2021

Oh! I wasn’t aware that cuDF already optimizes apply calls. Well, that makes things less inefficient then 😃

I am with you that this will probably increase usability, especially for new users. Probably that is something we should optimize for. I am just still a bit scared that we make it too easy to do it wrong (because I still assume that many users will have a pandas backend).

How about the following: I am happy if you would like to go ahead and find some convenient way to create those functions. When we add this to the documentation, we make sure that we have some big warning sign 😃 In this case, I am also happy.

1 reaction
brandon-b-miller commented, Sep 16, 2021

Picking up work on this shortly and hopefully a PR to come 😃

