
feat(api): Vector Python UDFs (and UDAFs)

See original GitHub issue

duckdb does not support scalar user-defined functions written in Python (to be applied one record at a time), but it does expose a vector Python UDF via the map method:

>>> import pandas as pd
>>> import duckdb
>>> df = pd.DataFrame({"x": range(int(1e4))})
>>> def process_chunk(df_chunk):
...     print(f"processing chunk of size {df_chunk.shape[0]}")
...     return df_chunk * 2
... 
>>> duckdb.from_df(df).map(process_chunk).to_df()
processing chunk of size 0
processing chunk of size 0
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 784
          x
0         0
1         2
2         4
3         6
4         8
...     ...
9995  19990
9996  19992
9997  19994
9998  19996
9999  19998

[10000 rows x 1 columns]

The main motivation for this vector Python UDF API is probably to hide the per-record Python function call overhead. I think it’s a pragmatic API: it would make it possible, for instance, to efficiently deploy trained machine learning models for batch scoring in an out-of-core manner.
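
To make the batch-scoring use case concrete, here is a minimal sketch (not part of the original issue) that pushes a toy pre-trained scikit-learn model through duckdb’s map one chunk at a time; the model, the column name and the score_chunk helper are made up for illustration, and the empty-chunk branch handles the zero-row probe calls visible in the output above.

import duckdb
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy "pre-trained" model; in practice it would be loaded from disk.
model = LinearRegression().fit(pd.DataFrame({"x": [0.0, 1.0, 2.0]}), [0.0, 2.0, 4.0])

df = pd.DataFrame({"x": range(int(1e4))})

def score_chunk(chunk):
    # Called once per ~1024-row chunk, so only one chunk is in memory at a time.
    if chunk.empty:
        # duckdb also calls the function with empty chunks to infer the schema.
        return chunk.assign(prediction=pd.Series(dtype="float64"))
    return chunk.assign(prediction=model.predict(chunk[["x"]]))

predictions = duckdb.from_df(df).map(score_chunk).to_df()
print(predictions.head())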

Any chance to expose such vector Python UDFs via the Ibis API?

Also, if some backends include or add support for Python UDAFs (especially ones that can run in parallel, via combiners in addition to mappers and reducers), this would open the possibility of training machine learning models (e.g. with scikit-learn or PyTorch) directly via Ibis. As far as I know, duckdb unfortunately does not expose parallel Python UDAFs.
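
To sketch the shape such an API could take (purely hypothetical, not an existing duckdb or Ibis API), a parallel UDAF could be described by init/accumulate/combine/finalize functions. The toy aggregate below fakes two parallel partitions with plain pandas; a model trained via partial_fit (or whose partial states can be merged) would slot into the accumulate and combine steps.

import pandas as pd

def init_state():
    return {"sum": 0.0, "count": 0}

def accumulate(state, chunk):
    # "Mapper" step: fold one chunk of rows into a partial state.
    state["sum"] += float(chunk["x"].sum())
    state["count"] += len(chunk)
    return state

def combine(a, b):
    # "Combiner" step: merge partial states computed on different partitions.
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"]}

def finalize(state):
    # "Reducer" step: produce the final aggregate value.
    return state["sum"] / state["count"] if state["count"] else float("nan")

# Simulate two partitions being aggregated in parallel, then merged.
df = pd.DataFrame({"x": range(int(1e4))})
left = accumulate(init_state(), df.iloc[:5000])
right = accumulate(init_state(), df.iloc[5000:])
print(finalize(combine(left, right)))  # 4999.5, the mean of x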

Final side-request: for backends that only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API to hide the Python function call overhead, similarly to what duckdb is doing internally with map?
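
As a rough client-side illustration of the chunking idea (not an Ibis feature, and not the server-side SQL generation actually being requested here), one can already batch rows through a DB-API cursor and make one vectorized Python call per batch instead of one call per row; the sqlite3 table and the 1024-row batch size below are made up.

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

def double_chunk(chunk):
    # One Python call per 1024-row batch, mirroring duckdb's map behaviour.
    return chunk * 2

cursor = conn.execute("SELECT x FROM t")
chunks = []
while True:
    rows = cursor.fetchmany(1024)
    if not rows:
        break
    chunks.append(double_chunk(pd.DataFrame(rows, columns=["x"])))
print(pd.concat(chunks, ignore_index=True).shape)  # (10000, 1)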

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Oct 28, 2022

@cpcloud I drafted a proposal for Python UDAFs API in a duckdb issue if you are interested: https://github.com/duckdb/duckdb/discussions/5117.

1 reaction
cpcloud commented, Oct 23, 2022

@ogrisel Thanks for the issue. This is definitely on our radar and we’ll probably start experimenting with support for this in the next month. In fact, the DuckDB folks just pointed us to .map a few weeks ago.

Any chance to expose such vector Python UDFs via the Ibis API?

So, in short, yes there’s a great chance of this happening 😃

this would open the possibility of training machine learning models (e.g. with scikit-learn or PyTorch) directly via Ibis

This is an interesting path for us to go down; it’s great to hear a concrete use case for UDAFs since getting them to work well with a nice API and solid performance will be challenging.

for backends that only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API to hide the Python function call overhead, similarly to what duckdb is doing internally with map?

Possibly! I think we’ll need to do some prototyping before we can give a concrete yes or no to this.

Really appreciate all the issues you’re opening, it’s wonderful to get feedback from users ❤️


