
feat(api): Vector Python UDFs (and UDAFs)

See original GitHub issue

duckdb does not support scalar user-defined functions written in Python (to be applied one record at a time), but it does expose a vector Python UDF via the map method:

>>> import pandas as pd
>>> import duckdb
>>> df = pd.DataFrame({"x": range(int(1e4))})
>>> def process_chunk(df_chunk):
...     print(f"processing chunk of size {df_chunk.shape[0]}")
...     return df_chunk * 2
... 
>>> duckdb.from_df(df).map(process_chunk).to_df()
processing chunk of size 0
processing chunk of size 0
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 784
          x
0         0
1         2
2         4
3         6
4         8
...     ...
9995  19990
9996  19992
9997  19994
9998  19996
9999  19998

[10000 rows x 1 columns]

The main motivation for this vector Python UDF API is probably to hide the per-record Python function call overhead. I think it’s a pragmatic API: it would make it possible, for instance, to efficiently deploy trained machine learning models for batch scoring in an out-of-core manner.
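
To make the batch-scoring use case concrete, here is a minimal sketch (not part of the original issue) that pushes a toy pre-trained scikit-learn model through duckdb’s map one chunk at a time; the model, the column name and the score_chunk helper are made up for illustration, and the empty-chunk branch handles the zero-row probe calls visible in the output above.

import duckdb
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy "pre-trained" model; in practice it would be loaded from disk.
model = LinearRegression().fit(pd.DataFrame({"x": [0.0, 1.0, 2.0]}), [0.0, 2.0, 4.0])

df = pd.DataFrame({"x": range(int(1e4))})

def score_chunk(chunk):
    # Called once per ~1024-row chunk, so only one chunk is in memory at a time.
    if chunk.empty:
        # duckdb also calls the function with empty chunks to infer the schema.
        return chunk.assign(prediction=pd.Series(dtype="float64"))
    return chunk.assign(prediction=model.predict(chunk[["x"]]))

predictions = duckdb.from_df(df).map(score_chunk).to_df()
print(predictions.head())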

Any chance to expose such vector Python UDFs via the Ibis API?

Also, if some backends include or add support for Python UDAFs (especially ones that can run in parallel, via combiners in addition to mappers and reducers), this would open the possibility of training machine learning models (e.g. with scikit-learn or PyTorch) directly via Ibis. As far as I know, duckdb unfortunately does not expose parallel Python UDAFs.
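
To sketch the shape such an API could take (purely hypothetical, not an existing duckdb or Ibis API), a parallel UDAF could be described by init/accumulate/combine/finalize functions. The toy aggregate below fakes two parallel partitions with plain pandas; a model trained via partial_fit (or whose partial states can be merged) would slot into the accumulate and combine steps.

import pandas as pd

def init_state():
    return {"sum": 0.0, "count": 0}

def accumulate(state, chunk):
    # "Mapper" step: fold one chunk of rows into a partial state.
    state["sum"] += float(chunk["x"].sum())
    state["count"] += len(chunk)
    return state

def combine(a, b):
    # "Combiner" step: merge partial states computed on different partitions.
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"]}

def finalize(state):
    # "Reducer" step: produce the final aggregate value.
    return state["sum"] / state["count"] if state["count"] else float("nan")

# Simulate two partitions being aggregated in parallel, then merged.
df = pd.DataFrame({"x": range(int(1e4))})
left = accumulate(init_state(), df.iloc[:5000])
right = accumulate(init_state(), df.iloc[5000:])
print(finalize(combine(left, right)))  # 4999.5, the mean of x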

Final side-request: for backends that only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API to hide the Python function call overhead, similarly to what duckdb is doing internally with map?
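
As a rough client-side illustration of the chunking idea (not an Ibis feature, and not the server-side SQL generation actually being requested here), one can already batch rows through a DB-API cursor and make one vectorized Python call per batch instead of one call per row; the sqlite3 table and the 1024-row batch size below are made up.

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

def double_chunk(chunk):
    # One Python call per 1024-row batch, mirroring duckdb's map behaviour.
    return chunk * 2

cursor = conn.execute("SELECT x FROM t")
chunks = []
while True:
    rows = cursor.fetchmany(1024)
    if not rows:
        break
    chunks.append(double_chunk(pd.DataFrame(rows, columns=["x"])))
print(pd.concat(chunks, ignore_index=True).shape)  # (10000, 1)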

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
ogrisel commented, Oct 28, 2022

@cpcloud I drafted a proposal for Python UDAFs API in a duckdb issue if you are interested: https://github.com/duckdb/duckdb/discussions/5117.

1 reaction
cpcloud commented, Oct 23, 2022

@ogrisel Thanks for the issue. This is definitely on our radar and we’ll probably start experimenting with support for this in the next month. In fact, the DuckDB folks just pointed us to .map a few weeks ago.

Any chance to expose such vector Python UDFs via the Ibis API?

So, in short, yes there’s a great chance of this happening 😃

this would open the possibility of training machine learning models (e.g. with scikit-learn or PyTorch) directly via Ibis

This is an interesting path for us to go down; it’s great to hear a concrete use case for UDAFs since getting them to work well with a nice API and solid performance will be challenging.

for backends that only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API to hide the Python function call overhead, similarly to what duckdb is doing internally with map?

Possibly! I think we’ll need to do some prototyping before we can give a concrete yes or no to this.

Really appreciate all the issues you’re opening, it’s wonderful to get feedback from users ❤️


