feat(api): Vector Python UDFs (and UDAFs)
duckdb does not support scalar User Defined Functions written in Python (to be applied one record at a time), but it does expose a vector Python UDF via the map method:
>>> import pandas as pd
>>> import duckdb
>>> df = pd.DataFrame({"x": range(int(1e4))})
>>> def process_chunk(df_chunk):
...     print(f"processing chunk of size {df_chunk.shape[0]}")
...     return df_chunk * 2
...
>>> duckdb.from_df(df).map(process_chunk).to_df()
processing chunk of size 0
processing chunk of size 0
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 784
x
0 0
1 2
2 4
3 6
4 8
... ...
9995 19990
9996 19992
9997 19994
9998 19996
9999 19998
[10000 rows x 1 columns]
The main motivation for this vector Python UDF API is probably to hide the per-record Python function call overhead. I think it’s a pragmatic API, and it would make it possible, for instance, to efficiently deploy trained machine learning models for batch scoring in an out-of-core manner.
Any chance to expose such vector Python UDFs via the Ibis API?
Also, if some backends include or add support for Python UDAFs (especially with parallel execution via combiners in addition to mappers and reducers), this would open the possibility to train machine learning models (e.g. with scikit-learn or PyTorch) directly via Ibis. As far as I know, duckdb unfortunately does not expose parallel Python UDAFs.
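To make the mapper / combiner / reducer idea a bit more concrete, here is a rough pure-Python sketch of what a parallel-friendly UDAF could look like for a simple mean. The names MeanState, update, combine and finalize are hypothetical and chosen only for illustration; this is not an existing Ibis or duckdb API, and the partitions are processed sequentially here rather than by parallel workers.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MeanState:
    """Partial aggregation state: enough to merge and finalize a mean."""
    total: float = 0.0
    count: int = 0


def update(state, chunk):
    # "Mapper" step: fold one chunk of values into a partial state.
    return MeanState(state.total + sum(chunk), state.count + len(chunk))


def combine(left, right):
    # "Combiner" step: merge two partial states computed independently.
    return MeanState(left.total + right.total, left.count + right.count)


def finalize(state):
    # "Reducer" step: turn the merged state into the final result.
    return state.total / state.count if state.count else float("nan")


# Hypothetical partitions, as a backend might hand them to separate workers.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partial_states = [update(MeanState(), chunk) for chunk in partitions]
mean_value = finalize(reduce(combine, partial_states, MeanState()))
```

The point of the combine step is that each partition only needs to ship a small state object, so the backend is free to compute the partial states in any order or in parallel before merging them.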
Final side-request: for backends that only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API that hides the Python function call overhead, similarly to what duckdb is doing internally with map?
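As an illustration of the calling convention I have in mind, here is a minimal pure-Python sketch. It does not generate any SQL and is not how duckdb or Ibis implement anything; the helpers iter_chunks and vectorize are hypothetical names, and the default chunk size of 1024 is only borrowed from the output shown above.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items from rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def vectorize(vector_udf, chunk_size=1024):
    """Wrap a chunk-level function so it is called once per chunk, not per row."""
    def apply(rows):
        out = []
        for chunk in iter_chunks(rows, chunk_size):
            result = vector_udf(chunk)
            if len(result) != len(chunk):
                raise ValueError("vector UDF must return one value per input row")
            out.extend(result)
        return out
    return apply


calls = []  # records the size of every chunk the UDF receives


def double_chunk(chunk):
    calls.append(len(chunk))
    return [2 * x for x in chunk]


doubled = vectorize(double_chunk)(range(10_000))
```

With 10,000 input rows the user function is invoked 10 times instead of 10,000 times, which is where the savings in Python call overhead would come from.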
Top GitHub Comments
@cpcloud I drafted a proposal for Python UDAFs API in a duckdb issue if you are interested: https://github.com/duckdb/duckdb/discussions/5117.
@ogrisel Thanks for the issue. This is definitely on our radar and we’ll probably start experimenting with support for this in the next month. In fact, the DuckDB folks just pointed us to .map a few weeks ago. So, in short, yes, there’s a great chance of this happening 😃
This is an interesting path for us to go down; it’s great to hear a concrete use case for UDAFs since getting them to work well with a nice API and solid performance will be challenging.
Possibly! I think we’ll need to do some prototyping before we can give a concrete yes or no to this.
Really appreciate all the issues you’re opening, it’s wonderful to get feedback from users ❤️