UDF usecase
Hi,
I have a use case where I need to write a UDF to decrypt a raw value. For instance, with a table and query like:
my_table:
id(int) | encrypted_value(varchar) | ...
1 | x3vsdf.sdsdf.sdfs3.kjdfkL | ...
2 ...
SELECT id, decrypt_udf(encrypted_value) FROM my_table
The tricky part is that decrypt_udf requires contacting an external service (an RPC that sends encrypted_value and receives the decrypted value). Therefore, I'd need to initialize a connection when the plugin is loaded and persist it, so that we don't pay the cost of connection initialization each time decrypt_udf is called by a worker.
I see that the simple scalar UDFs are all implemented as static functions. Is it correct to assume that our statically initialized connection will be reused whenever decrypt_udf is called? Basically, I am trying to figure out where to keep my connection and where to do the initialization so that it's not repeated each time.
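As a point of reference for the question above, here is a minimal Java sketch of how static state behaves in the JVM, independent of Presto. It uses the initialization-on-demand holder idiom, which the JVM guarantees to run once per class loader in a thread-safe way. DecryptClient, DecryptFunctions, and decryptUdf are hypothetical names, the "decryption" is a placeholder that reverses the string so the example runs offline, and a real Presto scalar function would additionally need the plugin's function annotations and types, which are omitted here. It also assumes the client is safe to share across threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an RPC client; not part of Presto.
final class DecryptClient {
    // Counts how many times a "connection" has been opened.
    static final AtomicInteger CONNECTIONS_OPENED = new AtomicInteger();

    DecryptClient() {
        // A real client would open a network connection here.
        CONNECTIONS_OPENED.incrementAndGet();
    }

    String decrypt(String encrypted) {
        // Placeholder: reverses the input instead of calling a remote service.
        return new StringBuilder(encrypted).reverse().toString();
    }
}

final class DecryptFunctions {
    private DecryptFunctions() {}

    // Holder idiom: the JVM initializes this nested class once,
    // on first access, even if several threads race to use it.
    private static final class ClientHolder {
        static final DecryptClient CLIENT = new DecryptClient();
    }

    // Assumption: a real plugin would expose this as an annotated scalar function.
    static String decryptUdf(String encryptedValue) {
        return ClientHolder.CLIENT.decrypt(encryptedValue);
    }

    public static void main(String[] args) {
        System.out.println(decryptUdf("cba"));
        System.out.println(decryptUdf("fed"));
        // Both calls above share the single client created on first use.
        System.out.println(DecryptClient.CONNECTIONS_OPENED.get());
    }
}
```

The holder delays connection setup until the first call rather than plugin load time. Whether that trade-off suits a given deployment is a design choice, not something the thread settles.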
Last question: obviously it's much more efficient to batch multiple decrypt requests into one RPC than to send them one at a time. Imagine my query is:
SELECT id, decrypt_udf(encrypted_value) FROM my_table LIMIT 100
This means decrypt_udf will be called 100 times. Do Presto workers call decrypt_udf sequentially, or is there any concurrency going on? If the calls are concurrent, I can keep the values in memory, batch them, and send them all together (blocking decrypt_udf until the result is read). If not, what's the proper way to batch these calls? Is a UDF even the right approach?
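To make the cost argument concrete, the following Java sketch shows the batching idea on its own: values are grouped into fixed-size chunks and each chunk costs one RPC. It says nothing about how Presto invokes scalar functions. BatchDecryptClient, decryptBatch, and decryptAll are hypothetical names, and the "RPC" is a local placeholder that reverses each string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch-capable client; a real one would issue one RPC per call.
final class BatchDecryptClient {
    int rpcCalls = 0;

    List<String> decryptBatch(List<String> encrypted) {
        rpcCalls++; // one round trip for the whole batch
        List<String> out = new ArrayList<>(encrypted.size());
        for (String value : encrypted) {
            // Placeholder "decryption" so the example runs offline.
            out.add(new StringBuilder(value).reverse().toString());
        }
        return out;
    }
}

final class BatchingSketch {
    private BatchingSketch() {}

    // Splits the input into chunks of at most batchSize and decrypts each chunk in one call.
    static List<String> decryptAll(BatchDecryptClient client, List<String> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> results = new ArrayList<>(values.size());
        for (int start = 0; start < values.size(); start += batchSize) {
            int end = Math.min(start + batchSize, values.size());
            results.addAll(client.decryptBatch(values.subList(start, end)));
        }
        return results;
    }
}
```

With 100 values and a batch size of 25, this sketch makes 4 calls to decryptBatch instead of 100 single-value calls.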
Thank you so much for your help.
Issue Analytics
- Created 4 years ago
- Comments:15 (7 by maintainers)
Top GitHub Comments
In general, we strongly discourage UDFs from contacting external services. Scalar functions are generally assumed to be quick and efficient, and making RPC calls is very far from that assumption. The implications of slow UDFs are also not well tested in Presto. One way to work around this is to use the Thrift connector: instead of calling decrypt_udf(encrypted_value), you can write a query that reads from decrypt_service, where decrypt_service is a table backed by the Thrift connector.
We are exploring ideas for supporting external-service UDFs, but that is probably a late 2019 or 2020 effort.
@bshafiee Not yet. I am going to create one when I have a more concrete idea about the design.
@rongrong Good point, but I think we should have the option for both deployment models. Generally speaking, we need to think about data locality: the Thrift server can be colocated with the workers or placed in the same rack.