UDF usecase

Hi,

I have a use case where I need to write a UDF to decrypt a raw value. For instance, the table and query would look like:

my_table:

  id(int) |  encrypted_value(varchar)  |  ...
      1   |  x3vsdf.sdsdf.sdfs3.kjdfkL |  ...
      2 ...

 SELECT id, decrypt_udf(encrypted_value) FROM my_table

The tricky part is that decrypt_udf requires contacting an external service (an RPC that sends encrypted_value and receives the decrypted value). Therefore, I’d need to initialize a connection when the plugin is loaded and persist it, so that we don’t pay the cost of connection initialization each time decrypt_udf is called by a worker.

I see that the simple scalar UDFs are all implemented as static functions. Is it correct to assume that a statically initialized connection will be reused whenever decrypt_udf is called? Basically, I am trying to figure out where to keep my connection and where to do the initialization so that it’s not repeated each time.
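
For illustration, the static-connection idea asked about here could be sketched roughly as follows. This is only a sketch under assumptions: `DecryptClient`, `DecryptFunctions`, and `decryptUdf` are hypothetical names, the "decryption" is a placeholder that reverses the string, and the Presto function annotations are left out so the snippet stands alone. It shows the Java initialization-on-demand holder idiom, which creates the client once per class loader; it does not say anything about how Presto schedules calls.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the RPC client; not part of Presto.
final class DecryptClient {
    // Counts how many "connections" were opened, for demonstration only.
    static final AtomicInteger CONNECTIONS_OPENED = new AtomicInteger();

    DecryptClient() {
        // A real client would open its RPC channel here.
        CONNECTIONS_OPENED.incrementAndGet();
    }

    String decrypt(String encrypted) {
        // Placeholder logic: reverses the input instead of calling a service.
        return new StringBuilder(encrypted).reverse().toString();
    }
}

final class DecryptFunctions {
    private DecryptFunctions() {}

    // Initialization-on-demand holder: the JVM initializes Holder exactly once,
    // in a thread-safe way, the first time Holder.CLIENT is accessed.
    private static final class Holder {
        static final DecryptClient CLIENT = new DecryptClient();
    }

    // In a real plugin this static method would carry the scalar function
    // annotations; they are omitted to keep the sketch self-contained.
    static String decryptUdf(String encryptedValue) {
        return Holder.CLIENT.decrypt(encryptedValue);
    }
}
```

With this shape, repeated calls to `decryptUdf` share one client instance, so the client itself would need to be safe to use from multiple threads.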

Last question: obviously it’s much more efficient to batch multiple decrypt requests into one RPC than to send them one at a time. Consider this query:

 SELECT id, decrypt_udf(encrypted_value) FROM my_table LIMIT 100

This means decrypt_udf will be called 100 times. Do Presto workers call decrypt_udf sequentially, or is there any concurrency going on? If the calls are concurrent, I can keep the requests in memory, batch them, and send them all together (blocking decrypt_udf until the result is read). If not, what’s the proper way to batch these calls? Is a UDF even the right approach?

Thank you so much for your help.

Issue Analytics

State:
Created 4 years ago
Comments:15 (7 by maintainers)

Top GitHub Comments

rongrong commented, Apr 16, 2019

In general, we strongly discourage UDFs from contacting external services. Scalar functions are generally assumed to be quick and efficient, and making RPC calls is very far from that assumption. The implications of slow UDFs are also not well tested in Presto. One way to work around this is to use the Thrift connector. So instead of using decrypt_udf(encrypted_value), you can write something like

SELECT id, decrypted_value
FROM my_table t
JOIN decrypt_service d
on t.encrypted_value = d.encrypted_value

where decrypt_service is a table backed by the Thrift connector.
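
To illustrate why a join against a service-backed table lends itself to batching, here is a rough sketch of the lookup shape on the service side. Everything in it is an assumption for illustration: `BatchDecryptService`, `lookup`, and `remoteDecryptAll` are hypothetical names, not Presto or Thrift connector APIs, the "decryption" just reverses the string, and it assumes the service is handed a batch of join keys at a time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical service behind the decrypt_service table; not a Presto API.
final class BatchDecryptService {
    private int rpcCalls = 0;

    // Stand-in for one remote round trip that decrypts many values at once.
    private List<String> remoteDecryptAll(List<String> encrypted) {
        rpcCalls++;
        List<String> out = new ArrayList<>();
        for (String e : encrypted) {
            // Placeholder "decryption": reverse the string.
            out.add(new StringBuilder(e).reverse().toString());
        }
        return out;
    }

    // Resolves a batch of join keys with a single round trip,
    // deduplicating the keys first so each value is decrypted once.
    Map<String, String> lookup(List<String> encryptedKeys) {
        Set<String> distinct = new LinkedHashSet<>(encryptedKeys);
        List<String> keys = new ArrayList<>(distinct);
        List<String> values = remoteDecryptAll(keys);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            result.put(keys.get(i), values.get(i));
        }
        return result;
    }

    // Number of simulated round trips made so far.
    int rpcCalls() {
        return rpcCalls;
    }
}
```

The point of the sketch is only the contrast with a per-row scalar call: a whole batch of encrypted_value keys is resolved in one round trip rather than one RPC per row.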

We are exploring ideas for supporting external-service UDFs, but that’s probably a late 2019 or 2020 effort.

cemcayiroglu commented, Apr 17, 2019

@bshafiee Not yet. I am going to create one when I have a more concrete idea about the design.

@rongrong Good point, but I think we should have the option for both deployment models. Generally speaking, we need to think about data locality: the Thrift server can be colocated with the workers or placed in the same rack.
