Hi,

I have a use case where I need to write a UDF to decrypt a raw value. For instance, the table and query would look like:

my_table:

  id(int) |  encrypted_value(varchar)  |  ...
      1   |  x3vsdf.sdsdf.sdfs3.kjdfkL |  ...
      2 ...

 SELECT id, decrypt_udf(encrypted_value) FROM my_table

The tricky part is that decrypt_udf requires contacting an external service (an RPC that sends encrypted_value and receives the decrypted value). Therefore, I’d need to initialize a connection when the plugin is loaded and keep it open, so that we don’t pay the cost of connection initialization each time decrypt_udf is called by a worker.

I see that the simple scalar UDFs are all implemented as static functions. Is it correct to assume that a statically initialized connection will be reused whenever decrypt_udf is called? Basically, I am trying to figure out where to keep my connection and where to do the initialization so that it’s not repeated on each call.
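For illustration, here is a minimal sketch of the "initialize once, reuse on every call" idea in plain Java. It does not use any Presto API: `DecryptClient`, `DecryptFunctions`, `decryptUdf` and the fake `connect()` are hypothetical names, and the "RPC" just reverses the string. It only shows that state held in a static holder is created once per class loader and then reused. Whether this is appropriate inside a worker, and the fact that such a client would have to be thread-safe, are assumptions to verify separately.

```java
import java.util.concurrent.atomic.AtomicInteger;

class StaticConnectionSketch {
    public static void main(String[] args) {
        // Simulate 100 row-level invocations of the function.
        for (int i = 0; i < 100; i++) {
            DecryptFunctions.decryptUdf("value-" + i);
        }
        // prints "connections opened: 1"
        System.out.println("connections opened: " + DecryptFunctions.CONNECTIONS_OPENED.get());
    }
}

// Hypothetical stand-in for the RPC client; not part of Presto.
interface DecryptClient {
    String decrypt(String encryptedValue);
}

final class DecryptFunctions {
    // Counts how many times the (assumed expensive) connection setup runs.
    static final AtomicInteger CONNECTIONS_OPENED = new AtomicInteger();

    private DecryptFunctions() {}

    // Initialization-on-demand holder: the JVM runs this initializer once,
    // in a thread-safe way, the first time Holder.CLIENT is read.
    private static final class Holder {
        static final DecryptClient CLIENT = connect();
    }

    // Placeholder for real connection setup; the fake client reverses its input.
    private static DecryptClient connect() {
        CONNECTIONS_OPENED.incrementAndGet();
        return encrypted -> new StringBuilder(encrypted).reverse().toString();
    }

    // Shape of a static scalar function body: every call reuses Holder.CLIENT.
    static String decryptUdf(String encryptedValue) {
        return Holder.CLIENT.decrypt(encryptedValue);
    }
}
```

The holder idiom defers the connection until the first call rather than plugin load time. A plain static field initializer would work the same way if eager setup is preferred.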

Last question: obviously it’s much more efficient to batch multiple decrypt requests into one RPC than to send them one at a time. Consider this query:

 SELECT id, decrypt_udf(encrypted_value) FROM my_table LIMIT 100

This means decrypt_udf will be called 100 times. Do Presto workers call decrypt_udf sequentially, or is there any concurrency going on? If the calls are concurrent, I can keep the values in memory, batch them, and send them all together (blocking decrypt_udf until the result is read). If not, what’s the proper way to batch these calls? Is a UDF even the right approach?
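To make the batching idea concrete, here is a rough sketch in plain Java of a collector that queues values and resolves them with a single batched call. `DecryptBatcher`, `submit`, `flush` and the fake batch RPC are all hypothetical; nothing here is a Presto API, and `flush()` is triggered by hand to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class BatchingSketch {
    public static void main(String[] args) {
        // Fake batch RPC (assumption): "decrypts" by reversing each string.
        DecryptBatcher batcher = new DecryptBatcher(batch -> {
            List<String> out = new ArrayList<>();
            for (String s : batch) {
                out.add(new StringBuilder(s).reverse().toString());
            }
            return out;
        });

        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            futures.add(batcher.submit("value-" + i));
        }
        batcher.flush(); // one RPC resolves all 100 pending values

        // prints "rpc calls: 1"
        System.out.println("rpc calls: " + batcher.getRpcCount());
        // prints "0-eulav"
        System.out.println(futures.get(0).join());
    }
}

final class DecryptBatcher {
    private final Function<List<String>, List<String>> batchRpc;
    private final List<String> pendingInputs = new ArrayList<>();
    private final List<CompletableFuture<String>> pendingResults = new ArrayList<>();
    private int rpcCount = 0;

    DecryptBatcher(Function<List<String>, List<String>> batchRpc) {
        this.batchRpc = batchRpc;
    }

    // Queue a value and return a future that completes on the next flush.
    synchronized CompletableFuture<String> submit(String encrypted) {
        CompletableFuture<String> result = new CompletableFuture<>();
        pendingInputs.add(encrypted);
        pendingResults.add(result);
        return result;
    }

    // Send everything queued so far in one call and complete the futures in order.
    synchronized void flush() {
        if (pendingInputs.isEmpty()) {
            return;
        }
        List<String> outputs = batchRpc.apply(new ArrayList<>(pendingInputs));
        rpcCount++;
        for (int i = 0; i < outputs.size(); i++) {
            pendingResults.get(i).complete(outputs.get(i));
        }
        pendingInputs.clear();
        pendingResults.clear();
    }

    synchronized int getRpcCount() {
        return rpcCount;
    }
}
```

Note the catch when this is used from a synchronous function: a caller that blocks on its future right after `submit` cannot queue a second value until the first is resolved. Batches would therefore only grow if several threads call concurrently and something else (a timer or size threshold, not shown) triggers the flush.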

Thank you so much for your help.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

2 reactions
rongrong commented, Apr 16, 2019

In general, we strongly discourage UDFs from contacting external services. Scalar functions are generally assumed to be quick and efficient, and making RPC calls is very far from that assumption. The implications of slow UDFs are also not well tested in Presto. One way to work around this is to use the Thrift connector: instead of calling decrypt_udf(encrypted_value), you can write something like

SELECT id, decrypted_value
FROM my_table t
JOIN decrypt_service d
ON t.encrypted_value = d.encrypted_value

Here, decrypt_service is a table backed by the Thrift connector.

We are exploring ideas for supporting external-service UDFs, but that’s probably a late 2019 or 2020 effort.
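One way to see why the join form can help with batching: a join exposes the whole set of keys to the other side at once, instead of one value per function call. The sketch below imitates that in plain Java with an inner join driven by a single batched lookup. It says nothing about how the Thrift connector actually executes the join. `LookupJoinSketch`, `Row`, `join` and the fake lookup service are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class LookupJoinSketch {
    // One row of my_table: id and encrypted_value.
    static final class Row {
        final int id;
        final String encryptedValue;

        Row(int id, String encryptedValue) {
            this.id = id;
            this.encryptedValue = encryptedValue;
        }
    }

    // Inner join of rows against a lookup service, using one batched call.
    static List<String> join(List<Row> rows, Function<Set<String>, Map<String, String>> lookup) {
        // Collect the distinct join keys first.
        Set<String> keys = new LinkedHashSet<>();
        for (Row row : rows) {
            keys.add(row.encryptedValue);
        }
        // Single batched request for all keys.
        Map<String, String> decrypted = lookup.apply(keys);
        List<String> joined = new ArrayList<>();
        for (Row row : rows) {
            String value = decrypted.get(row.encryptedValue);
            if (value != null) { // inner join: rows without a match are dropped
                joined.add(row.id + ":" + value);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        int[] lookupCalls = {0};
        // Fake service (assumption): "decrypts" by reversing each key.
        Function<Set<String>, Map<String, String>> fakeService = keys -> {
            lookupCalls[0]++;
            Map<String, String> result = new HashMap<>();
            for (String key : keys) {
                result.put(key, new StringBuilder(key).reverse().toString());
            }
            return result;
        };
        List<Row> rows = Arrays.asList(new Row(1, "abc"), new Row(2, "xyz"), new Row(3, "abc"));
        System.out.println(join(rows, fakeService));
        // prints "lookup calls: 1"
        System.out.println("lookup calls: " + lookupCalls[0]);
    }
}
```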

1 reaction
cemcayiroglu commented, Apr 17, 2019

@bshafiee Not yet. I am going to create one when I have a more concrete idea about the design.

@rongrong Good point. But I think we should have the option of both deployment models. Generally speaking, we need to think about data locality: the Thrift server can be colocated with the workers or placed in the same rack.
