[FEATURE REQUEST]: UDF to return custom business objects


I have a C# component that currently runs in Azure Data Lake, and I am planning to move to Spark and reuse the same component. My example scenario:

The C# component takes a Manager dataset as input, for example:

mgrId name
11 ABC
22 DEF

The C# component returns a List<Reportee>, where Reportee is defined as class Reportee { public int EmpId; public string Name; public string Role; public int MgrId; }

Reportee dataset

empId name role mgrId
100 pqr admin 11
200 stu reader 11
300 wxy reader 22

Intended UDF:

var udf = Udf<int, List<Reportee>>(mgrId => component.Execute(mgrId));

For each row in my Manager dataset, I have to call the UDF to get the final result in Spark as:

mgrId mgrname empname empid Role
11 ABC pqr 100 admin
11 ABC stu 200 reader
22 DEF wxy 300 reader
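
To make the requested result concrete, here is a minimal sketch in plain C# (LINQ only, no Spark API) of the "one manager row in, many reportee rows out" flattening described above. The `Execute` method is a hypothetical stand-in for `component.Execute(mgrId)`, the `Flatten` helper is invented for illustration, and the data is the sample from this issue. It only shows the intended semantics and is not how a Spark UDF would be written.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Reportee
{
    public int EmpId;
    public string Name;
    public string Role;
    public int MgrId;
}

public static class Program
{
    // Sample Reportee data copied from the issue text.
    static readonly List<Reportee> AllReportees = new List<Reportee>
    {
        new Reportee { EmpId = 100, Name = "pqr", Role = "admin",  MgrId = 11 },
        new Reportee { EmpId = 200, Name = "stu", Role = "reader", MgrId = 11 },
        new Reportee { EmpId = 300, Name = "wxy", Role = "reader", MgrId = 22 },
    };

    // Hypothetical stand-in for component.Execute(mgrId).
    public static List<Reportee> Execute(int mgrId) =>
        AllReportees.Where(r => r.MgrId == mgrId).ToList();

    // For each manager, call Execute and emit one output row per reportee.
    public static List<string> Flatten()
    {
        var managers = new[] { (MgrId: 11, Name: "ABC"), (MgrId: 22, Name: "DEF") };
        return managers
            .SelectMany(
                m => Execute(m.MgrId),
                (m, r) => $"{m.MgrId} {m.Name} {r.Name} {r.EmpId} {r.Role}")
            .ToList();
    }

    public static void Main()
    {
        foreach (var line in Flatten())
        {
            Console.WriteLine(line);
        }
    }
}
```

The columns of each emitted line follow the order mgrId, mgrname, empname, empid, Role used in the table above.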

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 25 (16 by maintainers)

Top GitHub Comments

2 reactions
imback82 commented, May 22, 2019

A UDT (user-defined type) as the return type of a UDF will not be supported. (The UDT API in Spark has been private since 2.0, and related PRs have not gained much traction.)

However, we plan to achieve something similar using StructType. This is how it’s done in PySpark.

I did a quick prototype, and it looks like the following:

var schema = new StructType(new[] {
    new StructField("col1", new IntegerType()),
    new StructField("col2", new StringType()) });
// The schema is hard-coded inside Udf<> for POC, but "Row" class will be used to embed
// schema and object[].
var udf = Udf<string, object[]>((str) => new object[] { 1, "abc" });

// Assume that we have a df that has:
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
var udfDf = df.Select(udf(df["name"]).As("udf_col"));

// PrintSchema() prints:
// root
// |-- udf_col: struct (nullable = true)
// |    |-- col1: integer (nullable = false)
// |    |-- col2: string (nullable = true)
udfDf.PrintSchema();

// Show() prints:
// +--------+
// | udf_col|
// +--------+
// |[1, abc]|
// |[1, abc]|
// |[1, abc]|
// +--------+
udfDf.Show();

// Flatten nested column as follows:
// +----+----+
// |col1|col2|
// +----+----+
// |   1| abc|
// |   1| abc|
// |   1| abc|
// +----+----+
udfDf.Select(udfDf["udf_col.col1"], udfDf["udf_col.col2"]).Show();
// or
udfDf.Select("udf_col.*").Show();

This feature should be available in the coming weeks.

1 reaction
imback82 commented, Dec 10, 2019

Why not something like Udf_SomeOtherName<T>(T t, Schema s) for UDFs that return Row? It’s awkward to allow schema for almost all other types that don’t even need it.

So we have two options:

  1. Have a different name for UDF that returns Row type. The API is clear, but the downside is we need a separate set; you cannot specialize on generic types in C#.
  2. Allow specifying schema for UDF, but we need to make sure there is no confusion on Row and GenericRow, especially if you are going from an object with less info to more info (GenericRow -> Row, how can it logically be possible?)

We can explore both options and get some early feedback.
