Proposal: Support partition with transform for Flink SQL
For now, only the identity partition transform is supported with Flink SQL. Supporting the other generic transforms directly would require Flink SQL to allow defining partitions with transforms, which depends on support in the Flink engine and the corresponding transformers. This issue proposes another way to support partitioning with transforms: composing computed columns with the partition list in Flink.
The basic idea is to define the partition field as a computed column in Flink, with a mapping between the Flink expression and the Iceberg transform (and vice versa), and then use that column in the partition list.
The main concern is how to map expressions between Flink and Iceberg. User-defined functions (UDFs) can resolve this: Flink's built-in functions may not cover all of Iceberg's partition transforms, but Iceberg can provide a corresponding UDF for every partition transform through its catalog.
When creating a table, if a computed column used in the partition list can be mapped to a partition transform, Iceberg will save it as a partition field with the following rules:
- the column name is used as the partition field name;
- the parameter column name is used as the source field name.

Other kinds of computed columns remain unsupported.
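As an illustrative sketch of these rules (the `days` UDF name is an assumption here; in the proposal, Iceberg would supply the actual transform UDFs through its catalog), a CREATE TABLE might look like:

```sql
-- Hypothetical Iceberg-provided UDF `days`, mirroring the Iceberg
-- `days(ts)` partition transform.
CREATE TABLE logs (
  id     BIGINT,
  ts     TIMESTAMP(6),
  ts_day AS days(ts)   -- computed column backed by the transform UDF
) PARTITIONED BY (ts_day);

-- By the rules above, Iceberg would persist a partition field with:
--   field name = ts_day  (the column name)
--   source     = ts      (the UDF's parameter column)
--   transform  = days
```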
When loading a table, each partition field is mapped to a computed column and an element of the partition list with the following rules:
- the partition field name is used as the column name;
- the transform is mapped to a UDF invocation expression for the column expression;
- the type of the partition field is used as the type of the column.
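Conversely, loading would reconstruct the Flink DDL from the partition spec. A sketch, assuming an existing table whose spec is `bucket(16, id)` with partition field name `id_bucket` (the `bucket` UDF and its signature are assumptions):

```sql
-- The Flink schema surfaced for the existing Iceberg table:
CREATE TABLE orders (
  id        BIGINT,
  id_bucket AS bucket(16, id)  -- transform mapped back to a UDF invocation;
                               -- column type = partition field type (INT)
) PARTITIONED BY (id_bucket);
```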
Issue Analytics
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
I think we’ve been stuck on Flink integrations such as computed columns for too long. In Flink’s streaming computation, computed columns and the like are a hard requirement. I agree with @kbendick that we should raise it at the next community meeting and form a design team to finish it as soon as possible. Any progress is better than standing still. So far, I’ve seen three or four different proposals; we should focus on them, settle on one workable design, and get it done. Without this support, Flink streaming tasks simply don’t work.
Fully understood. I’m glad this discussion is coming up.
I think a few people have given a few different ideas, and some combination of them will likely wind up being best. For example, I really like @wuwenchi’s idea of using UDFs for transform functions rather than relying on the column names.
One concern with column names is that they would break behavior for users who already have fields with those names (e.g. a data field called ts_month that isn’t intended as a partition transform). I know that’s generally how partition transform fields wind up being named, but we don’t rely on the name; we rely on the PartitionSpec and its associated PartitionFields, and users are free to rename the partition columns as they choose.

But I’m glad this is being brought up. Possibly this can be raised at the next community sync, and then a small working group or anybody interested could come to some design meetings? Throwing that out there as a consideration.
I appreciate you taking the time to propose this, @yittg (as well as the work done by @wuwenchi, @hililiwei, and others; I apologize if I missed anyone).