Proposal: Support partition with transform for Flink SQL
For now, only the identity partition transform is supported with Flink SQL. Supporting the other generic transforms directly would require Flink SQL to allow defining partitions with transforms, which depends on support in the Flink engine and the corresponding transformers. This issue proposes another way to support partitioning with transforms: composing computed columns with the partition list in Flink.
The basic idea is to define the partition field as a computed column in Flink, with a mapping between the Flink expression and the Iceberg transform (and vice versa), and then use that column in the partition list.
The main concern is how to map expressions between Flink and Iceberg. User-defined functions (UDFs) can resolve this: Flink's built-in functions may not cover all of Iceberg's partition transforms, but Iceberg can provide a corresponding UDF for every partition transform through its catalog.
When creating a table, if a computed column used in the partition list can be mapped to a partition transform, Iceberg will save it as a partition field with the following rules:
- the column name is used as the partition field name;
- the parameter column name is used as the source field name.

Other kinds of computed columns remain unsupported.
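As an illustrative sketch of these rules (the `days` UDF name is an assumption here; in the proposal, Iceberg would supply the actual transform UDFs through its catalog), a CREATE TABLE might look like:

```sql
-- Hypothetical Iceberg-provided UDF `days`, mirroring the Iceberg
-- `days(ts)` partition transform.
CREATE TABLE logs (
  id     BIGINT,
  ts     TIMESTAMP(6),
  ts_day AS days(ts)   -- computed column backed by the transform UDF
) PARTITIONED BY (ts_day);

-- By the rules above, Iceberg would persist a partition field with:
--   field name = ts_day  (the column name)
--   source     = ts      (the UDF's parameter column)
--   transform  = days
```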
When loading a table, each partition field is mapped to a computed column and an element of the partition list with the following rules:
- the partition field name is used as the column name;
- the transform is mapped to a UDF invocation expression for the column expression;
- the type of the partition field is used as the type of the column.
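Conversely, loading would reconstruct the Flink DDL from the partition spec. A sketch, assuming an existing table whose spec is `bucket(16, id)` with partition field name `id_bucket` (the `bucket` UDF and its signature are assumptions):

```sql
-- The Flink schema surfaced for the existing Iceberg table:
CREATE TABLE orders (
  id        BIGINT,
  id_bucket AS bucket(16, id)  -- transform mapped back to a UDF invocation;
                               -- column type = partition field type (INT)
) PARTITIONED BY (id_bucket);
```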
Issue Analytics
- Created 2 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
I think we’ve been stuck on Flink integrations such as computed columns for too long. In Flink’s streaming computation, computed columns and the like are a hard requirement. I agree with @kbendick that we should raise it at the next community meeting and form a design team to finish it as soon as possible. Any progress is better than standing still. So far, I’ve seen three or four different proposals; we should focus on them, settle on one workable design, and get it done. Without this support, Flink streaming tasks simply don’t work.
Fully understood. I’m glad this discussion is coming up.
I think a few people have given a few different ideas, and some combination of them will likely wind up being best. For example, I really like @wuwenchi’s idea of using UDFs for transform functions rather than relying on the column names.
One concern with column names is that they would break behavior for users who already have fields with those names (e.g. a data field called ts_month that isn’t intended as a partition transform). I know that’s generally how partition transform fields wind up being named, but we don’t rely on the name; we rely on the PartitionSpec and its associated PartitionFields, and users are free to rename the partition columns as they choose.

But I’m glad this is being brought up. Possibly this can be raised at the next community sync, and then a small working group or anybody interested could come to some design meetings? Throwing that out there as a consideration.
I appreciate you taking the time to propose this, @yittg (as well as the work done by @wuwenchi, @hililiwei, and others; I apologize if I missed anyone).