Proposal: FlinkSQL supports partition transform by computed columns
Goal
Enable Flink SQL to create tables with hidden partitions.
Example
Create a table with hidden partitions:
CREATE TABLE tb (
  ts TIMESTAMP,
  id INT,
  prop STRING,
  par_ts AS days(ts),              -- partition transform: day
  par_prop AS truncates(6, prop)   -- partition transform: truncate
) PARTITIONED BY (
  par_ts, id, par_prop             -- transform and identity partition keys
);
Supported Functions
years(col)
months(col)
days(col)
hours(col)
truncates(width, col)
buckets(width, col)
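These UDFs would mirror Iceberg's partition transform semantics. As a hedged sketch of a few of them (the behavior shown assumes Iceberg's epoch-based definitions; `buckets` is omitted because it requires Iceberg's 32-bit Murmur3 hash, which Python's built-in hash does not match):

```python
from datetime import datetime, timezone

# Illustration of the transform semantics the proposed UDFs would follow,
# assuming Iceberg's definitions (days/hours since the Unix epoch, prefix
# truncation). This is a sketch, not the proposed implementation.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days(ts: datetime) -> int:
    """Number of whole days between the Unix epoch and ts."""
    return (ts - EPOCH).days

def hours(ts: datetime) -> int:
    """Number of whole hours between the Unix epoch and ts."""
    return int((ts - EPOCH).total_seconds() // 3600)

def truncates(width: int, value: str) -> str:
    """Truncate a string to its first `width` characters."""
    return value[:width]

# buckets(width, col) would apply a 32-bit Murmur3 hash modulo `width`;
# it is left out here deliberately.
```

Under these definitions every timestamp within the same calendar day maps to the same `days` value, which is what makes the column usable as a hidden partition key.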
Solution
- We create a corresponding UDF for each partition transform and register them in the catalog, so these functions can be used directly.
- Restriction: because general computed columns are not currently supported, any computed column in the DDL must appear in the partition keys. (PR #4625 adds computed-column support; once that feature lands, this restriction is no longer needed.)
- By analyzing the computed column's expression, we can derive all the information needed for the partition key:
- The computed column's name becomes the partition key's column name.
- The function name in the computed column's expression maps to the transform expression.
- The arguments in the computed column's expression correspond to the arguments of the transform expression: the source column and, where applicable, the width.
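As a rough illustration of that analysis step (the helper name `parse_partition_transform` and the regex here are assumptions for this sketch, not part of the proposal):

```python
import re

# Hypothetical helper: split a computed-column expression such as
# "days(ts)" or "truncates(6, prop)" into the pieces needed to build
# a partition field: (partition column name, transform name, arguments).
_EXPR = re.compile(r"^(\w+)\s*\(\s*([^)]*)\)$")

def parse_partition_transform(column_name, expression):
    """Return (partition_field_name, transform_name, args)."""
    m = _EXPR.match(expression.strip())
    if m is None:
        raise ValueError(f"not a transform expression: {expression!r}")
    func, raw_args = m.groups()
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return column_name, func, args
```

For example, `parse_partition_transform("par_prop", "truncates(6, prop)")` yields the column name `par_prop`, the transform `truncates`, and the arguments `["6", "prop"]`, which map directly onto a transform partition key.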
Issue Analytics
- State:
- Created a year ago
- Comments: 12 (5 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
So far, I want to say that I like a lot about where this is going @wuwenchi.
I think working on the UDF transforms would be a good first step (as those would have benefit regardless).
My one concern here is that users often use one engine, say Flink, for writing and then other engines later for processing. It might be the case that people want these computed columns reflected in the data (possibly even stored, though for partition transforms and partitions in general, a partition field's value isn't usually stored multiple times).
It might be the case we cannot do certain things other than with Flink. Watermarks might be one of those. Though there could be steps we could take to make as much of this information available to downstream consumers as possible.
For example, it has been discussed before to use the iceberg sequence ID as a form of a watermark (as it’s generally monotonically increasing). While other engines might not have native support for watermarks in them, at least having the data available would be beneficial.
TLDR - Again, I think the UDFs you mentioned would be great to work on first, as those have value regardless of how we proceed next (for example, users might want to query with Iceberg’s bucket function as a way to more narrowly specify a subset of data to perform an action on).
I don’t have much knowledge of Calcite either, but in the more medium to long term, I think it would be good to reach out to the Flink community to possibly have some of these concepts more natively supported.
My concern with using names directly to infer the function is that many users might have columns with those names (_years, etc) already in the data.
But with the UDF approach, that issue goes away, since users can already choose the partition column name, for example in Spark:
ALTER TABLE … ADD PARTITION FIELD bucket(16, id) AS shard
Would you be interested in doing a POC or PR of just the transformation functions at first? Then at the community sync-up we can possibly bring this up and form a working group to get input from others on this subject 🙂
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label; commenting on the issue is preferred when possible.