
Proposal: FlinkSQL supports partition transform by computed columns


Goal

Allow Flink SQL to create tables with hidden partitions, i.e. partition transforms expressed through computed columns.

Example

Create a table with hidden partitions:

CREATE TABLE tb (
  ts TIMESTAMP,
  id INT,
  prop STRING,
  par_ts AS days(ts),                -- partition transform: day
  par_prop AS truncates(6, prop)     -- partition transform: truncate
) PARTITIONED BY (
  par_ts, id, par_prop               -- mix of transform and identity partitions
);

Supported Functions

years(col)
months(col)
days(col)
hours(col)
truncates(width, col)
buckets(width, col)
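
For illustration, here is a minimal sketch of how one of these transforms could be written as a Flink ScalarFunction. The class name is hypothetical; the arithmetic mirrors the semantics of Iceberg's day transform (days since the Unix epoch):

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical sketch of the days(col) transform UDF: maps a timestamp to
// the number of days since 1970-01-01, like Iceberg's day transform.
public class DaysFunction extends ScalarFunction {

  public Integer eval(LocalDateTime ts) {
    if (ts == null) {
      return null;
    }
    // Count whole days between the Unix epoch and the timestamp's date.
    return (int) ChronoUnit.DAYS.between(LocalDate.ofEpochDay(0), ts.toLocalDate());
  }
}

Once registered (for example with tableEnv.createTemporarySystemFunction("days", DaysFunction.class), or persistently through a catalog), the function can be used directly in the DDL above.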

Solution

  1. We create a corresponding UDF for each partition transform and register it in the catalog, so these functions can be used directly in DDL (see the ScalarFunction sketch above).
  2. Restriction: because computed columns are not currently supported, any computed column in the DDL must be a partition key. (PR #4625 adds support for computed columns; once that feature lands, this restriction is no longer needed.)
  3. By analyzing the computed column's expression, we can recover all the information about the partition key (a sketch follows this list):
  • The computed column's name becomes the partition key's column name.
  • The function name in the expression maps to the transform.
  • The expression's arguments supply the transform's arguments: the source column and, where applicable, the width.
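
To make step 3 concrete, the following is a minimal, hypothetical sketch of mapping the analyzed expressions onto an Iceberg PartitionSpec. Only PartitionSpec.Builder is a real Iceberg API; ParsedTransform and everything around it are assumptions for illustration, not the actual implementation:

import java.util.List;
import java.util.Map;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;

public class ComputedColumnPartitionSpec {

  // Result of analyzing one computed column, e.g. par_ts AS days(ts).
  static final class ParsedTransform {
    final String function;      // "years", "months", "days", "hours", "truncates", "buckets"
    final String sourceColumn;  // physical column the transform reads
    final int width;            // only meaningful for truncates/buckets

    ParsedTransform(String function, String sourceColumn, int width) {
      this.function = function;
      this.sourceColumn = sourceColumn;
      this.width = width;
    }
  }

  static PartitionSpec build(Schema schema,
                             List<String> partitionKeys,
                             Map<String, ParsedTransform> computedColumns) {
    PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
    for (String key : partitionKeys) {
      ParsedTransform t = computedColumns.get(key);
      if (t == null) {
        // A plain physical column in PARTITIONED BY becomes an identity partition.
        builder.identity(key);
        continue;
      }
      switch (t.function) {
        case "years":     builder.year(t.sourceColumn, key);              break;
        case "months":    builder.month(t.sourceColumn, key);             break;
        case "days":      builder.day(t.sourceColumn, key);               break;
        case "hours":     builder.hour(t.sourceColumn, key);              break;
        case "truncates": builder.truncate(t.sourceColumn, t.width, key); break;
        case "buckets":   builder.bucket(t.sourceColumn, t.width, key);   break;
        default:
          throw new IllegalArgumentException("Unknown transform: " + t.function);
      }
    }
    return builder.build();
  }
}

For the example DDL above, this would yield a spec equivalent to day(ts) as par_ts, identity(id), and truncate(prop, 6) as par_prop.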


Top GitHub Comments

1 reaction
kbendick commented, Jun 13, 2022

So far, I want to say that I like a lot about where this is going @wuwenchi.

I think working on the UDF transforms would be a good first step (as those would have benefit regardless).

> I think this has no effect on other engines, because watermarks and computed columns do not actually store data; they just add some logical processing at query time, and that logic only takes effect in Flink. What is actually stored is the data of the original physical columns, and the underlying format is unchanged.

My one concern here would be that users often use one engine, say Flink, for writing and then other engines later for processing. It might be the case that people want these computed columns reflected in the data (possibly even stored, though for partition transforms and partitions in general, the partition field's value isn't typically stored multiple times).

It might be the case that certain things cannot be done other than with Flink. Watermarks might be one of those. Though there could be steps we could take to make as much of this information as possible available to downstream consumers.

For example, using the Iceberg sequence ID as a form of watermark has been discussed before (as it's generally monotonically increasing). While other engines might not have native support for watermarks, at least having the data available would be beneficial.

TL;DR: Again, I think the UDFs you mentioned would be great to work on first, as they have value regardless of how we proceed next (for example, users might want to query with Iceberg's bucket function as a way to more narrowly specify a subset of data to act on).

> Of course, maybe the above things can be implemented with Calcite, but since I am not particularly familiar with Calcite, the implementation may be more complicated, and it may also require cooperation from Flink, so we prefer this simpler approach.

I don’t have much knowledge of Calcite either, but in the medium to long term, I think it would be good to reach out to the Flink community to have some of these concepts supported more natively.

My concern with using names directly to infer the function is that many users might already have columns with those names (_years, etc.) in their data.

But with the UDF approach, that issue goes away (as users can already choose the partition column name, for example in Spark using ALTER TABLE … ADD PARTITION FIELD bucket(16, id) AS shard).

Would you be interested in doing a POC or PR of just the transformation functions at first? Then at the community sync up we can possibly bring this up and form a working group to get input from others on this subject 🙂

0 reactions
github-actions[bot] commented, Dec 12, 2022

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label ‘not-stale’, but commenting on the issue is preferred when possible.
