
Proposal: FlinkSQL supports partition transform by computed columns


Goal

Allow Flink SQL to create tables with hidden partitions, i.e. partition transforms expressed through computed columns.

Example

Create a table with hidden partitions:

CREATE TABLE tb (
  ts TIMESTAMP,
  id INT,
  prop STRING,
  par_ts AS days(ts),                -- partition transform: day
  par_prop AS truncates(6, prop)     -- partition transform: truncate
) PARTITIONED BY (
  par_ts, id, par_prop               -- mix of transform and identity partitions
);

Supported Functions

years(col)
months(col)
days(col)
hours(col)
truncates(width, col)
buckets(width, col)
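
For illustration, here is a minimal sketch of how one of these transforms could be written as a Flink ScalarFunction. The class name is hypothetical; the arithmetic mirrors the semantics of Iceberg's day transform (days since the Unix epoch):

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical sketch of the days(col) transform UDF: maps a timestamp to
// the number of days since 1970-01-01, like Iceberg's day transform.
public class DaysFunction extends ScalarFunction {

  public Integer eval(LocalDateTime ts) {
    if (ts == null) {
      return null;
    }
    // Count whole days between the Unix epoch and the timestamp's date.
    return (int) ChronoUnit.DAYS.between(LocalDate.ofEpochDay(0), ts.toLocalDate());
  }
}

Once registered (for example with tableEnv.createTemporarySystemFunction("days", DaysFunction.class), or persistently through a catalog), the function can be used directly in the DDL above.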

Solution

  1. We create a corresponding UDF for each partition transform and register it in the catalog, so these functions can be used directly in DDL (see the ScalarFunction sketch above).
  2. Restriction: because computed columns are not currently supported, any computed column in the DDL must be a partition key. (PR #4625 adds support for computed columns; once that feature lands, this restriction is no longer needed.)
  3. By analyzing the computed column's expression, we can recover all the information about the partition key (a sketch follows this list):
  • The computed column's name becomes the partition key's column name.
  • The function name in the expression maps to the transform.
  • The expression's arguments supply the transform's arguments: the source column and, where applicable, the width.
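
To make step 3 concrete, the following is a minimal, hypothetical sketch of mapping the analyzed expressions onto an Iceberg PartitionSpec. Only PartitionSpec.Builder is a real Iceberg API; ParsedTransform and everything around it are assumptions for illustration, not the actual implementation:

import java.util.List;
import java.util.Map;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;

public class ComputedColumnPartitionSpec {

  // Result of analyzing one computed column, e.g. par_ts AS days(ts).
  static final class ParsedTransform {
    final String function;      // "years", "months", "days", "hours", "truncates", "buckets"
    final String sourceColumn;  // physical column the transform reads
    final int width;            // only meaningful for truncates/buckets

    ParsedTransform(String function, String sourceColumn, int width) {
      this.function = function;
      this.sourceColumn = sourceColumn;
      this.width = width;
    }
  }

  static PartitionSpec build(Schema schema,
                             List<String> partitionKeys,
                             Map<String, ParsedTransform> computedColumns) {
    PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
    for (String key : partitionKeys) {
      ParsedTransform t = computedColumns.get(key);
      if (t == null) {
        // A plain physical column in PARTITIONED BY becomes an identity partition.
        builder.identity(key);
        continue;
      }
      switch (t.function) {
        case "years":     builder.year(t.sourceColumn, key);              break;
        case "months":    builder.month(t.sourceColumn, key);             break;
        case "days":      builder.day(t.sourceColumn, key);               break;
        case "hours":     builder.hour(t.sourceColumn, key);              break;
        case "truncates": builder.truncate(t.sourceColumn, t.width, key); break;
        case "buckets":   builder.bucket(t.sourceColumn, t.width, key);   break;
        default:
          throw new IllegalArgumentException("Unknown transform: " + t.function);
      }
    }
    return builder.build();
  }
}

For the example DDL above, this would yield a spec equivalent to day(ts) as par_ts, identity(id), and truncate(prop, 6) as par_prop.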


Top GitHub Comments

1 reaction
kbendick commented, Jun 13, 2022

So far, I want to say that I like a lot about where this is going @wuwenchi.

I think working on the UDF transforms would be a good first step (as those would have benefit regardless).

> I think this has no effect on other engines, because watermarks and computed columns do not actually store data; they just add some logical processing at query time, and that logic only takes effect in Flink. What is actually stored is the data of the original physical columns, and the underlying format is unchanged.

My one concern here would be that users often use one engine, say Flink, for writing and then other engines later for processing. It might be the case that people want these computed columns reflected in the data (possibly even stored, though for partition transforms and partitions in general, the partition field's value isn't typically stored multiple times).

It might be the case that certain things cannot be done other than with Flink. Watermarks might be one of those. Though there could be steps we could take to make as much of this information as possible available to downstream consumers.

For example, using the Iceberg sequence ID as a form of watermark has been discussed before (as it's generally monotonically increasing). While other engines might not have native support for watermarks, at least having the data available would be beneficial.

TL;DR: Again, I think the UDFs you mentioned would be great to work on first, as they have value regardless of how we proceed next (for example, users might want to query with Iceberg's bucket function as a way to more narrowly specify a subset of data to act on).

> Of course, maybe the above things can be implemented with Calcite, but since I am not particularly familiar with Calcite, the implementation may be more complicated, and it may also require cooperation from Flink, so we prefer this simpler approach.

I don’t have much knowledge of Calcite either, but in the medium to long term, I think it would be good to reach out to the Flink community to have some of these concepts supported more natively.

My concern with using names directly to infer the function is that many users might already have columns with those names (_years, etc.) in their data.

But with the UDF approach, that issue goes away (as users can already choose the partition column name, for example in Spark using ALTER TABLE … ADD PARTITION FIELD bucket(16, id) AS shard).

Would you be interested in doing a POC or PR of just the transformation functions at first? Then at the community sync up we can possibly bring this up and form a working group to get input from others on this subject 🙂

0 reactions
github-actions[bot] commented, Dec 12, 2022

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label ‘not-stale’, but commenting on the issue is preferred when possible.
