Roadmap 2022 H1 (discussion)
This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap to be completed by June 2022. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!
Performance Optimizations
Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.
Issue | Description | Target CY2022 |
---|---|---|
927 | OPTIMIZE (file compaction): Table optimize is an operation to rearrange the data and/or metadata to speed up queries and/or reduce the metadata size | Released in 1.2 |
923 | File skipping using column stats: This is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses) on non-partitionBy columns. | Released in 1.2 |
931 | Automatic data skipping using generated columns: Enhance generated columns to include automatic data skipping | Released in 1.2 |
1134 | OPTIMIZE ZORDER: Data clustering via multi-column locality-preserving space-filling curves with offline sorting. | Q3/Q4 |
| | MERGE Performance Improvements: We will be providing a project improvement plan (PIP) document shortly on the proposed design for discussion. | Q2/Q3 |
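To make the idea behind file skipping with column stats concrete, here is a minimal pure-Python sketch. It is not Delta Lake's implementation: the `FileStats` structure, the `files_to_scan` helper, and the file names are all hypothetical, and it assumes each data file carries per-column minimum and maximum values. A query filter on a non-partition column can then rule out files whose value range cannot match.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileStats:
    """Per-file min/max statistics for each column (hypothetical structure)."""
    path: str
    min_values: Dict[str, int]
    max_values: Dict[str, int]


def files_to_scan(files: List[FileStats], column: str, lo: int, hi: int) -> List[str]:
    """Return paths of files whose [min, max] range for `column` overlaps [lo, hi]."""
    selected = []
    for f in files:
        # A file without stats for the column cannot be skipped safely.
        if column not in f.min_values or column not in f.max_values:
            selected.append(f.path)
            continue
        # Keep the file only if its value range overlaps the filter range.
        if f.max_values[column] >= lo and f.min_values[column] <= hi:
            selected.append(f.path)
    return selected


files = [
    FileStats("part-0.parquet", {"event_id": 1}, {"event_id": 100}),
    FileStats("part-1.parquet", {"event_id": 101}, {"event_id": 200}),
    FileStats("part-2.parquet", {"event_id": 201}, {"event_id": 300}),
]

# Rough equivalent of: WHERE event_id BETWEEN 150 AND 220
print(files_to_scan(files, "event_id", 150, 220))
```

In this toy example the first file is never read, because its maximum `event_id` is below the lower bound of the filter. Skipping pays off most when similar values are stored close together, which is presumably why the roadmap pairs it with OPTIMIZE ZORDER.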
Schema Operations
For this year, our focus will be on column mapping.
Issue | Description | Target CY2022 |
---|---|---|
958 | Support for renaming column: Rename column with ALTER TABLE | Released in 1.2 |
957 | Support for arbitrary column names: Support characters in column names not allowed by Parquet | Released in 1.2 |
1064 | Support for dropping columns: Drop column with ALTER TABLE | Released in 2.0 |
348 | Support for dynamic partition overwrite: Currently you can overwrite using the replaceWhere option but in various scenarios, it is more convenient to specify overwrite partition. | Q2 |
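The difference between a predicate-based overwrite and a dynamic partition overwrite can be illustrated with a small pure-Python model. This is only a sketch of the semantics as commonly understood, not Delta Lake code: the table is modeled as a dictionary from partition value to rows, and `dynamic_partition_overwrite` is a hypothetical helper. The point is that the partitions to replace are derived from the incoming data rather than spelled out by the writer.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]
Table = Dict[str, List[Row]]  # partition value -> rows in that partition


def dynamic_partition_overwrite(table: Table, new_rows: List[Row], partition_col: str) -> Table:
    """Replace only the partitions that appear in `new_rows`; keep all others."""
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[str(row[partition_col])].append(row)

    # Copy the existing partitions, then swap in every partition present in the new data.
    result = {part: list(rows) for part, rows in table.items()}
    result.update(incoming)
    return result


table = {
    "2022-05-01": [{"date": "2022-05-01", "clicks": 10}],
    "2022-05-02": [{"date": "2022-05-02", "clicks": 7}],
}
new_rows = [
    {"date": "2022-05-02", "clicks": 9},
    {"date": "2022-05-03", "clicks": 4},
]

updated = dynamic_partition_overwrite(table, new_rows, "date")
print(sorted(updated))
print(updated["2022-05-02"])
```

Here the `2022-05-01` partition is left untouched, `2022-05-02` is replaced, and `2022-05-03` is created. With a predicate-based overwrite, the writer would instead have to state which dates are being replaced.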
Integrations
Extending from the recent releases of PrestoDB, Hive 3, and Delta Sink for Apache Flink Streams API, we have additional integrations planned.
Issue | Description | Target CY2022 |
---|---|---|
112 | Delta Source for Apache Pulsar: Build a Pulsar/Delta reader leveraging Delta Standalone. Join us via the Delta Users Slack #connector-pulsar channel. | Q2 |
238 | Flink Sink on Table API: Build a Flink/Delta sink (i.e., Flink writes to Delta Lake) using the Apache Flink Table API. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
110 | Delta Source for Apache Flink: Build a Flink/Delta source (i.e., Flink reads from Delta Lake) leveraging Delta Standalone. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
82 | Delta Source for Trino: Joint Delta Lake and Trino community collaboration on the following PRs: 10987, 10300. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we will have bi-weekly meetings on Thursdays. | Released |
| | Delta Source for BigQuery: Allows BigQuery to natively read Delta Lake tables. | Q2/Q3 |
523, 566 | Delta Rust Writer: Extending Delta Rust API to write to Delta Lake. | Q2/Q3 |
| | Hive/Delta writer: Extending Hive to write to Delta Lake. | Q3 |
Operations Enhancements
Two very popular requests are planned for this half of the year: Table Restore and S3 multi-cluster writes.
Issue | Description | Target CY2022 |
---|---|---|
903, 863 | Table Restore: Rollback to a previous version of a Delta table using Python, Scala, and/or SQL APIs. | Released in 1.2 |
41 | S3 multi-cluster writes: Allows multiple clusters/drivers/JVMs to concurrently write to S3 using DynamoDB as the lock store. Please refer to this PIP: [2021-12-22] Delta OSS S3 Multi-Cluster Writes | Released in 1.2 |
747 | delta.io.Guide: Enhance the Delta Lake documentation by creating a new guide (PIP will follow soon) | Q2/Q3 |
| | Iceberg to Delta Converter: Ability to convert Iceberg table to Delta table without a rewrite. | Q3 |
| | Table Cloning: Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not. | Q3 |
1105 | Change Data Feed: The Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table. | Q2 |
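As a rough illustration of what "row-level changes between versions" means for the change data feed, the following pure-Python sketch derives change events by comparing two snapshots of a keyed table. This is an assumption-laden toy: per the description above, the runtime records change events as data is written rather than diffing snapshots, and the `change_events` helper and its data are hypothetical. The `_change_type` labels mirror the values the change data feed is understood to use (`insert`, `delete`, `update_preimage`, `update_postimage`).

```python
from typing import Dict, List

Row = Dict[str, object]


def change_events(old: Dict[int, Row], new: Dict[int, Row]) -> List[Row]:
    """Derive row-level change events between two table versions keyed by id."""
    events: List[Row] = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            # Row exists only in the new version.
            events.append({**after, "_change_type": "insert"})
        elif after is None:
            # Row exists only in the old version.
            events.append({**before, "_change_type": "delete"})
        elif before != after:
            # An update is reported as a pair: the row before and after the change.
            events.append({**before, "_change_type": "update_preimage"})
            events.append({**after, "_change_type": "update_postimage"})
    return events


version_1 = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
version_2 = {1: {"id": 1, "name": "a2"}, 3: {"id": 3, "name": "c"}}

for event in change_events(version_1, version_2):
    print(event)
```

Downstream consumers can apply such events incrementally instead of re-reading the whole table, which is the main appeal of the feature.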
Updates
- 2022-05-18: Include Issue 348 for the dynamic partition overwrite feature request
- 2022-05-03: Updated tables with Delta Lake 1.2 release.
- 2022-03-08: Based on community feedback, we are also prioritizing Hive/Delta writer, clones, and CDF
If there are other issues that should be considered within this roadmap, let’s have a discussion here or via the Delta Users Slack #deltalake-oss channel.
Issue Analytics
- Created 2 years ago
- Reactions:60
- Comments:18 (9 by maintainers)
Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!
It would be great if CDF were open-sourced at the earliest possible date. I am really interested in this feature!