Roadmap 2022 H2 (discussion)

This is a working issue for folks to provide feedback on the prioritization of the Delta Lake roadmap spanning July to December 2022. With the release of Delta Lake 2.0, we wanted to take the opportunity to discuss other vital features for prioritization with the community, based on feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), the Roadmap 2022H2 (discussion), and more.

Note: tasks that are crossed out (e.g., ~00~) have been completed.

To review the Delta Rust roadmap only, please refer to https://go.delta.io/rust-roadmap for more information.

Priority 0

We will focus on these issues and continue to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description |
|---|---|---|---|
| ~256~ | Flink | Flink Source | Build Flink source to read Delta tables in batch and streaming jobs |
| 238 | Flink | Flink SQL + Table API + Catalog Support | After Flink Sink and Source, build support for Flink Catalog, SQL, and Table API |
| 411, 410 | Flink | Productionize support for all cloud object stores | Make sure that Flink Sink can write robustly to S3, GCS, ADLS2 with full transactional guarantees |
| ~610~ | Rust | Integrate with a common object-store abstraction from arrow / Rust ecosystem | This will allow us to provide a more convenient and performant API on the Rust and Python side |
| ~575~ | Rust | Support V2 writer protocol | Use the PyArrow-based writer function (write_deltalake) to support writer protocol V2 and the object stores S3, GCS, and ADLS2 |
| ~761~ | Rust | Expand write support for cloud object stores | Write to object stores S3, GCS, and ADLS2 from multiple clusters with full transactional guarantees |
|  | Rust | DAT Integration | Delta Acceptance Tests running in CI |
|  | Rust | Rust documentation | First pass at Rust docs |
|  | Rust | Rust blogging | Blog post for the Rust community |
| 632 | Rust | Commit protocol | Fully protocol compliant optimistic commit protocol |
| 851 | Rust | Rust writer | Refactor Rust writer API to be flexible for others wishing to build upon delta-rs |
| ~1257~ | Spark | Release Delta 2.1 on Apache Spark 3.3 | Ensure the latest version of Delta Lake works with the latest version of Apache Spark™ |
| 1367 | Spark | Support reading tables with Deletion Vectors | Allow reads on tables that have deletion vectors to mark rows in parquet files as removed |
| 1408 | Spark | Support Table Features protocol | Upgrade the protocol to use Table Features to specify the features needed to read/write to a table |
| ~1242~ | Spark | Support time travel SQL syntax | Delta currently supports time travel via Python and Scala APIs. We would like to extend support for the SQL syntax VERSION AS OF and TIMESTAMP AS OF in SELECT statements |
|  | Standalone | Extend Delta Standalone for higher protocol versions | Extend Delta Standalone to support logs using higher protocol versions and advanced features like constraints, generated columns, column mapping, etc. |
|  | Standalone | Expand support for data skipping in Delta Standalone | Extend the current data skipping to skip files using column stats and more expressions |
|  | Website | Updated Delta Lake documentation | Move Delta Lake documentation to the website GitHub repo to allow easier community collaboration |
|  | Website | Consolidate all connector documentation | Consolidate docs of all connectors in the website GitHub repo |
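
To make the "Commit protocol" task (issue 632) more concrete, here is a toy sketch of an optimistic commit: a writer tries to create the next numbered log entry and loses if that file already exists. It runs against a local directory and is not the delta-rs implementation. The helper names (`log_path`, `latest_version`, `try_commit`, `commit_with_retry`) are hypothetical, and a real writer would also have to check the winning commits for logical conflicts before retrying.

```python
import json
import os


def log_path(table_dir: str, version: int) -> str:
    # Delta log entries are JSON files named by a zero-padded 20-digit version.
    return os.path.join(table_dir, "_delta_log", f"{version:020d}.json")


def latest_version(table_dir: str) -> int:
    # Highest committed version found in the log directory, or -1 if empty.
    log_dir = os.path.join(table_dir, "_delta_log")
    versions = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def try_commit(table_dir: str, version: int, actions: list) -> bool:
    """Try to create the log entry for `version`; False means another writer won."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY  # fail if the file already exists
    try:
        fd = os.open(log_path(table_dir, version), flags)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")  # one JSON action per line
    return True


def commit_with_retry(table_dir: str, actions: list, max_attempts: int = 5) -> int:
    # Assumption: retrying blindly is safe here. A real implementation must
    # first read the commits it lost to and verify they do not conflict.
    for _ in range(max_attempts):
        version = latest_version(table_dir) + 1
        if try_commit(table_dir, version, actions):
            return version
    raise RuntimeError("too many concurrent commits")
```

On a local filesystem the exclusive-create flag provides the "put if absent" step. Cloud object stores do not all offer that primitive, which is one reason the tasks above call out multi-cluster writes to S3, GCS, and ADLS2 separately.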

Priority 1

We should be able to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description |
|---|---|---|---|
| 4 | Core | Delta Acceptance Testing (DAT) | With various languages interacting with the Delta protocol (e.g., Delta Standalone, Delta Spark, Delta Rust, Trino, etc.), we propose to have the same reference tables and library of reference tests to ensure all Delta APIs remain in compliance |
| 1347 | Core | Support Bloom filters | Improve query performance by utilizing bloom filters. The approach is TBD due to recent updates to Apache Parquet to support bloom filters |
| 1387 | Core | Enable Delta clone | Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not |
|  | Delta connectors | GoLang Delta connector | Support GoLang reading a Delta Lake table natively |
|  | Delta connectors | Improve partition filtering in Power BI client | Improved partition filtering using built-in UI filters in Power BI |
|  | Delta connectors | Pulsar Source connector | Support Apache Pulsar reading a Delta Lake table natively |
|  | Flink | Column stats generation in Flink Sink | Make the Flink Delta sink generate column stats |
|  | Presto/Trino | Support higher protocol versions in Presto and Trino | Use Standalone to support higher protocol versions |
|  | Rust | Delta Rust API Updates | Update APIs and support more high-level operations on top of delta; this includes better conflict resolution |
|  | Rust | Better support for large logs | Better support for handling large Delta logs/snapshots |
|  | Sharing Connectors | GoLang Delta Sharing client | Support GoLang client for Delta Sharing |
|  | Sharing Connectors | R Delta Sharing client | Support R client for Delta Sharing |
| 1072 | Spark | Support for Identity columns | Create an identity column that will be automatically assigned a unique and statistically increasing (or decreasing if the step is negative) value |
|  | Spark | Support querying Change Data Feed (CDF) using SQL queries | To support querying CDF using SQL queries in Apache Spark, we need to allow custom TVFs to be resolved using injected rules |
| 1156 | Spark | Support Auto Compaction | Provide auto compaction functionality to simplify compaction tasks |
| 1198 | Spark | Support Optimize Writes | Optimize Spark to Delta Lake writes |
| 1462 | Spark | Enable converting from Iceberg to Delta | Enable converting parquet-backed Iceberg tables to Delta tables without rewriting parquet files |
| 1464 | Spark | Shallow clone Iceberg tables | Enable shallow cloning parquet-backed Iceberg tables following the Delta protocols without the need to copy all of the data |
| ~1349~ | Spark | Improve semantics of column mapping and Change Data Feed | Improve semantics of how column renames/drops (aka column mapping) interact with CDF and streaming |
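
The deep versus shallow distinction in the "Enable Delta clone" task (issue 1387) can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Delta Lake's implementation: `clone_table` and its parameters are hypothetical, and a table snapshot is reduced to a list of data file paths.

```python
import os
import shutil


def clone_table(source_files: list, target_dir: str, deep: bool) -> list:
    """Return the data file paths the cloned table would reference.

    source_files: paths of the data files in the source table at the
    version being cloned (a simplification of a real snapshot).
    """
    os.makedirs(target_dir, exist_ok=True)
    if not deep:
        # Shallow clone: only metadata is created; the clone keeps
        # pointing at the source table's data files.
        return list(source_files)
    cloned = []
    for path in source_files:
        # Deep clone: every data file is copied into the target location.
        dest = os.path.join(target_dir, os.path.basename(path))
        shutil.copy2(path, dest)
        cloned.append(dest)
    return cloned
```

In this model a shallow clone is cheap to create but still depends on the source files existing, while a deep clone costs a full copy and is independent of the source afterwards.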

Priority 2

Nice to have

| Issue | Category | Task | Description |
|---|---|---|---|
|  | Sharing | Share individual partitions | Support Sharing individual partitions in Delta Sharing |
|  | Sharing Connectors | Rust Delta Sharing client | Support Rust client for Delta Sharing |
|  | Sharing Connectors | Starburst/Trino Delta Sharing connector | Support Starburst/Trino client for Delta Sharing |
|  | Sharing Connectors | Airflow Delta Sharing connector | Support sharing data from Airflow sensor |
|  | Rust | Process | Release improvements |

History

  • 2022-08-01: Initial creation
  • 2022-08-02: Delta Sharing updates
  • 2022-08-08: Include Identity columns in the roadmap
  • 2022-09-13: Update issues and add auto compaction, optimize writes, and bloom filters to the roadmap
  • 2022-09-19: Update to include Delta Clone
  • 2022-09-22: Included working Delta Rust roadmap document
  • 2022-10-26: Included updated Delta Rust roadmap in GitHub link
  • 2022-10-27: Included converting and shallow cloning Iceberg to Delta

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 11
  • Comments: 25 (13 by maintainers)

Top GitHub Comments

4 reactions
dennyglee commented, Oct 26, 2022

Suggest we add Airbyte Destination S3: add delta lake/delta table support to the roadmap as it’s already part of the Delta Rust Roadmap - WDYT?

3 reactions
tdas commented, Aug 25, 2022

“Delta caching” is actually a Databricks Runtime engine feature, not part of the format. Caching data on a processing engine’s executor/worker nodes is something that can really be done well by the engine itself, not by a data format. It’s unfortunate and confusing that we had marketed it under the “Delta” brand name, even though it’s really not part of the “Delta Lake” storage format. So, in short, it’s not really possible to open source that as part of Delta Lake.
