
[Feature Request] Enable LOAD DATA for delta tables.


Feature request

Overview

Currently, Delta Lake on Databricks supports the COPY INTO DML statement, and vanilla Parquet datasets in Spark support the LOAD DATA DML statement. However, I can't find any ongoing work on LOAD DATA support in Delta Lake, and there are even tests that make sure it raises a "not supported" error.

Motivation

Delta already has great support for inserting and writing new data into Delta tables. However, patterns where we want to add an existing Parquet file to a Delta table currently require us to read the file into memory and write it back into the table, as sketched below.
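
For concreteness, the workaround today looks roughly like this (a minimal sketch assuming a running SparkSession; the staging path and table location are made up for illustration):

```scala
// Current workaround: route the existing Parquet file through Spark,
// paying a full read and rewrite even though the bytes could be copied as-is.
spark.read.parquet("/staging/part-00000.parquet")  // hypothetical staging path
  .write
  .format("delta")
  .mode("append")
  .save("/delta/events")                           // hypothetical Delta table location
```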

Ideally, for cases where the file already exists and can simply be 'copied' into the Delta table, we would support a DML statement that does the copy while also tracking the change in the Delta log.

This would make some Delta use cases more efficient: for example, writing staging partitions somewhere else, testing them, and then using LOAD DATA to move them into the final Delta table (illustrated below).
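
For illustration, using Spark SQL's existing LOAD DATA syntax (which Delta tables currently reject) as the hypothetical target, the staging flow could end with something like this; the table name, path, and partition values are made up:

```scala
// Desired end state (not yet supported): attach an already-validated
// staging partition to the Delta table without rewriting its files.
spark.sql("""
  LOAD DATA INPATH '/staging/events/date=2022-09-14'
  INTO TABLE events PARTITION (date = '2022-09-14')
""")
```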

Further details

I’d need to look more into this to figure out exactly what would need to be done, but I’d imagine something like the following (a rough sketch follows the list):

  1. Inspect the Parquet file path for schema compatibility
  2. Check for a partition spec
  3. Calculate the Delta log changes
  4. Copy the file into its new location
  5. Commit the transaction log
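
As a very rough sketch of steps 3–5, one could imagine something like the following against Delta's internal Scala APIs (DeltaLog, OptimisticTransaction, AddFile). These are real internals, but the exact signatures vary by version, and the validation and file copy of steps 1–4 are elided here; paths and sizes are hypothetical:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.delta.{DeltaLog, DeltaOperations}
import org.apache.spark.sql.delta.actions.AddFile

// Assumes steps 1-2 passed and the file was already copied under the
// table root (step 4); the size below is a stand-in for the real value.
val deltaLog = DeltaLog.forTable(spark, "/delta/events") // hypothetical table path
val txn = deltaLog.startTransaction()
val copiedFileSize = 1024L                               // stand-in for the copied file's size in bytes

val add = AddFile(
  path = "date=2022-09-14/part-00000.parquet",           // relative to the table root
  partitionValues = Map("date" -> "2022-09-14"),
  size = copiedFileSize,
  modificationTime = System.currentTimeMillis(),
  dataChange = true)

// Step 5: commit the AddFile action to the transaction log.
txn.commit(Seq(add), DeltaOperations.Write(SaveMode.Append))
```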

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time. (I’d be willing, but if this isn’t a good ‘first’ issue, it may require more knowledge/expertise than I have.)

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
zsxwing commented, Sep 14, 2022

This is definitely something worth looking at. We are not putting it on our roadmap (#1307) right now, as there are already many items there. But if anyone in the community would like to give it a try, feel free to discuss it in this issue.

0 reactions
zsxwing commented, Sep 27, 2022

> Can this feature be structured as a standalone code module, dynamic library, and/or CLI that interfaces with a service API operating on data passed from memory (so that the source data file format (Parquet, JSON, etc.) is moot, since the API only cares about the cell data bytes sent to it)?

We would leverage Spark to read other source formats (Parquet, JSON, etc.). We don’t want to rebuild them.
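
To illustrate the point: Spark already ships readers for the common source formats, so a LOAD DATA path for Delta would reuse them rather than accept raw cell bytes through a new service API (paths here are hypothetical):

```scala
// Spark's built-in readers already handle source-format parsing.
val fromParquet = spark.read.parquet("/staging/data.parquet")
val fromJson    = spark.read.json("/staging/data.json")
```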

> Can any existing code bases (such as FreeTDS) be modified for this purpose?

There is probably no existing code you can use as an example.

> Are there any restrictions on what language or dev environment this would need to be built in (presumably, if the answer to question 1 is “Yes”, then the answer to this is “No”)?

We prefer Scala, since the entire code path uses Scala heavily.


