[Feature Request] Enable LOAD DATA for delta tables.
Feature request
Overview
Currently, Delta Lake on Databricks has the “COPY INTO” DML statement, and vanilla Parquet datasets in Spark support the “LOAD DATA” DML statement. However, I can’t find any existing work on LOAD DATA support in Delta Lake, and there are currently tests that make sure these statements raise “not supported” warnings.
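For reference, this is roughly what the statement looks like in Spark SQL today; the table names (`events_parquet`, `events_delta`), path, and `ds` partition column below are placeholders, not anything from existing code or tests:

```scala
// Existing behaviour: LOAD DATA moves a file into a (non-Delta) table's location.
spark.sql("""
  LOAD DATA INPATH '/staging/events/part-00000.parquet'
  INTO TABLE events_parquet
  PARTITION (ds = '2023-01-01')
""")

// Desired behaviour: the same statement against a Delta table, with the new file
// recorded in the Delta transaction log instead of raising a "not supported" error.
spark.sql("""
  LOAD DATA INPATH '/staging/events/part-00000.parquet'
  INTO TABLE events_delta
  PARTITION (ds = '2023-01-01')
""")
```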
Motivation
Currently Delta has great support for inserting and writing NEW data into Delta tables. However, patterns where we want to add an existing Parquet file to a Delta table currently require us to read the file into memory and write it into the table (a sketch of that workaround is shown below).
Ideally, for cases where the file already exists and can simply be ‘copied’ into the Delta table, we would support a DML statement that does this while also tracking the changes in the delta log.
This should make some use cases of Delta more efficient, for example writing staging partitions somewhere else, testing them, and then using LOAD DATA to load them into the final Delta table.
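Today that workaround looks roughly like this (a sketch only; the staging path and Delta table path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-into-delta-workaround").getOrCreate()

// Current workaround: read the existing Parquet file(s) back through Spark and
// rewrite them into the Delta table, instead of simply registering the file
// that already exists on storage.
spark.read
  .parquet("/staging/events/ds=2023-01-01")   // placeholder staging location
  .write
  .format("delta")
  .mode("append")
  .save("/warehouse/events_delta")            // placeholder Delta table location
```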
Further details
I’d need to look more into this to figure out exactly what would need to be done, but I’d imagine it would look something like the following (a rough sketch in Scala follows the list):
- Inspect the Parquet file path for schema compatibility
- Check for a partition spec
- Calculate the delta log changes
- Copy the file into its new location
- Commit to the transaction log
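As a very rough sketch of those steps, assuming Delta’s internal Scala APIs (`DeltaLog`, `OptimisticTransaction`, `AddFile`, `DeltaOperations`); these are internal classes whose exact signatures may differ between versions, and the paths, table location, and `ds` partition column below are placeholders:

```scala
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.delta.{DeltaLog, DeltaOperations}
import org.apache.spark.sql.delta.actions.AddFile

val spark      = SparkSession.active
val tablePath  = new Path("/warehouse/events_delta")       // placeholder table location
val sourceFile = new Path("/staging/part-00000.parquet")   // placeholder source file

val deltaLog = DeltaLog.forTable(spark, tablePath)
val txn      = deltaLog.startTransaction()

// 1. Inspect the Parquet file for schema compatibility with the table.
val fileSchema = spark.read.parquet(sourceFile.toString).schema
require(fileSchema == txn.metadata.schema, "source file schema does not match table schema")

// 2. Check the partition spec (placeholder partition column/value).
val partitionValues = Map("ds" -> "2023-01-01")
require(txn.metadata.partitionColumns.toSet == partitionValues.keySet, "partition spec mismatch")

// 3. Copy the file into the table's directory (no rewrite of the data itself;
//    assumes source and table live on the same file system).
val fs       = tablePath.getFileSystem(spark.sessionState.newHadoopConf())
val relative = s"ds=2023-01-01/${sourceFile.getName}"
FileUtil.copy(fs, sourceFile, fs, new Path(tablePath, relative), false, fs.getConf)

// 4./5. Calculate the delta log change (an AddFile action) and commit it.
val status = fs.getFileStatus(new Path(tablePath, relative))
val addFile = AddFile(
  path = relative,                    // path is relative to the table root
  partitionValues = partitionValues,
  size = status.getLen,
  modificationTime = status.getModificationTime,
  dataChange = true)

txn.commit(Seq(addFile), DeltaOperations.Write(SaveMode.Append))
```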
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- Yes. I can contribute this feature independently.
- Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- No. I cannot contribute this feature at this time. (I’d be willing, but if this isn’t a good ‘first’ issue, it may require more knowledge/expertise than I have.)
Comments: 5 (3 by maintainers)
This is definitely something worth looking at. We are not putting this on our roadmap (#1307) right now as there are many items there already. But if anyone in the community would like to give it a try, feel free to discuss it in the issue.
We would leverage Spark to read other source formats (Parquet, JSON, etc.); we don’t want to rebuild them.
There is probably no existing code you can use as an example.
We prefer Scala since the entire code path relies heavily on Scala.
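For illustration of the point above about leveraging Spark’s readers, an implementation would simply go through Spark’s built-in datasource readers; the helper below is hypothetical, not existing code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: reuse Spark's built-in datasource readers for whatever
// format LOAD DATA / COPY INTO points at, rather than re-implementing them.
def readSource(spark: SparkSession, format: String, path: String): DataFrame =
  spark.read.format(format).load(path)   // e.g. "parquet", "json", "csv"
```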