[Feature Request] provide OPTIMIZE with dataframe result in pyspark SQL interface

Feature request

Maybe I’m doing something wrong, but running OPTIMIZE from the Spark SQL interface doesn’t seem to do anything (version 1.2.1).

It would be great if the OPTIMIZE command could be run from the Spark SQL interface in Python and returned a DataFrame containing the statistics mentioned in the documentation.

Something like the following:

df = spark.sql("OPTIMIZE delta.`/path/to/delta/table`")

Overview

Return a DataFrame containing the statistics from the OPTIMIZE command.

Motivation

Notebooks like Zeppelin are not used on our Cloudera stack, so we need to run OPTIMIZE either from the Python API for DeltaTable or through the Spark SQL interface.

Further details

According to the documentation, the following SQL should be executed for file compaction (automatic resizing based on the optimization settings):

OPTIMIZE '/path/to/delta/table' -- Optimizes the path-based Delta Lake table

OPTIMIZE delta_table_name;

OPTIMIZE delta.`/path/to/delta/table`;

Further in the documentation, it is mentioned that OPTIMIZE should return a set of statistics:

OPTIMIZE returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation. Optimize stats also contains the number of batches, and partitions optimized
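For illustration, here is a minimal sketch of how the request could look from PySpark. It assumes a Delta Lake build in which `spark.sql` accepts OPTIMIZE and hands back a DataFrame of statistics, which is the behavior this issue asks for. The helper names `optimize_statement` and `run_optimize` are hypothetical, and `spark` is assumed to be an existing SparkSession configured with the Delta extensions.

```python
def optimize_statement(target: str, is_path: bool = True) -> str:
    """Build an OPTIMIZE statement for a path-based or catalog table."""
    if is_path:
        # Path-based tables use the delta.`<path>` form quoted above.
        return f"OPTIMIZE delta.`{target}`"
    # Catalog tables are referenced by name.
    return f"OPTIMIZE {target}"


def run_optimize(spark, target: str, is_path: bool = True):
    """Run OPTIMIZE via spark.sql and return whatever DataFrame comes back."""
    return spark.sql(optimize_statement(target, is_path))


# Hypothetical usage, assuming `spark` is a Delta-enabled SparkSession:
#   stats = run_optimize(spark, "/path/to/delta/table")
#   stats.printSchema()         # inspect which statistics are exposed
#   stats.show(truncate=False)  # print the statistics themselves
```

The sketch does not escape backticks inside the path, so it is only suitable for trusted, well-formed locations.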

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9

Top GitHub Comments

1 reaction
brysd commented, May 9, 2022

@vkorukanti I was a bit too fast and didn’t see your previous message. OK, clear. This can be closed. Thanks for your feedback and assistance!

1 reaction
vkorukanti commented, May 9, 2022

@pedrosalgadowork @brysd The small files are not expected to be deleted as part of OPTIMIZE. They are still part of the DeltaLog transaction (for table versions before the OPTIMIZE). If you want to clean up the small files or files from very old commits, use VACUUM.
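A rough sketch of that cleanup step from PySpark follows, using the SQL form `VACUUM delta.`<path>` RETAIN <n> HOURS`. The helper names `vacuum_statement` and `run_vacuum` are hypothetical, `spark` is assumed to be a Delta-enabled SparkSession, and the 168-hour value in the usage comment is only an example matching the usual seven-day default retention.

```python
from typing import Optional


def vacuum_statement(path: str, retain_hours: Optional[int] = None) -> str:
    """Build a VACUUM statement for a path-based Delta table."""
    stmt = f"VACUUM delta.`{path}`"
    if retain_hours is not None:
        # Files no longer referenced and older than this threshold are removed.
        stmt += f" RETAIN {retain_hours} HOURS"
    return stmt


def run_vacuum(spark, path: str, retain_hours: Optional[int] = None):
    """Run VACUUM via spark.sql; omitting retain_hours uses the table default."""
    return spark.sql(vacuum_statement(path, retain_hours))


# Hypothetical usage, assuming `spark` is a Delta-enabled SparkSession:
#   run_vacuum(spark, "/path/to/delta/table", retain_hours=168)
```

Note that once files are vacuumed, table versions that depended on them can no longer be read, so a short retention window trades time travel for storage.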
