[SUPPORT] Retrieving latest completed commit timestamp via HoodieTableMetaClient in PySpark

Describe the problem you faced
I am not experiencing a problem. I would, however, like to request advice/peer review to ensure I am using the Hudi Java classes and methods in the most appropriate manner.
Goal: Retrieve the timestamp of the latest completed commit in a Hudi table, loading only Hudi metadata files from S3 in the process.
The sample code under To Reproduce below shows the approach I am using to accomplish this goal in a PySpark ETL script via HoodieTableMetaClient.
The overall idea is to save the timestamp of the latest completed commit on the source Hudi table as a bookmark so the next ETL script run can process only the incremental changes after that point.
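To make the bookmark idea concrete, here is a minimal sketch of persisting and reloading the commit timestamp between ETL runs. It is an assumption-laden illustration: the local JSON file stands in for wherever the bookmark would actually live (S3, a parameter store, etc.), and the function names are made up for this example.

```python
import json
import os

# Hypothetical bookmark store for illustration only; a real ETL job would
# more likely persist this to S3 or a job-parameter store.
def save_bookmark(path, commit_ts):
    """Persist the latest processed commit timestamp atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_commit_ts": commit_ts}, f)
    os.replace(tmp, path)  # atomic rename avoids a torn/partial bookmark file

def load_bookmark(path):
    """Return the saved timestamp, or None on the first run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f).get("last_commit_ts")
```

On the first run `load_bookmark` returns `None`, signalling a full read; every later run reads only changes after the saved instant.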
General questions:
- Is this approach valid? If not, what alternative do you suggest?
- Do the Hudi classes and methods I use have relatively stable public interfaces that are not likely to change significantly over time?
- As development progresses, are there any plans to expose parts of Hudi’s API via Python?
I appreciate your time and expertise! Thanks for creating and maintaining this incredible framework!
To Reproduce
Sample code:
```python
# sc already exists within the PySpark session.
source_path = "s3a://example-bucket/example-table/"

# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
client = (
    sc._jvm
    .org.apache.hudi.common.table.HoodieTableMetaClient
    .builder()
    .setConf(sc._jsc.hadoopConfiguration())
    .setBasePath(source_path)
    .build()
)

# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
timeline = client.getCommitsTimeline().filterCompletedInstants()

# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/util/Option.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
last_instant = timeline.lastInstant().orElse(None)
if last_instant:
    last_processed = last_instant.getTimestamp()
```
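One property worth noting for the bookmark comparison (an observation about Hudi 0.9.0, stated here as a hedged aside rather than a guarantee): instant timestamps are formatted as `yyyyMMddHHmmss` strings, so plain lexicographic string comparison coincides with chronological order. The sample values below are illustrative only.

```python
# Hudi 0.9.0 instant timestamps are "yyyyMMddHHmmss" strings, so sorting
# them as strings matches sorting them as points in time.
instants = ["20210901120000", "20211115143025", "20210315080000"]

latest = max(instants)          # the latest completed instant
bookmark = "20210901120000"     # a previously saved bookmark value

# Instants committed strictly after the bookmark, oldest first.
new_instants = sorted(t for t in instants if t > bookmark)
```

This is why the string returned by `getTimestamp()` can be stored and compared directly, without parsing it into a datetime.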
Environment Description

- Hudi version : 0.9.0
- Spark version : 3.1.1
- Hive version : 2.3.7
- Hadoop version : 3.2.1
- Storage (HDFS/S3/GCS…) : S3A
- Running on Docker? (yes/no) : yes
Issue Analytics

- Created: 2 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
Ok @bryanburke, I think your approach is valid. The metaclient APIs should be quite stable, and even if they were to change, there should be a deprecation period to allow a transition. You may also consider this example to get the latest commit:
https://hudi.apache.org/docs/quick-start-guide#incremental-query
As for more Python API support, we don't have this ranked high on the roadmap. If you're keen, please feel free to drive this feature. You could start by sending a [DISCUSS] email to the dev mailing list to gather more input. Thanks for illustrating your ideas. Closing this now. Feel free to follow up here or through the mailing list.
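For reference, the incremental query that the maintainer links above looks roughly like the following in PySpark. This is a sketch under assumptions: `spark` and `source_path` are taken from the surrounding context, the bookmark value is a placeholder, and the option keys are the Hudi 0.9.0 DataSource read options.

```python
# Read only the commits made after the saved bookmark (Hudi 0.9.0 option keys).
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20210901120000",  # bookmark placeholder
}

# In a live session (sketch, not executed here):
# df = (
#     spark.read.format("hudi")
#     .options(**incremental_opts)
#     .load(source_path)
# )
```

The `begin.instanttime` value is exclusive, so passing the bookmarked timestamp yields only changes committed after it.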
@xushiyan Thank you for your response! I had no idea Hudi provides event-driven features, so your suggestions are helping me learn quite a bit more about the framework. While I do not believe we currently have a use case for `SourceCommitCallback` and `S3EventsSource` (see below), we may in the future if we start processing streaming data or require a long-running Spark cluster.

Reading over my original post above, I believe I did a somewhat poor job defining our exact use case. I can provide some more details for context:
Given the above, I do not believe `SourceCommitCallback` meets our use case, as the downstream job that reads the Hudi table in S3 runs on a separate schedule and (most likely) on a completely different transient Spark cluster. However, please feel free to provide additional insight if I am missing something.

Regarding my other question about exposing Hudi APIs via Python, I am referring to `HoodieTableMetaClient` itself and the associated timeline classes (for example, as requested in HUDI-1998). However, I suppose the question could also extend to configuration classes like `HoodieWriteConfig` and `DataSourceWriteOptions`, which the 0.9.0 release notes indicate are now preferable to string variables.