
[SUPPORT] Retrieving latest completed commit timestamp via HoodieTableMetaClient in PySpark


Describe the problem you faced

I am not experiencing a problem. However, I would like to request advice/peer review to ensure I am using the Hudi Java classes and methods in the most appropriate manner.

Goal: Retrieve the timestamp of the latest completed commit in a Hudi table, loading only Hudi metadata files from S3 in the process.

The sample code under To Reproduce below shows the approach I am using to accomplish this goal in a PySpark ETL script via HoodieTableMetaClient.

The overall idea is to save the timestamp of the latest completed commit on the source Hudi table as a bookmark so the next ETL script run can process only the incremental changes after that point.
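
To make the bookmark idea concrete, here is a minimal sketch of the save/load pattern. Assumptions: the bookmark is a plain local text file (in practice it might live in S3, SSM Parameter Store, or DynamoDB), and both helper names are hypothetical, not Hudi APIs.

```python
import os
from typing import Optional

def save_bookmark(path: str, commit_ts: str) -> None:
    """Persist the latest processed Hudi commit timestamp (hypothetical helper)."""
    with open(path, "w") as f:
        f.write(commit_ts)

def load_bookmark(path: str) -> Optional[str]:
    """Return the previously saved commit timestamp, or None on the first run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip() or None
```

On each run, the script would load the bookmark, process changes after that instant, and save the new latest commit timestamp at the end.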

General questions:

  • Is this approach valid? If not, what alternative do you suggest?
  • Do the Hudi classes and methods I use have relatively stable public interfaces that are not likely to change significantly over time?
  • As development progresses, are there any plans to expose parts of Hudi’s API via Python?

I appreciate your time and expertise! Thanks for creating and maintaining this incredible framework!

To Reproduce

Sample code:

# sc already exists within the PySpark session.
source_path = "s3a://example-bucket/example-table/"
# Build a meta client for the table; this reads only the .hoodie metadata
# directory, not the data files.
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
client = (
    sc._jvm
    .org.apache.hudi.common.table.HoodieTableMetaClient
    .builder()
    .setConf(sc._jsc.hadoopConfiguration())
    .setBasePath(source_path)
    .build()
)
# Restrict the timeline to completed commit instants only.
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
timeline = client.getCommitsTimeline().filterCompletedInstants()
# lastInstant() returns a Hudi Option; py4j converts the Python None passed to
# orElse() into a Java null, and maps the null result back to Python None when
# the timeline is empty.
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/util/Option.java
# https://github.com/apache/hudi/blob/release-0.9.0/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
last_instant = timeline.lastInstant().orElse(None)
if last_instant:
    last_processed = last_instant.getTimestamp()
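
To show how the retrieved timestamp would feed the next incremental run, here is a hedged sketch of building the read options. The two option keys are the documented Hudi 0.9.0 incremental-query options; the helper function name is my own, not part of Hudi.

```python
def incremental_read_options(begin_instant: str) -> dict:
    """Options for a Hudi incremental query starting strictly after begin_instant."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }

# Usage (inside a Spark session):
# df = (spark.read.format("hudi")
#       .options(**incremental_read_options(last_processed))
#       .load(source_path))
```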

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 3.1.1

  • Hive version : 2.3.7

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3A

  • Running on Docker? (yes/no) : yes

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
xushiyan commented, Sep 29, 2021

Ok @bryanburke, I think your approach is valid. The meta client APIs should be quite stable, and even if they change, there should be a deprecation period to allow transition. You may also consider this example to get the latest commit:

https://hudi.apache.org/docs/quick-start-guide#incremental-query

commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").limit(50).collect()))

As for more Python API support, we don’t have this ranked highly in the roadmap. If you’re keen, please feel free to drive this feature. You could start by sending a [DISCUSS] email to the dev mailing list to gather more input. Thanks for illustrating your ideas. Closing this now. Feel free to follow up here or through the mailing list.
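
A side note on comparing commit timestamps (my observation, not from the thread): Hudi instant times in this release are fixed-width `yyyyMMddHHmmss` digit strings, so plain string comparison agrees with chronological order. That is why ordering `commitTime` values as strings, or comparing `getTimestamp()` against a saved bookmark, is safe. The helper below is hypothetical:

```python
def is_new_commit(commit_ts: str, bookmark: str) -> bool:
    """True if commit_ts is strictly after the saved bookmark.

    Hudi instant times are fixed-width digit strings (yyyyMMddHHmmss),
    so lexicographic order equals chronological order.
    """
    return commit_ts > bookmark
```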

0 reactions
bryanburke commented, Sep 28, 2021

@xushiyan Thank you for your response! I had no idea Hudi provides event-driven features, so your suggestions are helping me learn quite a bit more about the framework. While I do not believe we have a use case currently for SourceCommitCallback and S3EventsSource (see below), we may in the future if we start processing streaming data or require a long-running Spark cluster.

Please check out org.apache.hudi.utilities.callback.SourceCommitCallback and its implementing classes. This would allow you to trigger downstream jobs or logic to run.

Reading over my original post above, I believe I did a somewhat poor job defining our exact use case. I can provide some more details for context:

  • We process data in batches on a schedule.
  • We do not have a long-running Spark cluster.
  • ETL jobs run on transient Spark clusters (e.g., Amazon EMR/AWS Glue).
  • By the time a downstream job runs, the cluster that ran the prerequisite job does not necessarily exist anymore.

Given the above, I do not believe SourceCommitCallback meets our use case, as the downstream job that reads the Hudi table in S3 runs on a separate schedule and (most likely) a completely different transient Spark cluster. However, please feel free to provide additional insight if I am missing something.

What Hudi APIs are you referring to? You can do all the actions through PySpark.

Regarding my other question about exposing Hudi APIs via Python, I am referring to HoodieTableMetaClient itself and associated timeline classes (for example, as requested in HUDI-1998). However, I suppose the question could also extend to the configuration classes like HoodieWriteConfig and DataSourceWriteOptions that the 0.9.0 release notes indicate are now preferable over string variables.
