[SUPPORT] Read record using index
Describe the problem you faced
I would like to read records from a Hudi table using the record key, to avoid having to scan the entire table.
I’ve read through the examples on how to query a Hudi table, and the Spark Datasource docs mention read(keys),
but it is unclear how to apply this from PySpark.
What I am doing is reading data from a source table (non-Hudi), transforming it, and writing it to a target Hudi table. Sometimes this involves updating existing records in the target, and the merge logic is non-trivial. So the approach I am taking is:
- Read new rows from source => df1
- Read rows to be updated from target (this is where reading by record key would help; see the sketch after this list) => df2
- Union df1 & df2, transform the data => transformed_df
- Upsert target using transformed_df
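A minimal PySpark sketch of this flow, assuming illustration-only paths, a record key field `id`, and a precombine field `updated_at` (none of these come from the issue). Hudi 0.5.x exposes no key-based point read to PySpark; filtering a snapshot read on the `_hoodie_record_key` metadata column is a common workaround, though it does not avoid the scan entirely:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and keys, for illustration only.
source_path = "s3://my-bucket/source_table"
target_path = "s3://my-bucket/hudi/target_table"
keys_to_update = ["id-001", "id-002"]

# Step 1: read new rows from the non-Hudi source => df1
df1 = spark.read.parquet(source_path)

# Step 2: snapshot-read the Hudi target and filter on the
# _hoodie_record_key metadata column => df2. This still plans a scan,
# though Spark may prune files via the pushed-down predicate.
df2 = (
    spark.read.format("org.apache.hudi")
    .load(target_path + "/*/*")  # path glob is needed on 0.5.x snapshot reads
    .where(col("_hoodie_record_key").isin(keys_to_update))
)
# Drop Hudi metadata columns so df1 and df2 schemas line up for the union.
df2 = df2.drop(*[c for c in df2.columns if c.startswith("_hoodie_")])

# Steps 3-4: union, apply the merge/transform logic, then upsert.
transformed_df = df1.unionByName(df2)  # plus whatever transformation applies

hudi_options = {
    "hoodie.table.name": "target_table",
    "hoodie.datasource.write.recordkey.field": "id",           # assumed key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed
    "hoodie.datasource.write.operation": "upsert",
}
(
    transformed_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(target_path)
)
```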
Expected behavior
Read from Hudi table using record keys.
Environment Description
- Hudi version : 0.5.3
- Spark version : 2.4.3 (using AWS Glue 2.0, PySpark)
- Hive version : AWS Glue Catalog
- Hadoop version :
- Storage (HDFS/S3/GCS…) :
- Running on Docker? (yes/no) :
Top GitHub Comments
@calleo Hudi allows you to write custom merge logic at the record level, so you don’t have to read the target table. Instead, you can just provide the input from the non-Hudi source table, define your merge logic, and let Hudi merge the incoming and on-disk data using the logic you have defined.
One way to do this is to implement the HoodieRecordPayload interface in Scala or Java, and either bundle it with the Hudi jars or drop it onto your Spark classpath.
You will also need to configure it as the custom merge logic class; the relevant config is defined here -> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100
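For a PySpark user, wiring in the compiled payload comes down to configuration. A hedged sketch, assuming a hypothetical class com.example.MyMergePayload already compiled and on the Glue/Spark classpath (e.g. via Glue’s --extra-jars job parameter); the Spark datasource exposes the payload class setting as hoodie.datasource.write.payload.class, which maps to the hoodie.compaction.payload.class config linked above:

```python
# Hedged sketch: com.example.MyMergePayload is hypothetical; the class must
# implement org.apache.hudi.common.model.HoodieRecordPayload and be present
# on the Spark/Glue classpath.
hudi_options = {
    "hoodie.table.name": "target_table",
    "hoodie.datasource.write.recordkey.field": "id",           # assumed key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed
    "hoodie.datasource.write.operation": "upsert",
    # Custom merge logic (HoodieRecordPayload implementation):
    "hoodie.datasource.write.payload.class": "com.example.MyMergePayload",
}

(
    df1.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/target_table")  # hypothetical path
)
```

With this in place, Hudi invokes the payload’s merge logic during upsert, so the read-then-union step against the target table can be skipped entirely.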
Will give this a try. Thanks all for helping out.