[SUPPORT] Read record using index
Describe the problem you faced
I would like to read records from a Hudi table using the record key, to avoid having to scan the entire table.
I’ve read through the examples on how to query a Hudi table, and the Spark Datasource docs mention read(keys),
but it is unclear how to apply this from PySpark.
What I am doing is reading data from a source table (non-Hudi), transforming it, and writing it to a target Hudi table. Sometimes this involves updating existing records in the target, and the merge logic is non-trivial. So the approach I am taking is:
- Read new rows from source => df1
- Read rows to be updated from target (this is where reading by record key would help; see the sketch after this list) => df2
- Union df1 & df2, transform the data => transformed_df
- Upsert target using transformed_df
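A minimal PySpark sketch of this flow, assuming illustration-only paths, a record key field `id`, and a precombine field `updated_at` (none of these come from the issue). Hudi 0.5.x exposes no key-based point read to PySpark; filtering a snapshot read on the `_hoodie_record_key` metadata column is a common workaround, though it does not avoid the scan entirely:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and keys, for illustration only.
source_path = "s3://my-bucket/source_table"
target_path = "s3://my-bucket/hudi/target_table"
keys_to_update = ["id-001", "id-002"]

# Step 1: read new rows from the non-Hudi source => df1
df1 = spark.read.parquet(source_path)

# Step 2: snapshot-read the Hudi target and filter on the
# _hoodie_record_key metadata column => df2. This still plans a scan,
# though Spark may prune files via the pushed-down predicate.
df2 = (
    spark.read.format("org.apache.hudi")
    .load(target_path + "/*/*")  # path glob is needed on 0.5.x snapshot reads
    .where(col("_hoodie_record_key").isin(keys_to_update))
)
# Drop Hudi metadata columns so df1 and df2 schemas line up for the union.
df2 = df2.drop(*[c for c in df2.columns if c.startswith("_hoodie_")])

# Steps 3-4: union, apply the merge/transform logic, then upsert.
transformed_df = df1.unionByName(df2)  # plus whatever transformation applies

hudi_options = {
    "hoodie.table.name": "target_table",
    "hoodie.datasource.write.recordkey.field": "id",           # assumed key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed
    "hoodie.datasource.write.operation": "upsert",
}
(
    transformed_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(target_path)
)
```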
Expected behavior
Read from Hudi table using record keys.
Environment Description
- Hudi version : 0.5.3
- Spark version : 2.4.3 (using AWS Glue 2.0, PySpark)
- Hive version : AWS Glue Catalog
- Hadoop version :
- Storage (HDFS/S3/GCS…) :
- Running on Docker? (yes/no) :
Top GitHub Comments
@calleo Hudi allows you to write custom merge logic at the record level, so you don’t have to read the target table. Instead, you can just provide the input from the non-Hudi source table, define your merge logic, and let Hudi merge the incoming and on-disk data using the logic you have defined.
One way to do this is to implement the HoodieRecordPayload interface in Scala or Java, and either bundle it with the Hudi jars or drop it onto your Spark classpath.
You will also need to configure it as the custom merge logic class; the relevant config is defined here -> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100
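For a PySpark user, wiring in the compiled payload comes down to configuration. A hedged sketch, assuming a hypothetical class com.example.MyMergePayload already compiled and on the Glue/Spark classpath (e.g. via Glue’s --extra-jars job parameter); the Spark datasource exposes the payload class setting as hoodie.datasource.write.payload.class, which maps to the hoodie.compaction.payload.class config linked above:

```python
# Hedged sketch: com.example.MyMergePayload is hypothetical; the class must
# implement org.apache.hudi.common.model.HoodieRecordPayload and be present
# on the Spark/Glue classpath.
hudi_options = {
    "hoodie.table.name": "target_table",
    "hoodie.datasource.write.recordkey.field": "id",           # assumed key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # assumed
    "hoodie.datasource.write.operation": "upsert",
    # Custom merge logic (HoodieRecordPayload implementation):
    "hoodie.datasource.write.payload.class": "com.example.MyMergePayload",
}

(
    df1.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/target_table")  # hypothetical path
)
```

With this in place, Hudi invokes the payload’s merge logic during upsert, so the read-then-union step against the target table can be skipped entirely.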
Will give this a try. Thanks all for helping out.