[SUPPORT] Read record using index

Describe the problem you faced

I would like to read records from a Hudi table using the record key, to avoid having to scan the entire table.

I’ve read through the examples on how to query a Hudi table, and the Spark Datasource section mentions read(keys), but it is unclear how to apply this when using PySpark.

What I am doing is reading data from a source table (non-Hudi), then transforming it and writing it to a target Hudi table. Sometimes this involves updating existing records in the target, but the merge logic is non-trivial. So the approach I am taking is:

  1. Read new rows from source => df1
  2. Read rows to be updated from target (this is where reading by record key would help) => df2
  3. Union df1 & df2, transform the data => transformed_df
  4. Upsert target using transformed_df
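
For illustration, step 2 could be sketched in PySpark as a semi-join on `_hoodie_record_key`, the metadata column Hudi stores with every row. This is only a sketch under assumptions: the function name `rows_to_update` is hypothetical, the record key is assumed to equal the string value of a single source column (as with a simple key generator), and the join is a plain filter rather than an index lookup, so Spark may still scan the table's files.

```python
# Metadata columns that Hudi adds to each row all share this prefix.
HOODIE_META_PREFIX = "_hoodie_"
HOODIE_RECORD_KEY = "_hoodie_record_key"


def rows_to_update(target_df, source_df, key_column):
    """Return target rows whose record key appears in source_df[key_column].

    Hypothetical helper: assumes the Hudi record key is the plain value of
    key_column. The Hudi metadata columns are dropped from the result so it
    can be unioned with the new source rows (step 3).
    """
    # One-column DataFrame of distinct keys, named like the metadata column.
    keys_df = (
        source_df.select(key_column)
        .withColumnRenamed(key_column, HOODIE_RECORD_KEY)
        .distinct()
    )
    # left_semi keeps only matching target rows and adds no columns.
    matched = target_df.join(keys_df, on=HOODIE_RECORD_KEY, how="left_semi")
    meta_cols = [c for c in matched.columns if c.startswith(HOODIE_META_PREFIX)]
    return matched.drop(*meta_cols)
```

Here `target_df` would typically come from `spark.read.format("hudi").load(...)`, with a load path that depends on how the target table is partitioned.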

Expected behavior

Read from Hudi table using record keys.

Environment Description

  • Hudi version : 0.5.3

  • Spark version : 2.4.3 (using AWS Glue 2.0, PySpark)

  • Hive version : AWS Glue Catalog

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) :

  • Running on Docker? (yes/no) :

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
n3nash commented, May 24, 2021

@calleo Hudi allows you to write custom merge logic at the record level, so you don’t have to read the target table. Instead, you can provide just the input from the source non-Hudi table, define your merge logic, and let Hudi merge the incoming and on-disk data using the logic you have defined.

One way to do this is to implement HoodieRecordPayload in Scala or Java, then either package it together with the Hudi bundle jars or add it to your Spark classpath.

You will also need to set this as the custom merge logic class here -> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L100
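
From PySpark, the payload class is referenced only by name in the write options. The following is a rough sketch, not a confirmed recipe: the helper names and the class name `com.example.MyMergePayload` are hypothetical, and it assumes the Spark datasource option `hoodie.datasource.write.payload.class` is the setting that applies to your Hudi version.

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        partition_field, payload_class):
    """Build Hudi write options for an upsert with a custom payload class.

    payload_class is the fully qualified name of a HoodieRecordPayload
    implementation that must already be on the Spark classpath.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Assumption: this option selects the custom merge logic on write.
        "hoodie.datasource.write.payload.class": payload_class,
    }


def upsert(df, base_path, options):
    """Append-mode write; Hudi decides insert vs. update per record key."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

With this approach only the transformed source rows are passed to `upsert`; merging with the rows already on disk happens inside the payload class.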

0 reactions
calleo commented, Jun 12, 2021

Will give this a try. Thanks all for helping out.
