[SUPPORT] - AWS EMR and Glue Catalog


After loading a Hudi table into the AWS Glue Data Catalog and then applying a data update via Spark, reading the table again from Spark returns the full history of the record rather than only its latest version.

How can I make the query return only the most recently updated data?

Steps to reproduce the behavior:

  1. Load Data:

hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'period',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'period',
    'hoodie.datasource.hive_sync.support_timestamp': 'true'
}

CLIENT.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3a://' + bucket_name + '/' + table_name)

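  2. Read and update one row in the Hudi table:

from pyspark.sql.functions import lit, when

# Read the single row to update through the catalog table
client = spark.sql("select * from table_name where id=59")
# Set cod_estado to '1' where it is null, otherwise reset it to null
updateDF = client.withColumn("cod_estado", when(client.cod_estado.isNull(), lit('1')).otherwise(lit(None)))

updateDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .mode('append') \
    .save('s3a://' + bucket_name + '/' + table_name)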

  3. Query in Athena --> OK

[Image: Hudi_1]

  4. Read Parquet --> OK

[Image: Hudi_2]

  5. Query in EMR from the Glue Catalog --> NOK

[Image: Hudi_3]

Expected behavior

A query in EMR through the Glue Catalog should return only the latest version of the data.
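To make the expected result concrete, here is a minimal plain-Python sketch of "one row per record key, keeping the row with the highest precombine value". It only illustrates the outcome the query should show; it is not how Hudi implements upserts. The sample rows, their timestamps, and the tie-breaking rule (a later row wins on equal precombine values) are assumptions made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Sequence


def latest_per_key(
    rows: Iterable[Dict[str, Any]],
    key_fields: Sequence[str],
    precombine_field: str,
) -> List[Dict[str, Any]]:
    """Keep one row per record key: the one with the largest precombine value."""
    latest: Dict[tuple, Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        current = latest.get(key)
        # On ties the later row wins (an assumption for this sketch).
        if current is None or row[precombine_field] >= current[precombine_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical history: id=59 was written twice, id=60 once.
history = [
    {"id": 59, "cod_estado": None, "last_update_time": "2022-01-01 00:00:00"},
    {"id": 60, "cod_estado": "1", "last_update_time": "2022-01-01 00:00:00"},
    {"id": 59, "cod_estado": "1", "last_update_time": "2022-01-02 00:00:00"},
]

result = latest_per_key(history, key_fields=["id"], precombine_field="last_update_time")
for row in result:
    print(row)
```

A query that returns both rows for id=59 is showing the file history; the expected output has exactly one row per id.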

Environment Description

  • EMR: emr-6.4.0

  • Hudi version: 0.8.0-amzn-0

  • Spark version: 3.1.2

  • Hive version: 3.1.2

  • Hadoop version: 3.2.1

  • Storage (HDFS/S3/GCS…): S3

  • Running on Docker? (yes/no): no

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

JosefinaArayaTapia commented, May 18, 2022

Hi @xushiyan

I raised the case with AWS Support, and they sent me the following configuration, which solved my problem. I am also now using EMR 6.4.0.

# New options: the change here is to use ComplexKeyGenerator instead of SimpleKeyGenerator, with more than one column in the record key field

hudiOptions = {
'hoodie.datasource.write.precombine.field':'last_update_time',
'hoodie.datasource.write.recordkey.field': 'id,creation_date', 
'hoodie.table.name': 'newhuditest0439', 
'hoodie.datasource.hive_sync.mode':'hms', 
'hoodie.datasource.write.hive_style_partitioning':'true', 
'hoodie.compact.inline.max.delta.commits':1, 
'hoodie.compact.inline.trigger.strategy':'NUM_COMMITS', 
'hoodie.datasource.compaction.async.enable':'false', 
'hoodie.datasource.write.table.type':'COPY_ON_WRITE', 
'hoodie.index.type':'GLOBAL_BLOOM', 
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor', 
'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator', 
'hoodie.bloom.index.filter.type':'DYNAMIC_V0', 
'hoodie.bloom.index.update.partition.path': 'false', 
'hoodie.datasource.hive_sync.table':'newhuditest0439', 
'hoodie.datasource.hive_sync.enable':'true', 
'hoodie.datasource.write.partitionpath.field':'creation_date', 
'hoodie.datasource.hive_sync.partition_fields':'creation_date', 
'hoodie.datasource.hive_sync.database':'default', 
'hoodie.datasource.hive_sync.support_timestamp': 'true'
} 
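The key-generator change described above can be sketched as a small helper that picks the generator class from the number of record key columns. This is a hedged illustration, not the configuration AWS Support supplied: the function name `build_hudi_options` and its parameters are hypothetical, and it covers only a subset of the option keys that appear in this thread.

```python
from typing import Dict, Sequence


def build_hudi_options(
    table_name: str,
    database_name: str,
    record_key_fields: Sequence[str],
    partition_field: str,
    precombine_field: str,
) -> Dict[str, str]:
    """Build a Hudi write-options dict (hypothetical helper for illustration)."""
    if not record_key_fields:
        raise ValueError("at least one record key field is required")
    # More than one key column needs ComplexKeyGenerator;
    # a single column can use SimpleKeyGenerator.
    if len(record_key_fields) > 1:
        keygen = "org.apache.hudi.keygen.ComplexKeyGenerator"
    else:
        keygen = "org.apache.hudi.keygen.SimpleKeyGenerator"
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": ",".join(record_key_fields),
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.keygenerator.class": keygen,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database_name,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.support_timestamp": "true",
    }


options = build_hudi_options(
    table_name="newhuditest0439",
    database_name="default",
    record_key_fields=["id", "creation_date"],
    partition_field="creation_date",
    precombine_field="last_update_time",
)
print(options["hoodie.datasource.write.keygenerator.class"])
```

The resulting dict can be passed to the writer with `.options(**options)`, in the same way as `hudiOptions` in the original post.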

Gatsby-Lee commented, Sep 2, 2022

For anyone who gets here:

If you have this issue, you can find what you need at this link: https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog

I tested this on EMR on EKS (emr 6.7) with Hudi 0.10.1.
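The thread does not pin down why the catalog read returned old file versions. One thing that may be worth checking, as an assumption rather than a confirmed cause here, is the Spark session configuration: the Hudi and Amazon EMR documentation describe setting `spark.sql.hive.convertMetastoreParquet=false` (together with the Kryo serializer) when querying Hudi tables through a Hive-compatible metastore from Spark SQL. A minimal sketch follows; the helper names `hudi_read_conf` and `apply_conf` are hypothetical.

```python
from typing import Any, Dict


def hudi_read_conf() -> Dict[str, str]:
    """Session settings described in the Hudi/EMR docs for Spark SQL reads."""
    return {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Stops Spark from using its native Parquet reader for metastore tables.
        "spark.sql.hive.convertMetastoreParquet": "false",
    }


def apply_conf(builder: Any, conf: Dict[str, str]) -> Any:
    """Apply each key/value to a builder that exposes .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage (requires pyspark and a cluster; not executed in this sketch):
# from pyspark.sql import SparkSession
# spark = apply_conf(
#     SparkSession.builder.enableHiveSupport(), hudi_read_conf()
# ).getOrCreate()
# spark.sql("select * from table_name where id=59").show()
```

`apply_conf` relies only on the builder's `.config(key, value)` method, so the same dict can also be turned into `--conf` flags for `spark-submit`.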
