[SUPPORT] - AWS EMR and Glue Catalog

After loading a Hudi table that is registered in the AWS Glue Data Catalog and then sending a data update via Spark, reading the table again from Spark returns the full history of the updated record instead of only its latest version.

How can I make the query return only the most recently updated data?

Steps to reproduce the behavior:

Load Data:

hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'period',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'period',
    'hoodie.datasource.hive_sync.support_timestamp': 'true'
}

CLIENT.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3a://' + bucket_name + '/' + table_name)

Read and update one row in the Hudi table:

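from pyspark.sql.functions import lit, when

client = spark.sql("select * from table_name where id=59")
# Set cod_estado to '1' where it is null; otherwise reset it to null.
updateDF = client.withColumn("cod_estado", when(client.cod_estado.isNull(), lit('1')).otherwise(lit(None)))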

updateDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .mode('append') \
    .save('s3a://' + bucket_name + '/' + table_name)

Query in Athena --> OK

[Image: Hudi_1]

Read Parquet --> OK

[Image: Hudi_2]

Query in EMR from the Glue Catalog --> not OK

[Image: Hudi_3]

Expected behavior

A query in EMR through the Glue Catalog should show only the latest version of each record.
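The expected result can be stated as a small rule: among all versions of a record that share the same record key, only the one with the largest precombine value should be visible. The sketch below is plain Python over dictionaries, not Hudi code. The helper name `latest_per_key` and the sample rows are made up for illustration; only the field names `id`, `cod_estado` and `last_update_time` come from the snippets above.

```python
from typing import Dict, Iterable, List


def latest_per_key(rows: Iterable[dict], key_field: str, precombine_field: str) -> List[dict]:
    """Keep, for each key, the row with the largest precombine value.

    Ties go to the row seen later in iteration order; that tie-break is a
    choice made for this sketch, not a statement about Hudi's behavior.
    """
    latest: Dict[object, dict] = {}
    for row in rows:
        key = row[key_field]
        current = latest.get(key)
        if current is None or row[precombine_field] >= current[precombine_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample: two versions of id=59, similar to what the faulty read returns.
history = [
    {"id": 59, "cod_estado": None, "last_update_time": "2022-01-01 10:00:00"},
    {"id": 59, "cod_estado": "1", "last_update_time": "2022-01-02 09:30:00"},
    {"id": 60, "cod_estado": "2", "last_update_time": "2022-01-01 10:00:00"},
]

# The desired view: one row per id, carrying the most recent update.
snapshot = latest_per_key(history, "id", "last_update_time")
```

In this sample, `history` has three rows while `snapshot` has two, one for each `id`, which is the shape the EMR query is expected to return.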

Environment Description

EMR: emr-6.4.0
Hudi version: 0.8.0-amzn-0
Spark version: 3.1.2
Hive version: 3.1.2
Hadoop version: 3.2.1
Storage (HDFS/S3/GCS…): S3
Running on Docker? (yes/no): no

Top GitHub Comments

JosefinaArayaTapia commented, May 18, 2022

Hi @xushiyan

I raised the case with AWS Support, and they sent me the following configuration, which solved my problem. I am also now using EMR 6.4.0.

# New options: the change here is to use ComplexKeyGenerator instead of SimpleKeyGenerator, with more than one column in the record key field

hudiOptions = {
'hoodie.datasource.write.precombine.field':'last_update_time',
'hoodie.datasource.write.recordkey.field': 'id,creation_date', 
'hoodie.table.name': 'newhuditest0439', 
'hoodie.datasource.hive_sync.mode':'hms', 
'hoodie.datasource.write.hive_style_partitioning':'true', 
'hoodie.compact.inline.max.delta.commits':1, 
'hoodie.compact.inline.trigger.strategy':'NUM_COMMITS', 
'hoodie.datasource.compaction.async.enable':'false', 
'hoodie.datasource.write.table.type':'COPY_ON_WRITE', 
'hoodie.index.type':'GLOBAL_BLOOM', 
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor', 
'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.keygen.ComplexKeyGenerator', 
'hoodie.bloom.index.filter.type':'DYNAMIC_V0', 
'hoodie.bloom.index.update.partition.path': 'false', 
'hoodie.datasource.hive_sync.table':'newhuditest0439', 
'hoodie.datasource.hive_sync.enable':'true', 
'hoodie.datasource.write.partitionpath.field':'creation_date', 
'hoodie.datasource.hive_sync.partition_fields':'creation_date', 
'hoodie.datasource.hive_sync.database':'default', 
'hoodie.datasource.hive_sync.support_timestamp': 'true'
}
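For readers unfamiliar with the two key generators named in that comment: as commonly described, SimpleKeyGenerator takes the record key from a single field's value, while ComplexKeyGenerator builds it from several `field:value` pairs joined by commas. The following is a plain-Python approximation of those formats, not Hudi's implementation. The function names and the sample record are hypothetical, and handling of null or empty values is left out.

```python
from typing import List


def simple_record_key(record: dict, field: str) -> str:
    """Approximates a single-field record key: just the field's value."""
    return str(record[field])


def complex_record_key(record: dict, fields: List[str]) -> str:
    """Approximates a multi-field record key: 'field:value' pairs joined by commas."""
    return ",".join(f"{name}:{record[name]}" for name in fields)


def hive_style_partition_path(record: dict, field: str) -> str:
    """Approximates a hive-style partition path segment: 'field=value'."""
    return f"{field}={record[field]}"


# Hypothetical record; the field names follow the options above.
record = {
    "id": 59,
    "creation_date": "2022-05-18",
    "last_update_time": "2022-05-18 12:00:00",
}

# 'hoodie.datasource.write.recordkey.field' is a comma-separated list.
record_key_fields = "id,creation_date".split(",")

key = complex_record_key(record, record_key_fields)
path = hive_style_partition_path(record, "creation_date")
```

With this sample, `key` is `id:59,creation_date:2022-05-18` and `path` is `creation_date=2022-05-18`, whereas the single-field key for `id` would be just `59`.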

Gatsby-Lee commented, Sep 2, 2022

For anyone who gets here: if you have this issue, you can find what you need at this link: https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog

I tested this on EMR on EKS (EMR 6.7) with Hudi 0.10.1.