[SUPPORT] - AWS EMR and Glue Catalog
When a Hudi table is registered in the AWS Glue Data Catalog and then updated via Spark, reading the table again from Spark returns the history of the data instead of only the latest version of each record.
How can I make the query return only the most recently updated data?
Steps to reproduce the behavior:
- Load Data:
hudiOptions = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'period',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_fields': 'period',
    'hoodie.datasource.hive_sync.support_timestamp': 'true'
}
CLIENT.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3a://' + bucket_name + '/' + table_name)
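- Read and update one row in the Hudi table:
from pyspark.sql.functions import when, lit
# Read the row through the catalog table
client = spark.sql("select * from table_name where id=59")
# Flip cod_estado: set '1' where it is null, otherwise set it to null
updateDF = client.withColumn("cod_estado", when(client.cod_estado.isNull(), lit('1')).otherwise(lit(None)))
# Upsert the modified row back into the same Hudi table
updateDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .options(**hudiOptions) \
    .mode('append') \
    .save('s3a://' + bucket_name + '/' + table_name)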
- Query in Athena -> OK
- Read Parquet -> OK
- Query in EMR from the Glue Catalog -> NOK
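To confirm that the failing query really returns several versions of the same record, it can help to select Hudi's metadata columns next to the record key. The sketch below only builds the SQL text. The helper name `history_check_sql` and its parameters are made up for illustration, and it assumes the table is queryable as `database.table` through the catalog.

```python
def history_check_sql(database, table, record_key="id", key_value=59):
    """Build a query listing every physical version of one record.

    _hoodie_commit_time and _hoodie_file_name are metadata columns that
    Hudi adds to each row; more than one row for the same key means the
    reader is picking up older file versions too.
    """
    return (
        f"SELECT {record_key}, _hoodie_commit_time, _hoodie_file_name "
        f"FROM {database}.{table} "
        f"WHERE {record_key} = {key_value} "
        f"ORDER BY _hoodie_commit_time"
    )


# Usage on the cluster (assumes an active SparkSession named `spark`):
# spark.sql(history_check_sql(database_name, table_name)).show(truncate=False)
```

If this shows the same `id` with two different commit times, the query is reading old Parquet files as well as the latest ones.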
Expected behavior
A query in EMR through the Glue Catalog should return only the latest data.
Environment Description
- EMR: emr-6.4.0
- Hudi version: 0.8.0-amzn-0
- Spark version: 3.1.2
- Hive version: 3.1.2
- Hadoop version: 3.2.1
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): no
Top GitHub Comments
Hi @xushiyan
I raised the case with AWS Support, and they sent me the following configuration, which solved my problem. I am also now using EMR 6.4.0.
For anyone who gets here:
if you have this issue, you can find what you need at this link: https://aws.github.io/aws-emr-containers-best-practices/metastore-integrations/docs/aws-glue/#sync-hudi-table-with-aws-glue-catalog
I tested this on EMR on EKS (EMR 6.7) with Hudi 0.10.1.
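For readers who want a starting point, here is a minimal sketch of a Spark setting commonly associated with this symptom. It is not confirmed to be what AWS Support sent. It assumes the cause is Spark's built-in Parquet reader scanning every file under the table path. Setting `spark.sql.hive.convertMetastoreParquet` to `false` makes Spark read the catalog table through the Hive input format registered by Hudi's sync, which should return only the latest file versions. The names `HUDI_READ_CONF` and `build_session` are made up for this example.

```python
# Spark settings to apply before reading the Hudi table via the catalog.
HUDI_READ_CONF = {
    # Do not swap the table's Hive input format for Spark's native Parquet
    # reader, which is unaware of Hudi's file versioning.
    "spark.sql.hive.convertMetastoreParquet": "false",
    # Hudi's Spark integration expects Kryo serialization.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def build_session(app_name="hudi-glue-read"):
    """Create a Hive-enabled SparkSession with the settings above."""
    # Imported lazily so the config dict can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in HUDI_READ_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage on the cluster:
# spark = build_session()
# spark.sql("select * from table_name where id=59").show()
```

The same keys can also be passed as `--conf` options to `spark-submit` instead of being set in code.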