[SUPPORT] When querying a Hudi table in Hive, there are duplicated records.
Describe the problem you faced
When querying a Hudi table in Hive, duplicated records are returned.
The Hudi table was created by Flink.
To Reproduce
Steps to reproduce the behavior:
1. Submit a Flink job: flink-sql-client -f mysql_table_sink.sql
The SQL file content:
create table `mysql_table_kafka` (
`id` bigint,
`create_time` timestamp,
`update_time` timestamp
)
with (
'connector' = 'kafka',
'topic' = 'mysql_table_kafka',
'properties.bootstrap.servers' = 'x.x.x.x:9092,x.x.x.x:9092,x.x.x.x:9092',
'properties.group.id' = 'mysql_table_cg',
'format' = 'canal-json',
'scan.startup.mode' = 'latest-offset'
);
create table `mysql_table_sink_new` (
`id` bigint,
`create_time` bigint,
`update_time` bigint,
`dt` varchar(20)
) partitioned by (`dt`)
with(
'connector' = 'hudi'
,'path' = 'hdfs://nameservice1/hudi/mysql_table_sink_new'
,'hoodie.datasource.write.recordkey.field' = 'id'
,'write.precombine.field' = 'create_time'
,'write.tasks' = '1'
,'compaction.tasks' = '1'
,'read.streaming.enabled' = 'true'
,'read.streaming.check-interval' = '60'
,'table.type' = 'MERGE_ON_READ'
,'compaction.async.enabled' = 'true'
,'compaction.trigger.strategy' = 'num_commits'
,'compaction.delta_commits' = '2'
,'hive_sync.enable' = 'true'
,'hive_sync.mode' = 'hms'
,'hive_sync.metastore.uris' = 'thrift://x.x.x.x:9083'
,'hive_sync.jdbc_url' = 'jdbc:hive2://x.x.x.x:10000'
,'hive_sync.table' = 'mysql_table_sink_new'
,'hive_sync.db' = 'hv_ods'
);
insert into mysql_table_sink_new select
id
,unix_timestamp(date_format(create_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,unix_timestamp(date_format(update_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,date_format(create_time, 'yyyyMMdd') as dt from mysql_table_kafka;
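Since 'hive_sync.enable' is 'true' and the table type is MERGE_ON_READ, hive sync registers two views of this table in the metastore: a read-optimized view (suffix _ro, base parquet files only) and a real-time view (suffix _rt, which merges base and log files on read). A quick way to confirm both were registered (a hedged sketch, assuming the wildcard form of SHOW TABLES works on this Hive version):
show tables in hv_ods 'mysql_table_sink_new*';
-- expected: mysql_table_sink_new_ro  (read-optimized view)
--           mysql_table_sink_new_rt  (real-time / snapshot view)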
2. Query in Beeline:
select
m
,sum(case when cnt=1 then 1 else 0 end) as one_cnt
,sum(case when cnt=2 then 1 else 0 end) as two_cnt
,sum(case when cnt=3 then 1 else 0 end) as three_cnt
,sum(case when cnt=4 then 1 else 0 end) as four_cnt
from(
select id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm') as m,count(1) as cnt
from mysql_table_sink_new_ro
where dt='20220117'
group by id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm')
)t
group by m;
+-------------------+----------+----------+------------+-----------+--+
| m | one_cnt | two_cnt | three_cnt | four_cnt |
+-------------------+----------+----------+------------+-----------+--+
| 2022-01-17 16:07 | 0 | 0 | 0 | 5 |
| 2022-01-17 16:08 | 0 | 0 | 0 | 273 |
| 2022-01-17 16:09 | 0 | 0 | 37 | 241 |
| 2022-01-17 16:10 | 0 | 0 | 340 | 0 |
| 2022-01-17 16:11 | 0 | 21 | 239 | 0 |
| 2022-01-17 16:12 | 0 | 253 | 0 | 0 |
| 2022-01-17 16:13 | 38 | 261 | 0 | 0 |
| 2022-01-17 16:14 | 283 | 0 | 0 | 0 |
| 2022-01-17 16:15 | 247 | 0 | 0 | 0 |
+-------------------+----------+----------+------------+-----------+--+
select id,count(1) from mysql_table_sink_new_ro group by id having count(1)>1 limit 10;
+------------+------+--+
| id | _c1 |
+------------+------+--+
| 413588661 | 5 |
| 413588664 | 5 |
| 413588667 | 5 |
| 413588670 | 5 |
| 413588673 | 5 |
| 413588676 | 5 |
| 413588679 | 5 |
| 413588682 | 5 |
| 413588685 | 5 |
| 413588688 | 5 |
+------------+------+--+
select `_hoodie_commit_time`,`_hoodie_commit_seqno`,`_hoodie_record_key`,`_hoodie_partition_path`,`_hoodie_file_name`,id,create_time,update_time from mysql_table_sink_new_ro where id=413588661;
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | create_time | update_time |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
| 20220117160954 | 20220117160954_0_482 | 413588661 | 20220117 | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661 | 1642406878000 | 1642406878000 |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
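All five rows carry identical Hudi metadata (same commit time, sequence number, and file name), which suggests a single record was written and the read path is returning it multiple times. Two diagnostic queries that could narrow this down (a sketch, not part of the original report; INPUT__FILE__NAME is Hive's built-in virtual column holding the physical file each row was read from):
select INPUT__FILE__NAME, count(1)
from mysql_table_sink_new_ro
where dt='20220117' and id=413588661
group by INPUT__FILE__NAME;
-- if the duplicates come from several parquet versions of the same file
-- group, the reader is scanning old file slices instead of only the latest
select id, count(1)
from mysql_table_sink_new_rt
where dt='20220117'
group by id
having count(1) > 1
limit 10;
-- if the real-time (_rt) view shows no duplicates, the data is consistent
-- and the problem is in how the _ro view is read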
Expected behavior
Each record key should appear exactly once in the query results; upserts on the same key should not produce duplicated records.
Environment Description
- Hudi version : 0.10.0
- Spark version : xxx
- Hive version : 1.1.0-cdh5.13.3
- Hadoop version : 2.6.0-cdh5.13.3
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : no
- Flink version : 1.13.3
Top GitHub Comments
@ChangbingChen Sorry, I forgot one thing: before you use Hive to query the Hudi table, did you set the input format? E.g. set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
Do you have WeChat? We can communicate directly through WeChat.
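For reference, the check suggested in the comment above as a Beeline session (a sketch; the dedup query is the reporter's, re-run after setting the input format):
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
-- or: set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select id, count(1)
from mysql_table_sink_new_ro
where dt='20220117'
group by id
having count(1) > 1
limit 10;
-- should return no rows if the missing input format setting was the cause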
@xiarixiaoyao I raised an issue: https://issues.apache.org/jira/browse/HUDI-5155