
[SUPPORT] When querying a Hudi table in Hive, there are duplicate records.

See original GitHub issue

Describe the problem you faced

When querying a Hudi table in Hive, there are duplicate records.

The Hudi table was created by Flink.

To Reproduce

Steps to reproduce the behavior:

  1. Submit a Flink job: flink-sql-client -f mysql_table_sink.sql

The SQL file content:

create table `mysql_table_kafka` (
  `id` bigint,
  `create_time` timestamp,
  `update_time` timestamp
)
with (
'connector' = 'kafka',
'topic' = 'mysql_table_kafka',
'properties.bootstrap.servers' = 'x.x.x.x:9092,x.x.x.x:9092,x.x.x.x:9092',
'properties.group.id' = 'mysql_table_cg',
'format' = 'canal-json',
'scan.startup.mode' = 'latest-offset'
);

create table `mysql_table_sink_new` (
  `id` bigint,
  `create_time` bigint,
  `update_time` bigint,
  `dt` varchar(20)
) partitioned by (`dt`)
with(
'connector' = 'hudi'
,'path' = 'hdfs://nameservice1/hudi/mysql_table_sink_new'
,'hoodie.datasource.write.recordkey.field' = 'id'
,'write.precombine.field' = 'create_time'
,'write.tasks' = '1'
,'compaction.tasks' = '1'
,'read.streaming.enabled' = 'true'
,'read.streaming.check-interval' = '60'
,'table.type' = 'MERGE_ON_READ'
,'compaction.async.enabled' = 'true'
,'compaction.trigger.strategy' = 'num_commits'
,'compaction.delta_commits' = '2'
,'hive_sync.enable' = 'true'
,'hive_sync.mode' = 'hms'
,'hive_sync.metastore.uris' = 'thrift://x.x.x.x:9083'
,'hive_sync.jdbc_url' = 'jdbc:hive2://x.x.x.x:10000'
,'hive_sync.table' = 'mysql_table_sink_new'
,'hive_sync.db' = 'hv_ods'
);

insert into mysql_table_sink_new select
id
,unix_timestamp(date_format(create_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,unix_timestamp(date_format(update_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,date_format(create_time, 'yyyyMMdd') as dt from mysql_table_kafka;

  2. Query in beeline:

select
m
,sum(case when cnt=1 then 1 else 0 end) as one_cnt
,sum(case when cnt=2 then 1 else 0 end) as two_cnt
,sum(case when cnt=3 then 1 else 0 end) as three_cnt
,sum(case when cnt=4 then 1 else 0 end) as four_cnt
from(
select id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm') as m,count(1) as cnt
from mysql_table_sink_new_ro
where dt='20220117'
group by id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm')
)t
group by m;
+-------------------+----------+----------+------------+-----------+--+
|         m         | one_cnt  | two_cnt  | three_cnt  | four_cnt  |
+-------------------+----------+----------+------------+-----------+--+
| 2022-01-17 16:07  | 0        | 0        | 0          | 5         |
| 2022-01-17 16:08  | 0        | 0        | 0          | 273       |
| 2022-01-17 16:09  | 0        | 0        | 37         | 241       |
| 2022-01-17 16:10  | 0        | 0        | 340        | 0         |
| 2022-01-17 16:11  | 0        | 21       | 239        | 0         |
| 2022-01-17 16:12  | 0        | 253      | 0          | 0         |
| 2022-01-17 16:13  | 38       | 261      | 0          | 0         |
| 2022-01-17 16:14  | 283      | 0        | 0          | 0         |
| 2022-01-17 16:15  | 247      | 0        | 0          | 0         |
+-------------------+----------+----------+------------+-----------+--+
select id,count(1) from mysql_table_sink_new_ro group by id having count(1)>1 limit 10;
+------------+------+--+
|     id     | _c1  |
+------------+------+--+
| 413588661  | 5    |
| 413588664  | 5    |
| 413588667  | 5    |
| 413588670  | 5    |
| 413588673  | 5    |
| 413588676  | 5    |
| 413588679  | 5    |
| 413588682  | 5    |
| 413588685  | 5    |
| 413588688  | 5    |
+------------+------+--+
select `_hoodie_commit_time`,`_hoodie_commit_seqno`,`_hoodie_record_key`,`_hoodie_partition_path`,`_hoodie_file_name`,id,create_time,update_time from mysql_table_sink_new_ro where id='413588661';
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| _hoodie_commit_time  | _hoodie_commit_seqno  | _hoodie_record_key  | _hoodie_partition_path  |                 _hoodie_file_name                  |     id     |  create_time   |  update_time   |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
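One way to narrow down whether the duplication comes from the read path or from the data files themselves is to repeat the same lookup against the realtime view. This is a sketch only: mysql_table_sink_new_rt is assumed to be the companion realtime table that Hudi's hive sync typically registers alongside the _ro table for MERGE_ON_READ tables; only the _ro table appears in this issue.

select `_hoodie_commit_time`,`_hoodie_record_key`,id,create_time,update_time
from mysql_table_sink_new_rt
where id='413588661';

If the realtime view returns a single row per key while the _ro view returns several copies of the same row from the same file, the problem is more likely in how Hive reads the base files than in the data written by Flink.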

Expected behavior

Each record key should appear only once when the table is queried in Hive.

Environment Description

  • Hudi version : 0.10.0

  • Spark version : xxx

  • Hive version : 1.1.0-cdh5.13.3

  • Hadoop version : 2.6.0-cdh5.13.3

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

  • Flink version : 1.13.3

Additional context

None provided.

Stacktrace

None provided.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
xiarixiaoyao commented, Jan 18, 2022

@ChangbingChen Sorry, I forgot one thing: before you use Hive to query the Hudi table, did you set the input format? E.g. set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

If you have WeChat, we can communicate directly through WeChat.
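For reference, a minimal sketch of applying that suggestion in the beeline session before re-running the duplicate check (the two hive.input.format values are taken verbatim from the comment above; only one of them is needed):

-- set the input format first, then re-run the duplicate check
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
-- or, alternatively:
-- set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

select id, count(1) from mysql_table_sink_new_ro group by id having count(1) > 1 limit 10;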

0 reactions
wwli05 commented, Nov 3, 2022
Read more comments on GitHub >

Top Results From Across the Web

Getting duplicate records while querying Hudi table using ...
I have inserted a few records and then updated the same using Hudi Merge on Read. This will internally create new files under...
Read more >
FAQs - Apache Hudi
What versions of Hive/Spark/Hadoop are support by Hudi?​ ... If you don't want duplicate records either issue an upsert or consider ...
Read more >
Querying Data - Apache Hudi
Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained before. Once the table is synced...
Read more >
Querying Hudi Tables - Apache Hudi
Once the Hudi tables have been registered to the Hive metastore, it can be queried using the Spark-Hive integration. It supports all query...
Read more >
Troubleshooting - Apache Hudi
Off the bat, the following metadata is added to every record to help ... have duplicates AFTER ensuring the query is accessing the...
Read more >
