
[SUPPORT] When querying a Hudi table in Hive, there are duplicate records.

See original GitHub issue

Describe the problem you faced

When querying a Hudi table in Hive, there are duplicate records.

The Hudi table was created by Flink.

To Reproduce

Steps to reproduce the behavior:

  1. Submit a Flink job: flink-sql-client -f mysql_table_sink.sql

The SQL file content:

create table `mysql_table_kafka` (
  `id` bigint,
  `create_time` timestamp,
  `update_time` timestamp
)
with (
'connector' = 'kafka',
'topic' = 'mysql_table_kafka',
'properties.bootstrap.servers' = 'x.x.x.x:9092,x.x.x.x:9092,x.x.x.x:9092',
'properties.group.id' = 'mysql_table_cg',
'format' = 'canal-json',
'scan.startup.mode' = 'latest-offset'
);

create table `mysql_table_sink_new` (
  `id` bigint,
  `create_time` bigint,
  `update_time` bigint,
  `dt` varchar(20)
) partitioned by (`dt`)
with(
'connector' = 'hudi'
,'path' = 'hdfs://nameservice1/hudi/mysql_table_sink_new'
,'hoodie.datasource.write.recordkey.field' = 'id'
,'write.precombine.field' = 'create_time'
,'write.tasks' = '1'
,'compaction.tasks' = '1'
,'read.streaming.enabled' = 'true'
,'read.streaming.check-interval' = '60'
,'table.type' = 'MERGE_ON_READ'
,'compaction.async.enabled' = 'true'
,'compaction.trigger.strategy' = 'num_commits'
,'compaction.delta_commits' = '2'
,'hive_sync.enable' = 'true'
,'hive_sync.mode' = 'hms'
,'hive_sync.metastore.uris' = 'thrift://x.x.x.x:9083'
,'hive_sync.jdbc_url' = 'jdbc:hive2://x.x.x.x:10000'
,'hive_sync.table' = 'mysql_table_sink_new'
,'hive_sync.db' = 'hv_ods'
);

insert into mysql_table_sink_new select
id
,unix_timestamp(date_format(create_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,unix_timestamp(date_format(update_time, 'yyyy-MM-dd HH:mm:ss'))*1000
,date_format(create_time, 'yyyyMMdd') as dt from mysql_table_kafka;

  2. Query in beeline:

select
m
,sum(case when cnt=1 then 1 else 0 end) as one_cnt
,sum(case when cnt=2 then 1 else 0 end) as two_cnt
,sum(case when cnt=3 then 1 else 0 end) as three_cnt
,sum(case when cnt=4 then 1 else 0 end) as four_cnt
from(
select id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm') as m,count(1) as cnt
from mysql_table_sink_new_ro
where dt='20220117'
group by id,from_unixtime(cast(create_time/1000 as bigint), 'yyyy-MM-dd HH:mm')
)t
group by m;
+-------------------+----------+----------+------------+-----------+--+
|         m         | one_cnt  | two_cnt  | three_cnt  | four_cnt  |
+-------------------+----------+----------+------------+-----------+--+
| 2022-01-17 16:07  | 0        | 0        | 0          | 5         |
| 2022-01-17 16:08  | 0        | 0        | 0          | 273       |
| 2022-01-17 16:09  | 0        | 0        | 37         | 241       |
| 2022-01-17 16:10  | 0        | 0        | 340        | 0         |
| 2022-01-17 16:11  | 0        | 21       | 239        | 0         |
| 2022-01-17 16:12  | 0        | 253      | 0          | 0         |
| 2022-01-17 16:13  | 38       | 261      | 0          | 0         |
| 2022-01-17 16:14  | 283      | 0        | 0          | 0         |
| 2022-01-17 16:15  | 247      | 0        | 0          | 0         |
+-------------------+----------+----------+------------+-----------+--+
select id,count(1) from mysql_table_sink_new_ro group by id having count(1)>1 limit 10;
+------------+------+--+
|     id     | _c1  |
+------------+------+--+
| 413588661  | 5    |
| 413588664  | 5    |
| 413588667  | 5    |
| 413588670  | 5    |
| 413588673  | 5    |
| 413588676  | 5    |
| 413588679  | 5    |
| 413588682  | 5    |
| 413588685  | 5    |
| 413588688  | 5    |
+------------+------+--+
select `_hoodie_commit_time`,`_hoodie_commit_seqno`,`_hoodie_record_key`,`_hoodie_partition_path`,`_hoodie_file_name`,id,create_time,update_time from mysql_table_sink_new_ro where id='413588661';
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| _hoodie_commit_time  | _hoodie_commit_seqno  | _hoodie_record_key  | _hoodie_partition_path  |                 _hoodie_file_name                  |     id     |  create_time   |  update_time   |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
| 20220117160954       | 20220117160954_0_482  | 413588661           | 20220117                | fd45d59f-eef8-402d-a63c-6e7e5cfb5f63_0-1-0_20220117160954.parquet | 413588661  | 1642406878000  | 1642406878000  |
+----------------------+-----------------------+---------------------+-------------------------+----------------------------------------------------+------------+----------------+----------------+--+
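One way to narrow down whether the duplication comes from the read path or from the data files themselves is to repeat the same lookup against the realtime view. This is a sketch only: mysql_table_sink_new_rt is assumed to be the companion realtime table that Hudi's hive sync typically registers alongside the _ro table for MERGE_ON_READ tables; only the _ro table appears in this issue.

select `_hoodie_commit_time`,`_hoodie_record_key`,id,create_time,update_time
from mysql_table_sink_new_rt
where id='413588661';

If the realtime view returns a single row per key while the _ro view returns several copies of the same row from the same file, the problem is more likely in how Hive reads the base files than in the data written by Flink.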

Expected behavior

Each record key should appear only once when the table is queried in Hive.

Environment Description

  • Hudi version : 0.10.0

  • Spark version : xxx

  • Hive version : 1.1.0-cdh5.13.3

  • Hadoop version : 2.6.0-cdh5.13.3

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

  • Flink version : 1.13.3

Additional context

None provided.

Stacktrace

None provided.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
xiarixiaoyao commented, Jan 18, 2022

@ChangbingChen Sorry, I forgot one thing: before you use Hive to query the Hudi table, did you set the input format? E.g. set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, or set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat

If you have WeChat, we can communicate directly through WeChat.
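For reference, a minimal sketch of applying that suggestion in the beeline session before re-running the duplicate check (the two hive.input.format values are taken verbatim from the comment above; only one of them is needed):

-- set the input format first, then re-run the duplicate check
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
-- or, alternatively:
-- set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

select id, count(1) from mysql_table_sink_new_ro group by id having count(1) > 1 limit 10;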

0 reactions
wwli05 commented, Nov 3, 2022
Read more comments on GitHub >

Top Results From Across the Web

Getting duplicate records while querying Hudi table using ...
I have inserted a few records and then updated the same using Hudi Merge on Read. This will internally create new files under...
Read more >
FAQs - Apache Hudi
What versions of Hive/Spark/Hadoop are support by Hudi?​ ... If you don't want duplicate records either issue an upsert or consider ...
Read more >
Querying Data - Apache Hudi
Conceptually, Hudi stores data physically once on DFS, while providing 3 different ways of querying, as explained before. Once the table is synced...
Read more >
Querying Hudi Tables - Apache Hudi
Once the Hudi tables have been registered to the Hive metastore, it can be queried using the Spark-Hive integration. It supports all query...
Read more >
Troubleshooting - Apache Hudi
Off the bat, the following metadata is added to every record to help ... have duplicates AFTER ensuring the query is accessing the...
Read more >
