[SUPPORT] Hudi Sync did not add previous partitions

Describe the problem you faced

Hudi's HiveSyncTool did not add all of the missing partitions to Hive; it added only the newest partition.

Expected behavior

It should have added all the partitions.

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 2.4.4

  • Hive version : 3.2.1

  • Hadoop version : 3.1.1

  • Storage (HDFS/S3/GCS…) : Azure

  • Running on Docker? (yes/no) : K8s

Additional context

I run this sync job every day using cron. For some reason it did not run on the 27th and 28th. When I reran the job on the 29th, it added only the partition for the 29th and skipped the partitions for the 27th and 28th. The earlier partitions had already been created.

Stacktrace

  • Data and Partitions after day-29 run
Found 12 items
drwxr-xr-x   - root supergroup          0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/.hoodie
drwxr-xr-x   - root supergroup          0 2021-10-22 10:11 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-19
drwxr-xr-x   - root supergroup          0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-20
drwxr-xr-x   - root supergroup          0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-21
drwxr-xr-x   - root supergroup          0 2021-10-22 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-22
drwxr-xr-x   - root supergroup          0 2021-10-23 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-23
drwxr-xr-x   - root supergroup          0 2021-10-24 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-24
drwxr-xr-x   - root supergroup          0 2021-10-25 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-25
drwxr-xr-x   - root supergroup          0 2021-10-26 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-26
drwxr-xr-x   - root supergroup          0 2021-10-27 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-27
drwxr-xr-x   - root supergroup          0 2021-10-29 08:21 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-28
drwxr-xr-x   - root supergroup          0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-29

hive> show partitions dp_hmi_quectel_test_data_packet_v2;
OK
dt=2021-10-19
dt=2021-10-20
dt=2021-10-21
dt=2021-10-22
dt=2021-10-23
dt=2021-10-24
dt=2021-10-25
dt=2021-10-26
dt=2021-10-29
  • Day-29 Job logs
2021-10-29 10:46:02,513 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(190)) - Last commit time synced was found to be 20211026005933
2021-10-29 10:46:02,513 INFO  [main] common.AbstractSyncHoodieClient (AbstractSyncHoodieClient.java:getPartitionsWrittenToSince(162)) - Last commit time synced is 20211026005933, Getting commits since then
2021-10-29 10:46:03,070 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(192)) - Storage partitions scan complete. Found 1
2021-10-29 10:46:03,070 INFO  [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(895)) - 0: get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
2021-10-29 10:46:03,071 INFO  [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(347)) - ugi=root	ip=unknown-ip-addr	cmd=get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
2021-10-29 10:46:03,104 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncPartitions(333)) - New Partitions [dt=2021-10-29]
2021-10-29 10:46:03,104 INFO  [main] ddl.HMSDDLExecutor (HMSDDLExecutor.java:addPartitionsToTable(181)) - Adding partitions 1 to table dp_hmi_quectel_imu_data_packet_v2
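
To make the gap concrete, here is a minimal Python sketch that compares the partition directories from the storage listing with the output of `show partitions`. The two lists are transcribed from the output above; the helper name `missing_partitions` is made up for illustration and is not part of Hudi or Hive.

```python
# Partition directories on storage, taken from the listing above
# (.hoodie is Hudi metadata, not a partition, so it is excluded).
storage_partitions = [f"dt=2021-10-{day}" for day in range(19, 30)]

# Partitions returned by `show partitions` in Hive, taken from the output above.
hive_partitions = [f"dt=2021-10-{day}" for day in range(19, 27)] + ["dt=2021-10-29"]


def missing_partitions(on_storage, in_hive):
    """Return partitions that exist on storage but are not registered in Hive."""
    return sorted(set(on_storage) - set(in_hive))


missing = missing_partitions(storage_partitions, hive_partitions)
print(missing)
```

With the data above, the difference is exactly the two partitions for the 27th and 28th.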

Is this because the older commits were archived, so the sync job could not find the commit files for the 27th and 28th and therefore could not detect their partition paths?
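
The hypothesis can be illustrated with a toy model in Python. This is only a sketch of the suspected behavior, not Hudi's actual implementation: the `Commit` class, the `archived` flag, and the function below are hypothetical, and the instant times for the 27th to the 29th are invented. Only the last synced instant, 20211026005933, comes from the job log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: str            # instant time in yyyyMMddHHmmss form, as in the log
    partitions: tuple       # partition paths this commit wrote to
    archived: bool = False  # True if the commit has left the active timeline


def partitions_written_since(timeline, last_synced):
    """Collect partitions from commits after last_synced, skipping archived commits."""
    found = set()
    for commit in timeline:
        if commit.archived:
            continue  # an archived commit is invisible to this incremental scan
        # Fixed-width timestamps compare correctly as plain strings.
        if commit.instant > last_synced:
            found.update(commit.partitions)
    return sorted(found)


LAST_SYNCED = "20211026005933"  # from the "Last commit time synced" log line

# Hypothetical timeline: every instant after LAST_SYNCED is invented for illustration.
timeline = [
    Commit("20211026005933", ("dt=2021-10-26",), archived=True),
    Commit("20211027184000", ("dt=2021-10-27",), archived=True),
    Commit("20211029082100", ("dt=2021-10-28",), archived=True),
    Commit("20211029104500", ("dt=2021-10-29",)),
]

print(partitions_written_since(timeline, LAST_SYNCED))
```

Under these assumptions the scan finds only dt=2021-10-29, which would be consistent with the "Storage partitions scan complete. Found 1" line in the log. If the same commits were still on the active timeline, the scan would also return the partitions for the 27th and 28th.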

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

nsivabalan commented, Dec 21, 2021 (1 reaction)

Closing the GitHub issue. We will work on fixing it.
