[SUPPORT] Hudi Sync did not add previous partitions
Describe the problem you faced
Hudi HiveSyncTool did not add some partitions to Hive; it added only the newest partition.
Expected behavior
It should have added all the partitions
Environment Description
- Hudi version : 0.9.0
- Spark version : 2.4.4
- Hive version : 3.2.1
- Hadoop version : 3.1.1
- Storage (HDFS/S3/GCS…) : Azure
- Running on Docker? (yes/no) : K8s
Additional context
I run this sync job every day using cron. For some reason it did not run on the 27th and 28th. When I reran the job on the 29th, it added only the partition for the 29th and left out the partitions for the 27th and 28th. The earlier partitions had already been created.
Stacktrace
- Data and Partitions after day-29 run
Found 12 items
drwxr-xr-x - root supergroup 0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/.hoodie
drwxr-xr-x - root supergroup 0 2021-10-22 10:11 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-19
drwxr-xr-x - root supergroup 0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-20
drwxr-xr-x - root supergroup 0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-21
drwxr-xr-x - root supergroup 0 2021-10-22 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-22
drwxr-xr-x - root supergroup 0 2021-10-23 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-23
drwxr-xr-x - root supergroup 0 2021-10-24 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-24
drwxr-xr-x - root supergroup 0 2021-10-25 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-25
drwxr-xr-x - root supergroup 0 2021-10-26 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-26
drwxr-xr-x - root supergroup 0 2021-10-27 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-27
drwxr-xr-x - root supergroup 0 2021-10-29 08:21 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-28
drwxr-xr-x - root supergroup 0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-29
hive> show partitions dp_hmi_quectel_test_data_packet_v2;
OK
dt=2021-10-19
dt=2021-10-20
dt=2021-10-21
dt=2021-10-22
dt=2021-10-23
dt=2021-10-24
dt=2021-10-25
dt=2021-10-26
dt=2021-10-29
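To make the gap explicit, here is a minimal sketch that compares the partition directories from the storage listing above with the `show partitions` output. The helper name `missing_partitions` is made up for illustration and is not part of Hudi or Hive; the two lists are transcribed from the output above.

```python
# Partition directories from the storage listing above (.hoodie excluded):
# dt=2021-10-19 through dt=2021-10-29.
storage_partitions = [f"dt=2021-10-{day}" for day in range(19, 30)]

# Partitions reported by `show partitions` above:
# dt=2021-10-19 through dt=2021-10-26, plus dt=2021-10-29.
hive_partitions = [f"dt=2021-10-{day}" for day in range(19, 27)] + ["dt=2021-10-29"]


def missing_partitions(on_storage, in_hive):
    """Hypothetical helper: partitions present on storage but not registered in Hive."""
    return sorted(set(on_storage) - set(in_hive))


missing = missing_partitions(storage_partitions, hive_partitions)
print(missing)
```

With the listings above, this reports the two partitions (27th and 28th) that the day-29 sync did not register.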
- Day-29 Job logs
2021-10-29 10:46:02,513 INFO [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(190)) - Last commit time synced was found to be 20211026005933
2021-10-29 10:46:02,513 INFO [main] common.AbstractSyncHoodieClient (AbstractSyncHoodieClient.java:getPartitionsWrittenToSince(162)) - Last commit time synced is 20211026005933, Getting commits since then
2021-10-29 10:46:03,070 INFO [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(192)) - Storage partitions scan complete. Found 1
2021-10-29 10:46:03,070 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(895)) - 0: get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
2021-10-29 10:46:03,071 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(347)) - ugi=root ip=unknown-ip-addr cmd=get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
2021-10-29 10:46:03,104 INFO [main] hive.HiveSyncTool (HiveSyncTool.java:syncPartitions(333)) - New Partitions [dt=2021-10-29]
2021-10-29 10:46:03,104 INFO [main] ddl.HMSDDLExecutor (HMSDDLExecutor.java:addPartitionsToTable(181)) - Adding partitions 1 to table dp_hmi_quectel_imu_data_packet_v2
Is this because the older commits were archived, so the sync job could not find the commit files from the 27th and 28th and therefore could not detect those partition paths?
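The hypothesis in this question can be illustrated with a toy model. This is a simplified sketch, not Hudi's actual implementation: it assumes the sync collects partitions only from commits that are newer than the last synced commit time and are still in the active timeline. The `Commit` class, the `partitions_written_since` function here, and the three commit instants are hypothetical; only the last synced time `20211026005933` comes from the job log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: str       # commit time in yyyyMMddHHmmss form
    partitions: tuple  # partition paths written by this commit


def partitions_written_since(active_timeline, last_synced):
    """Toy model: collect partitions from active commits newer than last_synced."""
    found = set()
    for commit in active_timeline:
        # Fixed-width timestamps compare correctly as strings.
        if commit.instant > last_synced:
            found.update(commit.partitions)
    return sorted(found)


# Last synced commit time, taken from the day-29 job log.
last_synced = "20211026005933"

# Hypothetical commits; the instants are made up for illustration.
all_commits = [
    Commit("20211027000000", ("dt=2021-10-27",)),
    Commit("20211028000000", ("dt=2021-10-28",)),
    Commit("20211029000000", ("dt=2021-10-29",)),
]

# If every commit is still in the active timeline, all three partitions are found.
with_full_timeline = partitions_written_since(all_commits, last_synced)

# If the two older commits were archived, only the newest partition is found.
after_archival = partitions_written_since(all_commits[-1:], last_synced)

print(with_full_timeline)
print(after_archival)
```

Under this assumption, archiving the commits for the 27th and 28th before the sync runs would leave a single detected partition, which matches the "Found 1" line in the log.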
Closing the GitHub issue. We will work on fixing it.
https://issues.apache.org/jira/browse/HUDI-3068