[SUPPORT] HiveSyncTool: missing partitions

Describe the problem you faced

We have some IoT data tables with a few thousand partitions each, typically laid out as deviceId/year/month/day. We do not sync to Hive on every commit, but at regular intervals. For one of these tables I added a few months of historic data for an additional set of devices, as opposed to the usual daily updates for the existing set. The subsequent sync with HiveSyncTool must have gone wrong, because not all of these partitions are present in Hive. Unfortunately I do not have the logs, so I cannot tell whether it failed or passed silently without detecting some partitions (I suspect the latter).

If I now run HiveSyncTool again, I just get e.g. "Last commit time synced is 20220802000054258, Getting commits since then", and that is exactly what it does: it picks up the partitions added since that commit, but the ones that were missed earlier are never added.

My current workaround is to drop the Hive table and rerun HiveSyncTool from scratch, which adds all the partitions.
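A less drastic alternative, which is not suggested anywhere in this thread and is untested against this setup, might be to register only the missing partitions by hand. The sketch below just builds the Hive `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` statement for one partition path. The table name, base path, and the partition column names (`deviceid`, `year`, `month`, `day`) are assumptions for illustration, and the values are assumed to contain no quote characters.

```python
def add_partition_ddl(table, base_path, partition_path,
                      columns=("deviceid", "year", "month", "day")):
    """Build a Hive DDL statement that registers one partition.

    `columns` are hypothetical partition column names; adjust them to the
    actual table definition. Values are quoted as strings and are assumed
    not to contain quote characters.
    """
    values = partition_path.strip("/").split("/")
    if len(values) != len(columns):
        raise ValueError(
            f"expected {len(columns)} path segments, got {len(values)}"
        )
    spec = ", ".join(f"{col}='{val}'" for col, val in zip(columns, values))
    location = f"{base_path.rstrip('/')}/{partition_path.strip('/')}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table name and storage location.
    print(add_partition_ddl("iot_db.iot_table",
                            "abfss://container@account.dfs.core.windows.net/iot",
                            "dev2/2022/05/01"))
```

The generated statement would still have to be run through a Hive client, and it does not touch the last commit time that HiveSyncTool tracks.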

Steps to reproduce the behavior:

  1. Have a dataset with a large number of deviceId/year/month/day partitions (MultiPartKeysValueExtractor) and sync it to Hive for the first time. All is fine, though it may take a long time.
  2. Add data to the existing partitions (new months/days will be added). Syncing to Hive still works.
  3. Add a large amount of data for devices that were not in the set before and sync again. In my case there are partitions for every new device, but many of the underlying date partitions are missing.
  4. Drop the Hive table and resync from scratch. All partitions are there.
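To confirm step 3 without dropping the table, one could compare the partition paths found on storage with what Hive reports. The following is a minimal sketch, assuming the storage listing yields relative paths such as `dev1/2022/08/01` and that Hive's `SHOW PARTITIONS` output has the usual `col=value/col=value` form. The column names in the example are assumptions, and how the two input lists are obtained is left out.

```python
def hive_partition_to_path(spec):
    """Turn 'deviceid=dev1/year=2022/month=08/day=01' into 'dev1/2022/08/01'."""
    return "/".join(part.split("=", 1)[1] for part in spec.strip().split("/"))


def find_missing_partitions(storage_paths, hive_specs):
    """Return storage partition paths that are not registered in Hive.

    storage_paths: relative partition paths listed from the table's base path.
    hive_specs:    lines as printed by SHOW PARTITIONS (blank lines ignored).
    """
    registered = {hive_partition_to_path(s) for s in hive_specs if s.strip()}
    on_storage = {p.strip("/") for p in storage_paths}
    return sorted(on_storage - registered)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    storage = ["dev1/2022/08/01", "dev2/2022/05/01", "dev2/2022/05/02"]
    hive = [
        "deviceid=dev1/year=2022/month=08/day=01",
        "deviceid=dev2/year=2022/month=05/day=02",
    ]
    print(find_missing_partitions(storage, hive))
```

The comparison is purely string based, so it assumes partition values appear identically on storage and in Hive (no escaping or zero-padding differences).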

Expected behavior

I would expect either an error when partitions are not synced, so that the last commit time synced is not updated, or all partitions to be detected immediately.

Environment Description

  • Hudi version : 0.10.0

  • Spark version : 3.1.2

  • Hive version : client side: 2.3.7 through hudi, standalone metastore 3.0

  • Hadoop version : 3.2.0

  • Storage (HDFS/S3/GCS…) : Azure Data Lake Gen 2

  • Running on Docker? (yes/no) : yes (k8s)

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Aug 27, 2022

One possible reason why you are seeing this: I assume you are running hive sync as a standalone job and not along with your regular writes. In such cases, hive sync will only consider commits in the active timeline.

For example, let's say you have 10 commits and ran hive sync to sync everything, and things are in good shape. Now you add 100 more commits, and your cleaner and archival configs are such that only the last 20 commits are in the active timeline. If you run hive sync again, Hudi might sync only the partitions added in the last 20 commits, not all 100.
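The scenario above can be illustrated with a toy model. This is not Hudi's implementation, only a sketch of the described effect, assuming each commit adds exactly one new partition and that an incremental sync can see only the commits still in the active timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: int            # stand-in for a commit timestamp
    partitions: frozenset   # partitions written by this commit


def incremental_sync(all_commits, last_synced, active_size):
    """Partitions an incremental sync would find (toy model).

    Only the newest `active_size` commits are visible; older ones are
    treated as archived and therefore skipped.
    """
    active = all_commits[-active_size:]
    found = set()
    for commit in active:
        if commit.instant > last_synced:
            found |= commit.partitions
    return found


def simulate():
    # 10 initial commits plus 100 more, each adding one partition p<i>.
    commits = [Commit(i, frozenset({f"p{i}"})) for i in range(1, 111)]
    # First sync ran after commit 10 and registered everything so far.
    synced = {p for c in commits[:10] for p in c.partitions}
    # Second sync runs after commit 110 with only 20 active commits.
    synced |= incremental_sync(commits, last_synced=10, active_size=20)
    expected = {p for c in commits for p in c.partitions}
    return sorted(expected - synced, key=lambda p: int(p[1:]))


if __name__ == "__main__":
    missing = simulate()
    print(len(missing), missing[0], missing[-1])
```

In this model the partitions from commits 11 through 90 are never registered, which matches the symptom of partitions that later syncs do not pick up.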

Is there a chance this is happening in your case?

0 reactions
matthiasdg commented, Aug 29, 2022

Yup, that’s more or less what I thought was the case. (I’m not sure exactly which Hudi setting manages it; I mentioned commit retention in one of the earlier comments, but it could be something else.) Is this behavior, and the settings that affect it, documented somewhere? It’s not a big deal, just something to take into consideration.
