Improve HiveSyncTool handling of empty commit timeline
See original GitHub issue.

Here is my use case:
I am using Spark Streaming to write data received from Kafka to a Hudi table, and then syncing it to a non-partitioned Hive table. I set the key generator to NonpartitionedKeyGenerator.class.
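For reference, the non-partitioned setup described above roughly corresponds to write options like the following. This is a hedged sketch only: the option keys follow the hoodie datasource config names, and the key generator's fully-qualified class name differs between Hudi releases (the pre-Apache `com.uber.hoodie` packages vs the later `org.apache.hudi` ones), so verify both against your version.

```java
import java.util.Properties;

public class HudiWriteConfig {
    public static void main(String[] args) {
        // Illustrative sketch of the options a Spark writer would pass to the
        // hoodie datasource for a non-partitioned table. Key names and the key
        // generator class name are version-dependent assumptions, not verified API.
        Properties opts = new Properties();
        opts.setProperty("hoodie.datasource.write.keygenerator.class",
                "com.uber.hoodie.NonpartitionedKeyGenerator");
        // A non-partitioned key generator pairs with an empty partition-path field.
        opts.setProperty("hoodie.datasource.write.partitionpath.field", "");
        opts.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```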
When I synced to Hive, the following error occurred. What is the reason for this? Here is the error:
2019-01-02 10:04:02,511 ERROR scheduler.JobScheduler (Logging.scala:logError(91)) - Error running job streaming job 1546394640000 ms.0
java.lang.IllegalArgumentException: Could not find any data file written for commit [20190102100400__commit__COMPLETED], could not get schema for dataset hdfs://nns-off/databus/hudi/tables/databus_realtime_databus_realtime_databus_sub_hd_t_hudi_sub_hd, Metadata :HoodieCommitMetadata{partitionToWriteStats={}, compacted=false, extraMetadataMap={ROLLING_STAT={
"partitionToRollingStats" : {
"" : {
"9e163bc3-c14f-4c46-937a-67134b26f7e2" : {
"fileId" : "9e163bc3-c14f-4c46-937a-67134b26f7e2",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434448
},
"0ba4d519-6e8d-42f6-b27e-f027b45f5b06" : {
"fileId" : "0ba4d519-6e8d-42f6-b27e-f027b45f5b06",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"a4d2eaf3-6027-4093-9621-b40cc2ebcb8b" : {
"fileId" : "a4d2eaf3-6027-4093-9621-b40cc2ebcb8b",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"0571876c-78a6-4263-8976-c687ad2e0ba9" : {
"fileId" : "0571876c-78a6-4263-8976-c687ad2e0ba9",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434435
},
"321509ba-814e-4d9d-a135-dfe958490ec8" : {
"fileId" : "321509ba-814e-4d9d-a135-dfe958490ec8",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434435
},
"8c36a009-0042-40f5-abb3-91f891cec775" : {
"fileId" : "8c36a009-0042-40f5-abb3-91f891cec775",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434450
},
"15adf256-c64d-4dd9-9d45-4d911b5763ba" : {
"fileId" : "15adf256-c64d-4dd9-9d45-4d911b5763ba",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434435
},
"17ef27e0-dedc-4b8b-9997-da7bc65a61d1" : {
"fileId" : "17ef27e0-dedc-4b8b-9997-da7bc65a61d1",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"4139fed7-ce24-4e16-ba2f-9dd30b5a7100" : {
"fileId" : "4139fed7-ce24-4e16-ba2f-9dd30b5a7100",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"2284b3dc-4137-44b0-a71f-de5fbd14a7ff" : {
"fileId" : "2284b3dc-4137-44b0-a71f-de5fbd14a7ff",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"d10310dd-7b4a-4669-9523-e7e9ac01ff17" : {
"fileId" : "d10310dd-7b4a-4669-9523-e7e9ac01ff17",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434436
},
"68592d6b-82f7-4ea6-8cce-8b21a07d4b1a" : {
"fileId" : "68592d6b-82f7-4ea6-8cce-8b21a07d4b1a",
"inserts" : 1,
"upserts" : 0,
"deletes" : 0,
"totalInputWriteBytesToDisk" : 0,
"totalInputWriteBytesOnDisk" : 434441
}
}
},
"actionType" : "commit"
}}}
at com.uber.hoodie.hive.HoodieHiveClient.lambda$getDataSchema$1(HoodieHiveClient.java:317)
at java.util.Optional.orElseThrow(Optional.java:290)
at com.uber.hoodie.hive.HoodieHiveClient.getDataSchema(HoodieHiveClient.java:315)
at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at com.uber.hoodie.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:68)
at com.lianjia.dtarch.databus.streaming.hudi.service.HudiService.syncToHive(HudiService.java:79)
at com.lianjia.dtarch.databus.streaming.hudi.service.HudiService.writeWithCompactAndSync(HudiService.java:58)
at com.lianjia.dtarch.databus.streaming.hudi.KfkHudiConsumer.lambda$saveToHudi$c06d719c$1(KfkHudiConsumer.java:161)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
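The metadata dump in the exception shows partitionToWriteStats={}: the commit completed without recording any write stats, so HoodieHiveClient.getDataSchema finds no data file from which to read a schema and throws. A minimal sketch of the kind of guard the issue title asks for, using a plain Map in place of Hudi's HoodieCommitMetadata (the hasDataFiles helper is hypothetical, not actual Hudi API):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class SyncGuard {

    // Hypothetical helper: true only when the commit carries write stats,
    // i.e. at least one partition maps to a non-empty list of file-level stats.
    static boolean hasDataFiles(Map<String, List<String>> partitionToWriteStats) {
        return partitionToWriteStats != null
                && partitionToWriteStats.values().stream().anyMatch(stats -> !stats.isEmpty());
    }

    public static void main(String[] args) {
        // The failing commit above: partitionToWriteStats={}
        Map<String, List<String>> emptyCommit = Collections.emptyMap();
        // A healthy commit: one partition ("" for non-partitioned) with one write stat.
        Map<String, List<String>> healthyCommit =
                Collections.singletonMap("", Collections.singletonList("some-write-stat"));

        System.out.println(hasDataFiles(emptyCommit) ? "sync" : "skip schema read");
        System.out.println(hasDataFiles(healthyCommit) ? "sync" : "skip schema read");
    }
}
```

Per the issue title, the ask is for HiveSyncTool to handle a commit timeline like this gracefully rather than fail the streaming job; a check along these lines would let the sync step be skipped until a commit with actual data files arrives.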
Issue Analytics
- State:
- Created 5 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
@NetsanetGeb Hudi is registered as an external table in Hive. Hudi controls how writes are done to a table (writing to an HDFS location), manages schema evolution via Avro, and registers a custom InputFormat to allow snapshot reading of these tables. Thus, the data is controlled and managed by Hudi, while the metadata (such as partitions) is managed by Hive. There shouldn't be any difference in the Hive metastore for Hudi tables, but there are a few general differences between managed and external tables, such as how dropping partitions/tables works. Some details here: https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
@NetsanetGeb It just seems like the job cannot talk to Kafka. By the way, do you mind posting this on the mailing list, since it seems like a separate issue? https://hudi.apache.org/community.html We are using the mailing list as the primary support channel now and can respond much more quickly there.