[SUPPORT] Compaction fails with "java.io.FileNotFoundException"
Describe the problem you faced
We are using Hudi 0.5.3 patched with https://github.com/apache/hudi/pull/1765, so that a compaction that previously failed is retried before new compactions.
When the compaction is retried, it fails with "java.io.FileNotFoundException".
To Reproduce
I'm sorry, but I currently don't have a simple way to reproduce this problem.
Here is how I got this error:
- Initialize a Hudi table using Spark and "bulk insert"
- Launch a Spark structured streaming application that consumes messages from Kafka and saves them to Hudi, using "upsert"
Expected behavior
Compaction should not fail.
Environment Description
- Hudi version: 0.5.3 patched with https://github.com/apache/hudi/pull/1765
- Spark version: 2.4.4 (EMR 6.0.0)
- Hive version: 3.1.2
- Hadoop version: 3.2.1
- Storage (HDFS/S3/GCS…): S3
- Running on Docker? (yes/no): no
Additional context
- The throughput is around 15 messages per second.
- The Hudi table has around 20 partitions.
- No external processes delete files from S3.
- The structured streaming job runs every 5 minutes with the following properties:
Map(
  "hoodie.upsert.shuffle.parallelism" -> "200",
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "1",
  "hoodie.filesystem.view.incr.timeline.sync.enable" -> "true",
  HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
  TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL,
  OPERATION_OPT_KEY -> UPSERT_OPERATION_OPT_VAL,
  CLEANER_INCREMENTAL_MODE -> "true",
  CLEANER_POLICY_PROP -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name(),
  CLEANER_FILE_VERSIONS_RETAINED_PROP -> "12"
)
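For reference, the same option map can be written with plain string keys, so it type-checks as a `Map[String, String]` without Hudi's `DataSourceWriteOptions` constants on the classpath. The key strings below are my reading of the Hudi 0.5.x constants and should be double-checked against `DataSourceWriteOptions` before relying on them; this is a sketch, not the reporter's exact code.

```scala
// Hedged sketch: the writer options above with the Hudi 0.5.x constant names
// expanded to the string keys I believe they resolve to (verify against
// DataSourceWriteOptions and HoodieCompactionConfig).
val hudiOptions: Map[String, String] = Map(
  "hoodie.upsert.shuffle.parallelism"                -> "200",
  "hoodie.compact.inline"                            -> "true",
  "hoodie.compact.inline.max.delta.commits"          -> "1",
  "hoodie.filesystem.view.incr.timeline.sync.enable" -> "true",
  "hoodie.datasource.hive_sync.enable"               -> "true",          // HIVE_SYNC_ENABLED_OPT_KEY
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.MultiPartKeysValueExtractor",                  // HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY
  "hoodie.datasource.write.hive_style_partitioning"  -> "true",          // HIVE_STYLE_PARTITIONING_OPT_KEY
  "hoodie.datasource.write.table.type"               -> "MERGE_ON_READ", // TABLE_TYPE_OPT_KEY
  "hoodie.datasource.write.operation"                -> "upsert",        // OPERATION_OPT_KEY
  "hoodie.cleaner.incremental.mode"                  -> "true",          // CLEANER_INCREMENTAL_MODE
  "hoodie.cleaner.policy"                            -> "KEEP_LATEST_FILE_VERSIONS",
  "hoodie.cleaner.fileversions.retained"             -> "12"
)
```

In the streaming job these options would typically be passed to the writer via `df.write.format("hudi").options(hudiOptions)`.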
Output of `compactions show all` with the Hudi CLI:

| Compaction Instant Time | State     | Total FileIds to be Compacted |
|-------------------------|-----------|-------------------------------|
| 20200821154520          | INFLIGHT  | 57                            |
| 20200821153748          | COMPLETED | 56                            |
| 20200821152906          | COMPLETED | 50                            |
| 20200821152207          | COMPLETED | 52                            |
| 20200821151547          | COMPLETED | 57                            |
| 20200821151014          | COMPLETED | 48                            |
| 20200821150425          | COMPLETED | 54                            |
| 20200821145904          | COMPLETED | 49                            |
| 20200821145253          | COMPLETED | 60                            |
| 20200821144717          | COMPLETED | 55                            |
| 20200821144125          | COMPLETED | 59                            |
| 20200821143533          | COMPLETED | 56                            |
| 20200821142949          | COMPLETED | 55                            |
| 20200821142335          | COMPLETED | 59                            |
| 20200821141741          | COMPLETED | 63                            |
Output of `cleans show` with the Hudi CLI:

| CleanTime      | EarliestCommandRetained | Total Files Deleted | Total Time Taken |
|----------------|-------------------------|---------------------|------------------|
| 20200821152814 |                         | 619                 | -1               |
| 20200821152115 |                         | 24                  | -1               |
| 20200821151459 |                         | 4                   | -1               |
| 20200821150921 |                         | 6                   | -1               |
| 20200821150334 |                         | 97                  | -1               |
| 20200821145815 |                         | 192                 | -1               |
| 20200821145201 |                         | 128                 | -1               |
| 20200821144630 |                         | 24                  | -1               |
| 20200821144033 |                         | 14                  | -1               |
| 20200821143441 |                         | 28                  | -1               |
| 20200821142858 |                         | 114                 | -1               |
| 20200821142242 |                         | 614                 | -1               |
| 20200821141650 |                         | 79                  | -1               |
| 20200821141111 |                         | 12                  | -1               |
| 20200821140501 |                         | 38                  | -1               |
| 20200821135933 |                         | 8                   | -1               |
| 20200821135412 |                         | 147                 | -1               |
| 20200821134904 |                         | 99                  | -1               |
| 20200821134339 |                         | 77                  | -1               |
| 20200821133821 |                         | 41                  | -1               |
| 20200821133227 |                         | 1                   | -1               |
Stacktrace
20/08/24 03:55:31 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/08/24 03:57:48 ERROR HoodieMergeOnReadTable: Rolling back instant [==>20200821154520__compaction__INFLIGHT]
20/08/24 03:58:03 WARN HoodieCopyOnWriteTable: Rollback finished without deleting inflight instant file. Instant=[==>20200821154520__compaction__INFLIGHT]
20/08/24 03:58:33 WARN TaskSetManager: Lost task 7.0 in stage 39.0 (TID 2576, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
... 26 more
20/08/24 03:58:49 WARN TaskSetManager: Lost task 7.3 in stage 39.0 (TID 2582, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
... 26 more
20/08/24 03:58:49 ERROR TaskSetManager: Task 7 in stage 39.0 failed 4 times; aborting job
20/08/24 03:58:49 ERROR MicroBatchExecution: Query [id = 418bbb3a-3def-4a20-987b-2ac7a0ca7004, runId = ff16cb78-6247-413f-bd94-afd1c3ef48ed] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 39.0 failed 4 times, most recent failure: Lost task 7.3 in stage 39.0 (TID 2582, ip-xxx-xxx-xxx-xxx.ap-northeast-1.compute.internal, executor 1): org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
... 26 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at org.apache.hudi.client.HoodieWriteClient.doCompactionCommit(HoodieWriteClient.java:1134)
at org.apache.hudi.client.HoodieWriteClient.commitCompaction(HoodieWriteClient.java:1102)
at org.apache.hudi.client.HoodieWriteClient.runCompaction(HoodieWriteClient.java:1085)
at org.apache.hudi.client.HoodieWriteClient.compact(HoodieWriteClient.java:1056)
at org.apache.hudi.client.HoodieWriteClient.lambda$runEarlierInflightCompactions$3(HoodieWriteClient.java:524)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.hudi.client.HoodieWriteClient.runEarlierInflightCompactions(HoodieWriteClient.java:521)
at org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:501)
at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:157)
at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:101)
at org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:92)
at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:268)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:188)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at aaa.dataprocessor.writer.EventsWriter$.saveToHudiTable(EventsWriter.scala:145)
at aaa.dataprocessor.MainProcessor$.processBatch(MainProcessor.scala:162)
at aaa.dataprocessor.MainProcessor$.$anonfun$main$4(MainProcessor.scala:90)
at aaa.dataprocessor.MainProcessor$.$anonfun$main$4$adapted(MainProcessor.scala:82)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:84)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$14(MicroBatchExecution.scala:536)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:349)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:198)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:349)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.hudi.exception.HoodieException: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:207)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:190)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:139)
at org.apache.hudi.table.compact.HoodieMergeOnReadTableCompactor.lambda$compact$644ebad7$1(HoodieMergeOnReadTableCompactor.java:98)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1182)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://myBucket/absolute_path_to/daas_date=2020-05/0c376059-0279-4967-8002-70c3cd9c6b8e-0_6-3401-224110_20200821153748.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:617)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:202)
... 26 more
Issue Analytics
- Created: 3 years ago
- Comments: 18 (9 by maintainers)
Top GitHub Comments
@zherenyu831 @dm-tran: Good catch on the incremental timeline syncing. This is still an experimental feature and is disabled by default. There could be a bug here. I will investigate further and have raised a blocker for the next release: https://issues.apache.org/jira/browse/HUDI-1275
Please set this property to false for now. Also, please use the "compaction unschedule" CLI command to revert compactions. Deleting inflight/requested compaction files is not safe.
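A minimal sketch of the suggested workaround, assuming the job keeps its writer options in a plain map (`baseOptions` is a hypothetical stand-in for whatever the job already builds): override the experimental flag before the next write.

```scala
// Hedged sketch of the maintainers' mitigation: force the experimental
// incremental timeline sync flag off. The key string is assumed from
// Hudi 0.5.x and should be verified against the config reference.
val baseOptions: Map[String, String] = Map(
  "hoodie.filesystem.view.incr.timeline.sync.enable" -> "true"
)
val mitigatedOptions: Map[String, String] =
  baseOptions + ("hoodie.filesystem.view.incr.timeline.sync.enable" -> "false")
```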
@zherenyu831 It seems the issue is resolved by setting the config to false. We will debug the underlying bug in the JIRA opened by @bvaradar. Closing this ticket.