[SUPPORT] File names in S3 do not match the file names in the latest .commit file
Tips before filing an issue
- Have you gone through our FAQs? Yes.
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org. Joined.
- If you have triaged this as a bug, then file an issue directly. Not sure if this is a bug, as it was hard to reproduce.
Describe the problem you faced
Some of our Hudi jobs fail with a FileNotFoundException while reading parquet files from S3. The exception is thrown for files whose names in S3 do not match the names in the latest .commit file: the names share the same fileId and instantTime but differ in their writeToken.
Details
We have a Merge-on-Read (MoR) table in Hudi into which we periodically upsert data; compaction runs after every 5 delta commits. The job had been running fine but recently started to fail with the exception below (full stack trace at the bottom):
java.io.FileNotFoundException: No such file or directory 's3://XXXX/date=2020/46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet'
Comparing the file names recorded in the latest commit (via `commit showfiles`) with those in the S3 directory (via `fsview latest`), we observe that some file names match while others do not. One such example:
File name from the commit file (20220429052025969.commit):
hudi> commit showfiles --commit 20220429052025969
date=2020/46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet
File present in S3
hudi> fsview latest --partitionPath date=2020
date=2020/46465be4-73c8-42e7-9905-088e15e0b627-0_23-627-13061_20220429052025969.parquet
This example file has different names in S3 and in the .commit file. The names have matching fileId and instantTime but differ in their writeToken.
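For reference, a Hudi base file name is composed of three underscore-delimited parts, `<fileId>_<writeToken>_<instantTime>.parquet`. The small sketch below (the helper function is ours, not a Hudi API) splits the two names above and confirms that only the write token differs:

```python
def parse_base_file_name(name):
    """Split a Hudi base file name into (fileId, writeToken, instantTime).

    Base files are named <fileId>_<writeToken>_<instantTime>.<ext>, so the
    last two underscores delimit the three parts (the fileId itself may
    contain hyphens and an underscore-free slot suffix like "-0").
    """
    stem = name.rsplit(".", 1)[0]               # drop the .parquet extension
    file_id, write_token, instant_time = stem.rsplit("_", 2)
    return file_id, write_token, instant_time

committed = "46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet"
on_s3 = "46465be4-73c8-42e7-9905-088e15e0b627-0_23-627-13061_20220429052025969.parquet"

a = parse_base_file_name(committed)
b = parse_base_file_name(on_s3)
# Same fileId and instantTime, different writeToken:
print(a[0] == b[0], a[2] == b[2], a[1] == b[1])  # prints: True True False
```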
The latest commit
From the .hoodie directory in S3, we see that the last successful commit was at instant 20220429052025969:
aws s3 ls s3://XXXX/wallet-XXXX/.hoodie/
...
2022-04-29 05:17:55 0 20220429051753856.deltacommit.requested
2022-04-29 05:19:43 25851 20220429051753856.deltacommit.inflight
2022-04-29 05:20:21 52061 20220429051753856.deltacommit
2022-04-29 05:20:31 0 20220429052025969.compaction.inflight
2022-04-29 05:20:31 34606 20220429052025969.compaction.requested
2022-04-29 05:30:14 59691 20220429052025969.commit
After this commit, we see a series of rollbacks as the job continued to fail with FileNotFoundException:
2022-04-29 06:02:26 1230 20220429060225213.rollback.requested
2022-04-29 06:02:27 0 20220429060225213.rollback.inflight
2022-04-29 06:02:34 1531 20220429060225213.rollback
...
hudi> commits show
CommitTime | Total Bytes Written | Total Files Added | Total Files Updated | Total Partitions Written | Total Records Written | Total Update Records Written | Total Errors |
---|---|---|---|---|---|---|---|
20220429052025969 | 561.7 MB | 0 | 61 | 3 | 18297036 | 12586123 | 0 |
… |
hudi> compactions show all
Compaction Instant Time | State | Total FileIds to be Compacted |
---|---|---|
20220429052025969 | COMPLETED | 61 |
… |
We tried to run `compaction repair` for the instant 20220429052025969, but that didn't help:
Result of Repair Operation : <empty>
As we can see from the commits, no cleaner ran after the latest commit at 20220429052025969. There was also no other pending compaction.
Expected behavior
- All file names in the commit file should match those in the S3 directory.
- Alternatively, is there a utility to synchronize the file names between the commit file and S3?
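As a stopgap diagnostic, the mismatched slices can be enumerated by cross-checking the two listings. A rough sketch, where the input lists stand in for the file names reported by `commit showfiles` and `aws s3 ls` respectively (this is our own helper, not a Hudi utility):

```python
def parse_name(name):
    """Return ((fileId, instantTime), writeToken) for a Hudi base file name."""
    stem = name.rsplit(".", 1)[0]
    file_id, token, instant = stem.rsplit("_", 2)
    return (file_id, instant), token

def find_token_mismatches(committed, on_storage):
    """Report file slices whose committed writeToken differs from the token
    of the file actually present on storage (same fileId and instantTime)."""
    present = dict(parse_name(n) for n in on_storage)
    mismatches = []
    for name in committed:
        key, token = parse_name(name)
        actual = present.get(key)
        if actual is not None and actual != token:
            mismatches.append({"fileId": key[0], "instant": key[1],
                               "committed": token, "on_storage": actual})
    return mismatches

committed = ["46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet"]
on_storage = ["46465be4-73c8-42e7-9905-088e15e0b627-0_23-627-13061_20220429052025969.parquet"]
print(find_token_mismatches(committed, on_storage))
```

This only flags slices where both sides agree on fileId and instantTime; files missing entirely from one side would need a separate set difference.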
I'm not sure what could have caused this issue, so any pointers, configs, or insights would be helpful. I'll be happy to share further information.
Environment Description
- Hudi version : 0.10.1
- Spark version : 3.1.2
- Hive version : (not using Hive in this pipeline)
- Hadoop version : 3.3.1
- Storage (HDFS/S3/GCS) : S3
- Running on Docker? (yes/no) : no
Additional context
The table is Merge-on-Read with the following properties:
Property | Value |
---|---|
basePath | s3://xxxx/wallet_db5/wallet-xxxx |
metaPath | s3://xxxx/wallet-xxxx/.hoodie |
fileSystem | s3 |
hoodie.compaction.payload.class | our_custom_payload_class |
hoodie.table.type | MERGE_ON_READ |
hoodie.table.precombine.field | xxxx |
hoodie.table.partition.fields | xxxx |
hoodie.archivelog.folder | archived |
hoodie.timeline.layout.version | 1 |
hoodie.table.name | wallet_xxxx |
hoodie.table.recordkey.fields | id |
hoodie.datasource.write.hive_style_partitioning | true |
hoodie.table.keygenerator.class | org.apache.hudi.keygen.SimpleKeyGenerator |
hoodie.populate.meta.fields | true |
hoodie.table.base.file.format | PARQUET |
hoodie.datasource.write.partitionpath.urlencode | false |
hoodie.table.version | 3 |
Stack Trace
WARN TaskSetManager: Lost task 32.0 in stage 11.0 (TID 695) (172.35.116.5 executor 1): org.apache.hudi.exception.HoodieIOException: Failed to read footer for parquet s3://XXXX/date=2020/46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet
at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:185)
at org.apache.hudi.common.util.ParquetUtils.readFooter(ParquetUtils.java:201)
at org.apache.hudi.common.util.BaseFileUtils.readMinMaxRecordKeys(BaseFileUtils.java:109)
at org.apache.hudi.io.storage.HoodieParquetReader.readMinMaxRecordKeys(HoodieParquetReader.java:49)
at org.apache.hudi.io.HoodieRangeInfoHandle.getMinMaxKeys(HoodieRangeInfoHandle.java:39)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$4cbadf07$1(HoodieBloomIndex.java:149)
at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: No such file or directory 's3://XXXX/date=2020/46465be4-73c8-42e7-9905-088e15e0b627-0_23-626-12975_20220429052025969.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:532)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:61)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:456)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:441)
at org.apache.hudi.common.util.ParquetUtils.readMetadata(ParquetUtils.java:183)
... 33 more
Issue Analytics
- Created: a year ago
- Comments: 9 (6 by maintainers)
Top GitHub Comments
We found an issue where, if there are retries of Spark tasks, the file tracked in the commit metadata could differ from the actual file that got finalized. We fixed it in 0.11: https://github.com/apache/hudi/pull/4753. Maybe this is related to the issue you are seeing.
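This retry explanation is consistent with the names in the report: the write token appears to encode the Spark task's partition id, stage id, and attempt id (an assumption based on the token layout, not confirmed here). Decomposing the two tokens from the example shows the same task partition but a later stage and attempt, which is what a retried write would produce:

```python
def split_token(token):
    """Split a Hudi write token into its three dash-delimited fields.

    Assumption: the layout is taskPartitionId-stageId-taskAttemptId.
    """
    part, stage, attempt = token.split("-")
    return {"taskPartitionId": part, "stageId": stage, "taskAttemptId": attempt}

print(split_token("23-626-12975"))  # token recorded in the commit metadata
print(split_token("23-627-13061"))  # token of the file actually on S3
```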
Thanks. Going ahead and closing the GitHub issue. Feel free to open a new one if you run into any issues.