Iceberg Parquet writer sometimes fails with "Failed to read footer of file: io.prestosql.plugin.iceberg.HdfsInputFile"
See original GitHub issue

2020-09-17T08:24:05.7369454Z [ERROR] testDecimal(io.prestosql.plugin.iceberg.TestIcebergParquetSmoke) Time elapsed: 3.565 s <<< FAILURE!
2020-09-17T08:24:05.7377258Z java.lang.RuntimeException: Failed to read footer of file: io.prestosql.plugin.iceberg.HdfsInputFile@1475b99b
2020-09-17T08:24:05.7389432Z at io.prestosql.testing.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:114)
2020-09-17T08:24:05.7395042Z at io.prestosql.testing.DistributedQueryRunner.execute(DistributedQueryRunner.java:442)
2020-09-17T08:24:05.7400031Z at io.prestosql.testing.QueryAssertions.assertUpdate(QueryAssertions.java:71)
2020-09-17T08:24:05.7406976Z at io.prestosql.testing.AbstractTestQueryFramework.assertUpdate(AbstractTestQueryFramework.java:224)
2020-09-17T08:24:05.7413663Z at io.prestosql.testing.AbstractTestQueryFramework.assertUpdate(AbstractTestQueryFramework.java:219)
2020-09-17T08:24:05.7422715Z at io.prestosql.plugin.iceberg.AbstractTestIcebergSmoke.testDecimalWithPrecisionAndScale(AbstractTestIcebergSmoke.java:170)
2020-09-17T08:24:05.7429748Z at io.prestosql.plugin.iceberg.AbstractTestIcebergSmoke.testDecimal(AbstractTestIcebergSmoke.java:146)
2020-09-17T08:24:05.7434541Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2020-09-17T08:24:05.7440757Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2020-09-17T08:24:05.7447773Z at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-09-17T08:24:05.7451432Z at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2020-09-17T08:24:05.7458460Z at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2020-09-17T08:24:05.7462465Z at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
2020-09-17T08:24:05.7467184Z at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
2020-09-17T08:24:05.7471959Z at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
2020-09-17T08:24:05.7476892Z at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
2020-09-17T08:24:05.7478806Z at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
2020-09-17T08:24:05.7480586Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-09-17T08:24:05.7482421Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-09-17T08:24:05.7483658Z at java.base/java.lang.Thread.run(Thread.java:834)
2020-09-17T08:24:05.7486402Z Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to read footer of file: io.prestosql.plugin.iceberg.HdfsInputFile@1475b99b
2020-09-17T08:24:05.7488551Z at org.apache.iceberg.parquet.ParquetUtil.fileMetrics(ParquetUtil.java:76)
2020-09-17T08:24:05.7490933Z at io.prestosql.plugin.iceberg.IcebergParquetFileWriter.lambda$getMetrics$0(IcebergParquetFileWriter.java:72)
2020-09-17T08:24:05.7493832Z at io.prestosql.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
2020-09-17T08:24:05.7496026Z at io.prestosql.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:96)
2020-09-17T08:24:05.7498243Z at io.prestosql.plugin.iceberg.IcebergParquetFileWriter.getMetrics(IcebergParquetFileWriter.java:72)
2020-09-17T08:24:05.7501323Z at io.prestosql.plugin.iceberg.IcebergPageSink.finish(IcebergPageSink.java:167)
2020-09-17T08:24:05.7506586Z at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorPageSink.finish(ClassLoaderSafeConnectorPageSink.java:77)
2020-09-17T08:24:05.7512214Z at io.prestosql.operator.TableWriterOperator.finish(TableWriterOperator.java:208)
2020-09-17T08:24:05.7516330Z at io.prestosql.operator.Driver.processInternal(Driver.java:397)
2020-09-17T08:24:05.7519155Z at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
2020-09-17T08:24:05.7522175Z at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
2020-09-17T08:24:05.7523668Z at io.prestosql.operator.Driver.processFor(Driver.java:276)
2020-09-17T08:24:05.7524996Z at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
2020-09-17T08:24:05.7527007Z at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
2020-09-17T08:24:05.7528947Z at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
2020-09-17T08:24:05.7530073Z at io.prestosql.$gen.Presto_testversion____20200917_081756_431.run(Unknown Source)
2020-09-17T08:24:05.7532350Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-09-17T08:24:05.7539859Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-09-17T08:24:05.7540876Z at java.base/java.lang.Thread.run(Thread.java:834)
2020-09-17T08:24:05.7541435Z Caused by: java.io.IOException: Stream is closed!
2020-09-17T08:24:05.7542559Z at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:56)
2020-09-17T08:24:05.7544159Z at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:75)
2020-09-17T08:24:05.7545679Z at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:255)
2020-09-17T08:24:05.7547651Z at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:300)
2020-09-17T08:24:05.7549183Z at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
2020-09-17T08:24:05.7550617Z at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
2020-09-17T08:24:05.7551836Z at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:460)
2020-09-17T08:24:05.7553048Z at org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:441)
2020-09-17T08:24:05.7577451Z at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
2020-09-17T08:24:05.7586410Z at org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream.seek(ChecksumFileSystem.java:331)
2020-09-17T08:24:05.7593957Z at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
2020-09-17T08:24:05.7602555Z at org.apache.parquet.hadoop.util.H1SeekableInputStream.seek(H1SeekableInputStream.java:46)
2020-09-17T08:24:05.7610448Z at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:526)
2020-09-17T08:24:05.7617741Z at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:712)
2020-09-17T08:24:05.7624544Z at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:597)
2020-09-17T08:24:05.7630795Z at org.apache.iceberg.parquet.ParquetUtil.fileMetrics(ParquetUtil.java:73)
2020-09-17T08:24:05.7633229Z ... 18 more
@rdblue should we be rewrapping here even if the stream is a HadoopSeekableInputStream? We lose the reference to it, and that causes the delegate to be closed prematurely through the finalizer.

Trino wraps HadoopInputFile with HdfsInputFile, but HdfsInputFile#newStream doesn't wrap the stream returned by HadoopInputFile#newStream. If we start wrapping it in a simple wrapper, the error on the Trino side would go away, because the wrapper causes ParquetIO#stream to create a ParquetInputStreamAdapter. That way the HadoopSeekableInputStream isn't lost, and it gets closed properly.

The root cause seems to be the unintended closing of the input stream through an abandoned parent stream's finalizer.
For the ParquetFileReader.open call in ParquetUtil.fileMetrics, the input is a ParquetInputFile (wrapping Trino's HdfsInputFile). ParquetFileReader.open triggers a newStream() on the input file. This implementation causes a SeekableInputStream to be created with a delegate, which is then thrown away in favor of a different wrapper.

Since that SeekableInputStream is thrown away, it becomes eligible for GC and finalizer invocation, which also closes the delegate it holds. Subsequent attempts to read from the delegate then fail, producing the stack trace we see.
We only see the error if the finalizer runs before ParquetUtil.fileMetrics has completed. (The SeekableInputStream is never closed except through the finalizer, so we deterministically see the "Unclosed input stream created by" warnings, just possibly with some delay.)
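To make the race concrete, here is a self-contained, JDK-only demo (hypothetical names; not Trino or Iceberg code) of an abandoned wrapper whose finalizer closes a delegate that is still being read:

// Stand-alone illustration of the failure mode: the abandoned wrapper's
// finalizer closes the shared delegate, so a later read on the delegate fails
// even though the reader never closed anything itself.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AbandonedWrapperDemo {
    static class ClosingWrapper extends InputStream {
        private final InputStream delegate;

        ClosingWrapper(InputStream delegate) { this.delegate = delegate; }

        @Override public int read() throws IOException { return delegate.read(); }

        @Override
        @SuppressWarnings("deprecation")
        protected void finalize() throws Throwable { delegate.close(); } // closes the delegate on GC
    }

    static class CheckedStream extends ByteArrayInputStream {
        private volatile boolean closed;

        CheckedStream(byte[] data) { super(data); }

        @Override public void close() { closed = true; }

        @Override public synchronized int read() {
            if (closed) { throw new IllegalStateException("Stream is closed!"); }
            return super.read();
        }
    }

    public static void main(String[] args) throws Exception {
        CheckedStream delegate = new CheckedStream(new byte[16]);
        new ClosingWrapper(delegate); // created and immediately abandoned, like the discarded SeekableInputStream
        System.gc();                  // encourage collection and finalization of the abandoned wrapper
        System.runFinalization();
        Thread.sleep(100);
        delegate.read();              // may now throw "Stream is closed!", mirroring the stack trace above
    }
}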
I think the issue might've been introduced with https://github.com/trinodb/trino/pull/4460, where we started using the wrapper HdfsInputFile instead of HadoopInputFile. There seems to be special handling for the latter in the Iceberg code: https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ParquetIO.java#L51 cc @electrum
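For context, an illustrative paraphrase of that special handling (simplified; not the exact source behind the link): the native Hadoop path applies only to Iceberg's concrete HadoopInputFile, so Trino's HdfsInputFile, which merely wraps it, falls through to the generic adapter and the stream rewrapping shown earlier.

// Illustrative paraphrase, not verbatim Iceberg source: only a real
// org.apache.iceberg.hadoop.HadoopInputFile is handed to Parquet as a native
// Hadoop input file; everything else goes through ParquetIO's generic adapter.
static org.apache.parquet.io.InputFile file(org.apache.iceberg.io.InputFile file) {
    if (file instanceof org.apache.iceberg.hadoop.HadoopInputFile) {
        org.apache.iceberg.hadoop.HadoopInputFile hadoopFile = (org.apache.iceberg.hadoop.HadoopInputFile) file;
        try {
            return org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(
                    new org.apache.hadoop.fs.Path(hadoopFile.location()), hadoopFile.getConf());
        }
        catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }
    // Trino's HdfsInputFile lands here: ParquetInputFile (ParquetIO's internal
    // adapter) wraps it, and its newStream() goes through the rewrapping above.
    return new ParquetInputFile(file);
}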