
[SUPPORT] java.io.EOFException: Invalid position - exceeds the bounds of the stream


Describe the problem you faced

Upgrading to 0.11.0 has caused consistent EOFExceptions on compaction for our Flink job. We have not seen this exception in 0.10.* versions.

To Reproduce

Steps to reproduce the behavior:

  1. Use Flink to write and compact data in S3 (exact trigger unclear).

Expected behavior

Compaction succeeds without an EOFException.

Environment Description

  • Hudi version: 0.11.0
  • Flink version: 1.14.2
  • Hive version: 2.3.9
  • Hadoop version: 2.10.1
  • Storage (HDFS/S3/GCS…): S3
  • Running on Docker? (yes/no): no

Additional context

Our config/setup looks something like this:


import org.apache.flink.configuration.Configuration
import org.apache.hudi.configuration.FlinkOptions
import org.apache.hudi.keygen.constant.KeyGeneratorType
import org.apache.hudi.sink.utils.Pipelines

// conf is a Flink Configuration carrying the Hudi write options
val path = "$basePath/$databaseName/$hudiTableName"
conf.setString(FlinkOptions.PATH, path)
conf.setString(FlinkOptions.TABLE_NAME, hudiTableName)
conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ")
conf.setString(FlinkOptions.SOURCE_AVRO_SCHEMA, schema)
conf.setInteger(FlinkOptions.WRITE_TASKS, 4)
conf.setInteger(FlinkOptions.COMPACTION_TASKS, 4)
conf.setString(FlinkOptions.RECORD_KEY_FIELD, "primaryKey")
conf.setString(FlinkOptions.PRECOMBINE_FIELD, "primaryKey") // TODO: pick something more intelligent

// Non-partitioned table: empty partition path field plus the NON_PARTITION key generator
conf.setString(FlinkOptions.PARTITION_PATH_FIELD, "") // TODO: determine partition strategy
conf.setString(FlinkOptions.KEYGEN_TYPE, KeyGeneratorType.NON_PARTITION.name)

// Trigger compaction after this many delta commits (one delta commit per checkpoint)
conf.setInteger(FlinkOptions.COMPACTION_DELTA_COMMITS, checkpointsPerCompaction)

....

// Wire the Hudi bootstrap, stream write, and compaction operators into the stream
dataStream.let { Pipelines.bootstrap(conf, rowType, 1, it) }
    .let { Pipelines.hoodieStreamWrite(conf, 1, it) }
    .let { Pipelines.compact(conf, it) }

Stacktrace

Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 6 for operator stream_write (3/4)#0. Failure reason: Checkpoint was declined.
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:265)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:170)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:348)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:233)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:206)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:186)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:605)
	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:315)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1329)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1315)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1258)
	... 22 more
Caused by: org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from s3://bucket/database/table/.hoodie/hoodie.properties
	at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:254)
	at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:125)
	at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:78)
	at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:668)
	at org.apache.hudi.client.BaseHoodieClient.createMetaClient(BaseHoodieClient.java:138)
	at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1448)
	at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1496)
	at org.apache.hudi.client.HoodieFlinkWriteClient.upsert(HoodieFlinkWriteClient.java:140)
	at org.apache.hudi.sink.StreamWriteFunction.lambda$initWriteFunction$1(StreamWriteFunction.java:184)
	at org.apache.hudi.sink.StreamWriteFunction.lambda$flushRemaining$7(StreamWriteFunction.java:461)
	at java.util.LinkedHashMap$LinkedValues.forEach(LinkedHashMap.java:608)
	at org.apache.hudi.sink.StreamWriteFunction.flushRemaining(StreamWriteFunction.java:454)
	at org.apache.hudi.sink.StreamWriteFunction.snapshotState(StreamWriteFunction.java:131)
	at org.apache.hudi.sink.common.AbstractStreamWriteFunction.snapshotState(AbstractStreamWriteFunction.java:157)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:87)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:219)
	... 33 more
Caused by: java.io.IOException: Unexpected end of stream pos=5200, contentLength=5175
	at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:297)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at java.io.DataInputStream.read(DataInputStream.java:100)
	at java.util.Properties$LineReader.readLine(Properties.java:508)
	at java.util.Properties.load0(Properties.java:353)
	at java.util.Properties.load(Properties.java:341)
	at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:303)
	at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:244)
	... 50 more
Caused by: java.io.EOFException: Invalid position: 5200, exceeds the bounds of the stream: [0, 5175]
	at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.throwPositionOutOfBoundsException(S3FSInputStream.java:411)
	at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.retrieveInputStreamWithInfo(S3FSInputStream.java:385)
	at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.reopenStream(S3FSInputStream.java:378)
	at com.amazon.ws.emr.hadoop.fs.s3.S3FSInputStream.read(S3FSInputStream.java:260)
	... 60 more
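
The length mismatch in the last frames is the tell: the read position (5200) exceeds the stream's content length (5175), which is consistent with hoodie.properties being replaced by a shorter file while it is being read; that concurrent-modification scenario is discussed in the maintainer comments below. A toy illustration of the mismatch (not Hudi code; the byte counts are taken from the trace):

import java.io.ByteArrayInputStream
import java.io.DataInputStream
import java.io.EOFException

fun main() {
    val shortenedFile = ByteArray(5175)                      // content after a concurrent rewrite
    val input = DataInputStream(ByteArrayInputStream(shortenedFile))
    try {
        input.readFully(ByteArray(5200))                     // reader still expects the old, longer length
    } catch (e: EOFException) {
        println("EOFException: tried to read past the end of the shortened stream")
    }
}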

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 23 (22 by maintainers)

Top GitHub Comments

2 reactions
danny0405 commented, May 23, 2022

I have filed a fix for Flink here: https://github.com/apache/hudi/pull/5660

https://issues.apache.org/jira/browse/HUDI-3782 and https://issues.apache.org/jira/browse/HUDI-4138 may cause this bug.

HoodieTable#getMetadataWriter is used by many async table services such as cleaning, compaction, and clustering. This method now tries to modify the table config each time it is called, regardless of whether the metadata table is enabled or disabled.

In general, we should never introduce side effects in the read code path of the Hoodie table config or the Hoodie table metadata writer.

I’m not sure how to fix this on the Spark side; I have two possible fixes in mind:

  1. Make the table config concurrency-safe (not recommended, because that is too heavy for a config).
  2. Make sure the metadata cleaning only happens once for the whole job lifetime (still risky because there may be multiple jobs, but with very small probability). I would suggest this option; a minimal sketch of such a guard follows below.
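
A minimal sketch of option 2, assuming a hypothetical run-once guard around the metadata cleanup (the object and helper names are illustrative, not Hudi APIs):

import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical guard, not Hudi's actual implementation: ensure the metadata
// cleanup that rewrites hoodie.properties runs at most once per job lifetime,
// even when several async table services (cleaning, compaction, clustering)
// reach this code path concurrently.
object MetadataCleanupGuard {
    private val done = AtomicBoolean(false)

    fun runOnce(cleanup: () -> Unit) {
        // compareAndSet admits only the first caller; every later caller skips
        if (done.compareAndSet(false, true)) {
            cleanup()
        }
    }
}

// Usage from any table service (deleteMetadataTableAndRewriteConfig is illustrative):
// MetadataCleanupGuard.runOnce { deleteMetadataTableAndRewriteConfig() }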
2 reactions
danny0405 commented, May 11, 2022

I can give some details/background here:

The CompactionPlanOperator tries to schedule a new compaction plan on each checkpoint-complete event. In release 0.11 we tweaked the strategy so that no new plan is scheduled while the last compaction has not finished yet; before 0.11, a new compaction was scheduled regardless of whether the last one was still ongoing.

We also added a timed-out rollback strategy in release 0.11, with a default timeout interval of 20 minutes. If a compaction does not finish within 20 minutes, we try to roll it back, see: https://github.com/apache/hudi/blob/4258a715174e0a97271112148ab20ef9307e2bd8/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L91
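
A rough sketch of that scheduling behavior, with hypothetical helper names standing in for the real timeline checks in CompactionPlanOperator (linked above):

// Rough sketch of the 0.11 scheduling behavior described above, not the actual
// CompactionPlanOperator code; the helpers are illustrative stand-ins.
class CompactionScheduler(private val timeoutMinutes: Long = 20) {

    fun onCheckpointComplete(checkpointId: Long) {
        if (hasPendingCompaction()) {
            // 0.11: do not schedule a new plan while the previous compaction is
            // still running; before 0.11 a new plan was scheduled regardless.
            rollbackIfTimedOut(timeoutMinutes)
            return
        }
        scheduleCompaction(checkpointId)
    }

    private fun hasPendingCompaction(): Boolean = TODO("check the timeline for an inflight compaction")
    private fun scheduleCompaction(checkpointId: Long): Unit = TODO("write a new compaction plan")
    private fun rollbackIfTimedOut(minutes: Long): Unit = TODO("roll back compactions that exceeded the timeout")
}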


