Error in writing CDC data with Flink
I'm using Flink to write CDC data to Iceberg (building from the master branch), and I came across the following problem:
Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot determine history between starting snapshot null and current 8203201752131271868
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:46)
at org.apache.iceberg.MergingSnapshotProducer.validateDataFilesExist(MergingSnapshotProducer.java:313)
at org.apache.iceberg.BaseRowDelta.validate(BaseRowDelta.java:95)
at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:162)
at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:283)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:213)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:197)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:189)
at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:282)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitOperation(IcebergFilesCommitter.java:298)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitDeltaTxn(IcebergFilesCommitter.java:285)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitUpToCheckpoint(IcebergFilesCommitter.java:210)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.initializeState(IcebergFilesCommitter.java:147)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:106)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:473)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:469)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:522)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:748)
While the Flink job is running, I start a Spark job to rewrite the data files. This is my Spark code:
// Compact small files toward the target file size.
RewriteDataFilesActionResult rewriteDataFilesActionResult = Actions.forTable(sparkSession, table)
    .rewriteDataFiles()
    .targetSizeInBytes(TARGET_SIZE)
    .splitLookback(SPLIT_LOOK_BACK)
    .execute();

// Delete files in the table location that are not referenced by any table metadata.
List<String> orphanFiles = Actions.forTable(sparkSession, table)
    .removeOrphanFiles()
    .olderThan(expireOrphanTime)
    .execute();

// Expire old snapshots, keeping at least the 30 most recent ones.
BaseExpireSnapshotsSparkAction expireSnapshotsSparkAction = new BaseExpireSnapshotsSparkAction(sparkSession, table);
ExpireSnapshots.Result result = expireSnapshotsSparkAction
    .expireOlderThan(expireSnapshotTime)
    .retainLast(30)
    .execute();
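For context, expireOrphanTime and expireSnapshotTime in the snippet above are plain epoch-millisecond cutoffs. A minimal sketch of how they might be computed (the three-day and one-day windows are illustrative assumptions, not values from the original issue):

import java.util.concurrent.TimeUnit;

// Remove orphan files older than 3 days and expire snapshots older than 1 day;
// both actions interpret the argument as an epoch-millisecond timestamp.
long expireOrphanTime = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
long expireSnapshotTime = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);

In a setup like this it is worth keeping the orphan-file cutoff comfortably larger than the Flink checkpoint interval, since files Flink has written but not yet committed can look like orphans to the cleanup action.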
Can someone help me look into this problem? Thanks. Does the rewrite action rewrite some data files that Flink has not yet committed?
There seems to be a lot of confusion around this issue. It was just referenced again in this comment.
The problem is not the behavior of the validation. That’s doing the right thing for how it is configured. I think that the problem is that the validation is configured to look over the entire table history, which is clearly not correct.
The problematic validation is validateDataFilesExist. It only needs to be used when adding position deletes, because equality deletes do not reference specific data files. Since position deletes for a CDC stream are only added against data files that are being added in the same commit, I don't think that validation even needs to be configured. We can simply remove these two lines: https://github.com/apache/iceberg/blob/1cb04128661ea147c2eec4dd1d025698f9604993/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L286-L287

@openinx and @stevenzwu, what do you think?
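A minimal sketch of what that change amounts to, assuming the committer's per-checkpoint RowDelta construction looks roughly like this (pendingDataFiles, pendingDeleteFiles, and referencedDataFiles are placeholder names, not the actual fields):

// Simplified sketch of the per-checkpoint delta commit in IcebergFilesCommitter.
RowDelta rowDelta = table.newRowDelta();

// The two calls proposed for removal configure the existence check that fails above:
// they make the commit verify that every data file referenced by a position delete
// still exists, walking the table history from the starting snapshot to the current one.
// rowDelta.validateDataFilesExist(referencedDataFiles);
// rowDelta.validateDeletedFiles();

pendingDataFiles.forEach(rowDelta::addRows);      // data files written for this checkpoint
pendingDeleteFiles.forEach(rowDelta::addDeletes); // position/equality delete files

rowDelta.commit();

Without those calls the commit still applies the deletes; it simply stops trying to prove that the referenced data files exist, which is safe here because the position deletes only point at files added in the same checkpoint.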
I’ve marked the fix for inclusion in the 0.12.1 patch release.