Error in writing CDC data with Flink
I'm using Flink to write CDC data to Iceberg (building from the master branch), and I came across the following problem:
Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot determine history between starting snapshot null and current 8203201752131271868
at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:46)
at org.apache.iceberg.MergingSnapshotProducer.validateDataFilesExist(MergingSnapshotProducer.java:313)
at org.apache.iceberg.BaseRowDelta.validate(BaseRowDelta.java:95)
at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:162)
at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:283)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:213)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:197)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:189)
at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:282)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitOperation(IcebergFilesCommitter.java:298)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitDeltaTxn(IcebergFilesCommitter.java:285)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitUpToCheckpoint(IcebergFilesCommitter.java:210)
at org.apache.iceberg.flink.sink.IcebergFilesCommitter.initializeState(IcebergFilesCommitter.java:147)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:106)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:473)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:469)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:522)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:748)
While the Flink job is running, I start a Spark job to rewrite the data files. This is my Spark code:
// Compact small files toward the target file size.
RewriteDataFilesActionResult rewriteDataFilesActionResult = Actions.forTable(sparkSession, table)
    .rewriteDataFiles()
    .targetSizeInBytes(TARGET_SIZE)
    .splitLookback(SPLIT_LOOK_BACK)
    .execute();

// Delete files in the table location that are not referenced by any table metadata.
List<String> orphanFiles = Actions.forTable(sparkSession, table)
    .removeOrphanFiles()
    .olderThan(expireOrphanTime)
    .execute();

// Expire old snapshots, keeping at least the 30 most recent ones.
BaseExpireSnapshotsSparkAction expireSnapshotsSparkAction = new BaseExpireSnapshotsSparkAction(sparkSession, table);
ExpireSnapshots.Result result = expireSnapshotsSparkAction
    .expireOlderThan(expireSnapshotTime)
    .retainLast(30)
    .execute();
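For context, expireOrphanTime and expireSnapshotTime in the snippet above are plain epoch-millisecond cutoffs. A minimal sketch of how they might be computed (the three-day and one-day windows are illustrative assumptions, not values from the original issue):

import java.util.concurrent.TimeUnit;

// Remove orphan files older than 3 days and expire snapshots older than 1 day;
// both actions interpret the argument as an epoch-millisecond timestamp.
long expireOrphanTime = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
long expireSnapshotTime = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(1);

In a setup like this it is worth keeping the orphan-file cutoff comfortably larger than the Flink checkpoint interval, since files Flink has written but not yet committed can look like orphans to the cleanup action.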
Can someone help me look into this problem? Thanks. Does the rewrite action rewrite some data files that Flink has not yet committed?
There seems to be a lot of confusion around this issue. It was just referenced again in this comment.
The problem is not the behavior of the validation. That’s doing the right thing for how it is configured. I think that the problem is that the validation is configured to look over the entire table history, which is clearly not correct.
The problematic validation is validateDataFilesExist. It only needs to be used when adding position deletes, because equality deletes do not reference specific data files. Since position deletes for a CDC stream are only added against data files that are being added in the same commit, I don't think that validation even needs to be configured. We can simply remove these two lines: https://github.com/apache/iceberg/blob/1cb04128661ea147c2eec4dd1d025698f9604993/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L286-L287

@openinx and @stevenzwu, what do you think?
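A minimal sketch of what that change amounts to, assuming the committer's per-checkpoint RowDelta construction looks roughly like this (pendingDataFiles, pendingDeleteFiles, and referencedDataFiles are placeholder names, not the actual fields):

// Simplified sketch of the per-checkpoint delta commit in IcebergFilesCommitter.
RowDelta rowDelta = table.newRowDelta();

// The two calls proposed for removal configure the existence check that fails above:
// they make the commit verify that every data file referenced by a position delete
// still exists, walking the table history from the starting snapshot to the current one.
// rowDelta.validateDataFilesExist(referencedDataFiles);
// rowDelta.validateDeletedFiles();

pendingDataFiles.forEach(rowDelta::addRows);      // data files written for this checkpoint
pendingDeleteFiles.forEach(rowDelta::addDeletes); // position/equality delete files

rowDelta.commit();

Without those calls the commit still applies the deletes; it simply stops trying to prove that the referenced data files exist, which is safe here because the position deletes only point at files added in the same checkpoint.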
I’ve marked the fix for inclusion in the 0.12.1 patch release.