
Error in writing CDC data with Flink

See original GitHub issue

I’m using Flink to write CDC data to Iceberg (building from the master branch), and I came across the following problem.

Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot determine history between starting snapshot null and current 8203201752131271868
	at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:46)
	at org.apache.iceberg.MergingSnapshotProducer.validateDataFilesExist(MergingSnapshotProducer.java:313)
	at org.apache.iceberg.BaseRowDelta.validate(BaseRowDelta.java:95)
	at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:162)
	at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:283)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:213)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:197)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:189)
	at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:282)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitOperation(IcebergFilesCommitter.java:298)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitDeltaTxn(IcebergFilesCommitter.java:285)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitUpToCheckpoint(IcebergFilesCommitter.java:210)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.initializeState(IcebergFilesCommitter.java:147)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:106)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285)
	at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:291)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:473)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:469)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:522)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
	at java.lang.Thread.run(Thread.java:748)

While the Flink job is running, I start a Spark job to rewrite data files. This is my Spark code:

RewriteDataFilesActionResult rewriteDataFilesActionResult = Actions.forTable(sparkSession, table)
        .rewriteDataFiles()
        .targetSizeInBytes(TARGET_SIZE)
        .splitLookback(SPLIT_LOOK_BACK)
        .execute();

List<String> orphanFiles = Actions.forTable(sparkSession, table)
        .removeOrphanFiles()
        .olderThan(expireOrphanTime)
        .execute();

BaseExpireSnapshotsSparkAction expireSnapshotsSparkAction = new BaseExpireSnapshotsSparkAction(sparkSession, table);
ExpireSnapshots.Result result = expireSnapshotsSparkAction
        .expireOlderThan(expireSnapshotTime)
        .retainLast(30)
        .execute();

Can someone help me with this problem? Thanks. Does the rewrite action rewrite some data files that Flink has not yet committed?
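One way to read the error message is that the committer’s recorded starting snapshot can no longer be resolved in the table’s history, for example after snapshot expiration. Below is a small, self-contained toy model of that lookup, not Iceberg’s actual implementation; all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: snapshots form a parent-pointer chain, and a validation must
// walk history from a starting snapshot back up to the current one.
class SnapshotHistoryModel {
    // snapshotId -> parentSnapshotId (first snapshot has a null parent)
    private final Map<Long, Long> parents = new HashMap<>();
    private Long currentId = null;

    void commit(long snapshotId) {
        parents.put(snapshotId, currentId);
        currentId = snapshotId;
    }

    // Models snapshot expiration: the snapshot id disappears from metadata.
    void expire(long snapshotId) {
        parents.remove(snapshotId);
    }

    // Models the failing check: walk back from the current snapshot. If the
    // starting snapshot was expired (or is null), the walk never reaches it
    // and the history between the two cannot be determined.
    boolean canDetermineHistory(Long startingId) {
        for (Long id = currentId; id != null && parents.containsKey(id); id = parents.get(id)) {
            if (id.equals(startingId)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model, expiring the snapshot that a restored checkpoint references makes `canDetermineHistory` return false, which mirrors how "Cannot determine history between starting snapshot null and current ..." can show up after maintenance jobs run concurrently with the Flink committer.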

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

2 reactions
rdblue commented, Oct 8, 2021

There seems to be a lot of confusion around this issue. It was just referenced again in this comment.

The problem is not the behavior of the validation. That’s doing the right thing for how it is configured. I think that the problem is that the validation is configured to look over the entire table history, which is clearly not correct.

The problematic validation is validateDataFilesExist. That only needs to be used when adding position deletes because equality deletes do not reference specific data files. Since position deletes for a CDC stream are only added against data files that are being added, I don’t think that validation even needs to be configured. We can simply remove these two lines: https://github.com/apache/iceberg/blob/1cb04128661ea147c2eec4dd1d025698f9604993/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java#L286-L287
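The distinction rdblue draws can be sketched with a small, self-contained toy model; this is not Iceberg code, and the names are hypothetical. A position delete points at a specific data file and row, so committing it only makes sense if that file still exists, whereas an equality delete matches rows by column values and references no file at all:

```java
import java.util.List;
import java.util.Set;

// Toy model of the two Iceberg delete flavors rdblue contrasts.
class DeleteFileModel {
    // Position delete: identifies a row by (data file path, row position).
    record PositionDelete(String dataFile, long row) {}

    // Equality delete: identifies rows by key values; note there is no
    // data-file field, so a "data files exist" check has nothing to verify.
    record EqualityDelete(String keyColumn, Object value) {}

    // Returns referenced data files that are no longer live; a nonempty
    // result models a validateDataFilesExist-style failure. Only position
    // deletes ever enter this check.
    static List<String> missingFiles(Set<String> liveDataFiles,
                                     List<PositionDelete> posDeletes) {
        return posDeletes.stream()
                .map(PositionDelete::dataFile)
                .distinct()
                .filter(f -> !liveDataFiles.contains(f))
                .toList();
    }
}
```

Under this model, a CDC commit that carries only equality deletes (plus position deletes against files added in the same commit) has nothing for the existence check to catch, which is the basis for rdblue’s suggestion that the validation can be dropped there.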

@openinx and @stevenzwu, what do you think?

0 reactions
rdblue commented, Oct 19, 2021

I’ve marked the fix for inclusion in the 0.12.1 patch release.
