Extension function forEachBatch can be added for DataStreamWriter
I found it difficult to call DataStreamWriter.foreachBatch from Kotlin because the code won't compile until a VoidFunction2 is constructed explicitly.
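For context, this is roughly what every call site has to look like today (a sketch: the function name startQuery and the parameter df are only assumptions for the example):

import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row

fun startQuery(df: Dataset<Row>) {
    // Without an extension, the lambda has to be wrapped in a VoidFunction2
    // explicitly, otherwise the call does not compile (as described above).
    df.writeStream()
        .foreachBatch(VoidFunction2<Dataset<Row>, Long> { batch, batchId ->
            println("batch $batchId: ${batch.count()} rows")
        })
        .start()
}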
So I suggest adding such an extension for DataStreamWriter:
import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.DataStreamWriter

public fun <T> DataStreamWriter<T>.forEachBatch(
    func: (batch: Dataset<T>, batchId: Long) -> Unit
): DataStreamWriter<T> = foreachBatch(
    VoidFunction2 { batch, batchId ->
        func(batch, batchId)
    }
)
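With the extension in scope, a call site could then look like this (a sketch: spark is assumed to be an existing SparkSession, and the rate source is only used to have something to stream):

import org.apache.spark.sql.SparkSession

fun runExample(spark: SparkSession) {
    // The rate source emits a timestamp/value pair per second, just for the demo.
    val stream = spark.readStream().format("rate").load()

    stream.writeStream()
        .forEachBatch { batch, batchId ->
            // Plain Kotlin lambda; no VoidFunction2 at the call site.
            println("Batch $batchId contains ${batch.count()} rows")
        }
        .start()
        .awaitTermination()
}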

@hawkaa, hello! Thank you for the remark.
In Kotlin we follow the coding conventions, which say: “Names of functions, properties and local variables start with a lowercase letter and use camel case”. That is the first argument why forEachBatch is a better option than foreachBatch. Also, kotlin-spark-api already has a forEach function that calls Spark's foreach under the hood (sketched below), so forEachBatch is more idiomatic, in terms of kotlin-spark-api, than Spark's foreachBatch. Therefore, when choosing between forEachBatch and foreachBatch for kotlin-spark-api, the first variant is more appropriate; it is available in release v1.1.0.
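For illustration, that existing pattern is roughly the following (a sketch, not the exact kotlin-spark-api source):

import org.apache.spark.api.java.function.ForeachFunction
import org.apache.spark.sql.Dataset

// Sketch of the camelCase-wrapper pattern: the Kotlin-style name simply
// delegates to Spark's lowercase method, hiding the Java functional interface.
fun <T> Dataset<T>.forEach(func: (T) -> Unit): Unit = foreach(ForeachFunction(func))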
Your specific example doesn't work because List<CatalogueNode> cannot be encoded: it's an interface, which can have functions, values, etc., so Spark does not know how to encode it. Only a collection of actual data classes would be allowed. But then again, the circular reference appears, of course. Unfortunately, I don't think we have a solution for that, especially since Spark itself does not support circular references. That makes sense if you consider that Datasets are essentially column/row data structures: if circular references were allowed, an infinite recursion could exist within a cell, which cannot be saved. Some things I found regarding this: https://issues.apache.org/jira/browse/SPARK-33598
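To make that concrete, a hypothetical self-referencing data class (the real CatalogueNode is not shown here) runs into exactly this problem:

// Hypothetical illustration only, not the real CatalogueNode: a data class
// that contains itself cannot be flattened into a fixed set of columns,
// so Spark cannot derive an encoder for it.
data class Node(
    val name: String,
    val children: List<Node> // self-reference: encoding would recurse forever
)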
As for java.util.UUID, Spark does not support it either, so I think it's outside the scope of the Kotlin Spark API to add support for this specific class. Usually we only mirror org.apache.spark.sql.Encoders.