
Extension function forEachBatch can be added for DataStreamWriter


I found it difficult to call DataStreamWriter.foreachBatch from Kotlin, because the source code won’t compile unless a VoidFunction2 is constructed explicitly.

So I suggest adding the following extension for DataStreamWriter:

import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.DataStreamWriter

public fun <T> DataStreamWriter<T>.forEachBatch(
    func: (batch: Dataset<T>, batchId: Long) -> Unit
): DataStreamWriter<T> = foreachBatch(
    VoidFunction2 { batch, batchId ->
        func(batch, batchId)
    }
)
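With the extension in place, a streaming job could be wired up as sketched below. This is illustrative only: it assumes an existing SparkSession named `spark` and uses Spark’s built-in `rate` source; none of these names come from the issue.

```kotlin
// Sketch, assuming `spark: SparkSession` is available and the
// forEachBatch extension above is on the classpath.
val query = spark
    .readStream()
    .format("rate")      // built-in test source that emits rows at a fixed rate
    .load()
    .writeStream()
    .forEachBatch { batch, batchId ->
        // Runs once per micro-batch; `batch` is a regular Dataset,
        // so any batch-style write or transformation is possible here.
        println("Batch $batchId contains ${batch.count()} rows")
    }
    .start()
```

Because the extension takes a Kotlin lambda directly, the call site avoids the explicit `VoidFunction2 { ... }` construction that the raw Java API requires.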

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

3 reactions
Pihanya commented, Jun 2, 2022

@hawkaa, hello! Thank you for the remark.

In Kotlin we follow the coding conventions, which state: “Names of functions, properties and local variables start with a lowercase letter and use camel case”. This is the first argument for why forEachBatch is the better option than foreachBatch.

Also, kotlin-spark-api already has a forEach function that calls Spark’s foreach under the hood. So forEachBatch is more idiomatic (in terms of kotlin-spark-api) compared to Spark’s foreachBatch.

Therefore, when choosing between forEachBatch and foreachBatch for kotlin-spark-api, the first variant is the more appropriate, and it is available in release v1.1.0.

1 reaction
Jolanrensen commented, Apr 20, 2022

Your specific example doesn’t work because List<CatalogueNode> cannot be encoded. CatalogueNode is an interface, which can have functions, values, etc., so Spark does not know how to encode it. Only a collection of actual data classes would be allowed, for example:

val folderChildren: List<Folder>? = null,
val identityRefChildren: List<IdentityRef>? = null,

But then again, the circular reference of course reappears. Unfortunately, I don’t think we have a solution for that, especially since Spark itself does not support circular references. It makes sense if you consider that Datasets are essentially column/row data structures: if circular references were allowed, a cell could contain an infinite recursion, which cannot be saved. Some things I found regarding this: https://issues.apache.org/jira/browse/SPARK-33598

As for java.util.UUID, Spark does not support this either, so I think it’s outside the scope of the Kotlin Spark API to add support for this specific class. Usually we only mirror org.apache.spark.sql.Encoders.
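The encoding constraint described above can be sketched with hypothetical classes (not taken from the issue): Spark can derive an encoder for a concrete data class whose fields are themselves encodable, but not for an interface-typed collection, and a field whose type refers back to the enclosing class reintroduces the circular-reference problem.

```kotlin
// Illustrative sketch only; class names are hypothetical.

// Encodable: a concrete data class with encodable fields.
data class IdentityRef(val id: String, val name: String)

// Not encodable: an interface has no fixed set of fields, so Spark
// cannot derive a column layout for List<CatalogueNode>.
interface CatalogueNode

data class Folder(
    val name: String,
    // Fine in principle: a list of a concrete data class.
    val identityRefChildren: List<IdentityRef>? = null,
    // val folderChildren: List<Folder>? = null  // circular reference: rejected by Spark
)
```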
