CosmosDB sink error: 'write' can not be called on streaming Dataset/DataFrame
See original GitHub issue
I am reading a data stream from Event Hub in Spark (on Databricks). My goal is to write the streamed data to Cosmos DB. However, I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame.
Is this scenario not supported?
Spark versions: 2.2.0 and 2.3.0
Libraries used:
- json-20140107
- rxnetty-0.4.20
- azure-documentdb-1.14.0
- azure-documentdb-rx-0.9.0-rc2
- azure-cosmosdb-spark_2.2.0_2.11-1.0.0
- rxjava-1.3.0
- azure-eventhubs-databricks_2.11-3.4.0
My code:
import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.streaming._
import com.microsoft.azure.cosmosdb.spark.config._
import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs.common.EventHubsUtils
import org.apache.spark.sql
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger
val sparkConfiguration = EventHubsUtils.initializeSparkStreamingConfigurations
sparkConfiguration.setAppName("MeetupStructuredStreaming")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.batchingTimeout", "60000")
sparkConfiguration.set("spark.streaming.receiver.writeAheadLog.enable", "true")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
sparkConfiguration.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
sparkConfiguration.set("spark.streaming.stopGracefullyOnShutdown", "true")
val spark = SparkSession.builder().config(sparkConfiguration).getOrCreate()
val configMap = Map(
"Endpoint" -> "https://xxx.documents.azure.com:443/",
"Masterkey" -> "xxxx",
"Database" -> "meetup",
"Collection" -> "event"
)
val ehStream = spark.readStream
.format("eventhubs")
.option("eventhubs.policyname", "xxx")
.option("eventhubs.policykey", "xxx")
.option("eventhubs.namespace", "xx")
.option("eventhubs.name", "meetup")
.option("eventhubs.partition.count", "2")
.option("eventhubs.maxRate", "100")
.option("eventhubs.consumergroup", "streaming")
.option("eventhubs.progressTrackingDir", "/tmp/eventhub-progress/meetup-events")
.option("eventhubs.sql.containsProperties", "true")
.load
.select(
from_unixtime(col("enqueuedTime").cast(LongType)).alias("enqueuedTime")
, from_unixtime(get_json_object(col("body").cast(StringType), "$.mtime").divide(1000)).alias("time")
, get_json_object(col("body").cast(StringType), "$.name").alias("name")
, get_json_object(col("body").cast(StringType), "$.event_url").alias("url")
, get_json_object(col("body").cast(StringType), "$.status").alias("status")
, get_json_object(col("body").cast(StringType), "$.venue.country").alias("country")
, get_json_object(col("body").cast(StringType), "$.venue.city").alias("city")
, get_json_object(col("body").cast(StringType), "$.group.category.shortname").alias("category"))
val cosmosDbStreamWriter = ehStream // start() returns a StreamingQuery; it is never reassigned, so val
.writeStream
.outputMode("append")
.format(classOf[CosmosDBSinkProvider].getName).options(configMap)
.option("checkpointLocation", "/tmp/streamingCheckpoint")
.trigger(Trigger.ProcessingTime(1000 * 3)) // every 3 seconds
.start()
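For context on what the analyzer is rejecting: a streaming Dataset (one where isStreaming is true) only supports the writeStream API, and calling the batch write API on it raises exactly this AnalysisException. The connector version above triggers it internally, but the distinction can be illustrated in a few lines (the paths and formats here are placeholders, not part of the original code):

```scala
// Batch and streaming DataFrames use different writer APIs.
val batchDf = spark.read.json("/data/events")               // batchDf.isStreaming == false
batchDf.write.json("/out/events")                           // OK: batch DataFrameWriter

val streamDf = spark.readStream.format("eventhubs").load()  // streamDf.isStreaming == true
// streamDf.write.json("/out/events")                       // AnalysisException: 'write' can not
                                                            // be called on streaming Dataset/DataFrame
streamDf.writeStream                                        // OK: DataStreamWriter
  .format("json")
  .option("checkpointLocation", "/tmp/ck")
  .start("/out/events")
```

So the scenario is supported in principle; the failure means the sink implementation itself was invoking the batch writer on the streaming DataFrame.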
Issue Analytics
- State:
- Created 6 years ago
- Comments:20 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @dennyglee,
Thanks for the update.
In the meantime, I have written up a workaround for streaming to Cosmos DB: https://medium.com/@fbeltrao/an-introduction-to-spark-streaming-from-a-net-developer-d773a3275a73
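The general shape of such a workaround (Spark 2.2/2.3 predates foreachBatch) is to bypass the broken sink and write each row yourself from a custom org.apache.spark.sql.ForeachWriter using the DocumentDB client already on the classpath (azure-documentdb-1.14.0). A hedged sketch, not the linked article's exact code; the endpoint, key, collection link, and the naive JSON serialization are placeholders:

```scala
import com.microsoft.azure.documentdb.{ConnectionPolicy, ConsistencyLevel, Document, DocumentClient}
import org.apache.spark.sql.{ForeachWriter, Row}

// One DocumentClient per partition; open/process/close are called by Spark
// for each partition of each micro-batch.
class CosmosDBForeachWriter(endpoint: String, masterKey: String, collectionLink: String)
  extends ForeachWriter[Row] {

  @transient private var client: DocumentClient = _

  override def open(partitionId: Long, version: Long): Boolean = {
    client = new DocumentClient(endpoint, masterKey,
      ConnectionPolicy.GetDefault(), ConsistencyLevel.Session)
    true // returning true tells Spark to go ahead and process this partition
  }

  override def process(row: Row): Unit = {
    // Naive JSON serialization, assuming flat string-valued columns like the
    // select() above produces; illustrative only.
    val json = row.schema.fieldNames
      .map(f => s""""$f": "${Option(row.getAs[String](f)).getOrElse("")}"""")
      .mkString("{", ", ", "}")
    client.createDocument(collectionLink, new Document(json), null, false)
  }

  override def close(errorOrNull: Throwable): Unit =
    if (client != null) client.close()
}

ehStream.writeStream
  .outputMode("append")
  .foreach(new CosmosDBForeachWriter(
    "https://xxx.documents.azure.com:443/", "xxxx", "dbs/meetup/colls/event"))
  .option("checkpointLocation", "/tmp/streamingCheckpoint")
  .start()
```

Note the foreach sink gives at-least-once delivery, so documents may be written more than once on recovery; supplying an explicit id per document makes retries idempotent.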
Apologies for the delayed response on this thread. We think we have identified the issue and resolved it in a recent PR. But before I say we’re done, let us run a few more tests to validate. We did these as part of #187 and #188 where both EventHubs and the Cosmos DB connectors are using the developer API internalCreateDataframe.