
CosmosDB sink error: 'write' can not be called on streaming Dataset/DataFrame

See original GitHub issue

I am reading a data stream from Event Hubs in Spark (on Databricks). My goal is to write the streamed data to Cosmos DB, but I get the following error: org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame.

Is this scenario not supported?

Spark versions: 2.2.0 and 2.3.0

Libraries used:

  • json-20140107
  • rxnetty-0.4.20
  • azure-documentdb-1.14.0
  • azure-documentdb-rx-0.9.0-rc2
  • azure-cosmosdb-spark_2.2.0_2.11-1.0.0
  • rxjava-1.3.0
  • azure-eventhubs-databricks_2.11-3.4.0

My code:

import com.microsoft.azure.cosmosdb.spark._
import com.microsoft.azure.cosmosdb.spark.schema._
import com.microsoft.azure.cosmosdb.spark.streaming._
import com.microsoft.azure.cosmosdb.spark.config._

import org.apache.spark.sql.SparkSession
import org.apache.spark.eventhubs.common.EventHubsUtils
import org.apache.spark.sql
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

// Base Spark configuration from the Event Hubs connector, plus write-ahead-log settings
val sparkConfiguration = EventHubsUtils.initializeSparkStreamingConfigurations

sparkConfiguration.setAppName("MeetupStructuredStreaming")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.batchingTimeout", "60000")
sparkConfiguration.set("spark.streaming.receiver.writeAheadLog.enable", "true")
sparkConfiguration.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
sparkConfiguration.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
sparkConfiguration.set("spark.streaming.stopGracefullyOnShutdown", "true")

val spark = SparkSession.builder().config(sparkConfiguration).getOrCreate()

// Cosmos DB connection settings for the sink
val configMap = Map(
  "Endpoint" -> "https://xxx.documents.azure.com:443/",
  "Masterkey" -> "xxxx",
  "Database" -> "meetup",
  "Collection" -> "event"
)

// Read from Event Hubs and project the fields of interest out of the JSON body
val ehStream = spark.readStream
  .format("eventhubs")
  .option("eventhubs.policyname", "xxx")
  .option("eventhubs.policykey", "xxx")
  .option("eventhubs.namespace", "xx")
  .option("eventhubs.name", "meetup")
  .option("eventhubs.partition.count", "2")
  .option("eventhubs.maxRate", "100")
  .option("eventhubs.consumergroup", "streaming")
  .option("eventhubs.progressTrackingDir", "/tmp/eventhub-progress/meetup-events")
  .option("eventhubs.sql.containsProperties", "true")        
  .load
  .select(
    from_unixtime(col("enqueuedTime").cast(LongType)).alias("enqueuedTime")
    , from_unixtime(get_json_object(col("body").cast(StringType), "$.mtime").divide(1000)).alias("time")
    , get_json_object(col("body").cast(StringType), "$.name").alias("name")
    , get_json_object(col("body").cast(StringType), "$.event_url").alias("url")
    , get_json_object(col("body").cast(StringType), "$.status").alias("status")
    , get_json_object(col("body").cast(StringType), "$.venue.country").alias("country")
    , get_json_object(col("body").cast(StringType), "$.venue.city").alias("city")
    , get_json_object(col("body").cast(StringType), "$.group.category.shortname").alias("category"))

// Stream the projected rows into Cosmos DB through the connector's sink provider
val cosmosDbStreamWriter = ehStream
  .writeStream
  .outputMode("append")
  .format(classOf[CosmosDBSinkProvider].getName)
  .options(configMap)
  .option("checkpointLocation", "/tmp/streamingCheckpoint")
  .trigger(Trigger.ProcessingTime(1000 * 3)) // every 3 seconds
  .start()
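
For context, Spark throws this AnalysisException whenever the batch Dataset.write API is invoked on a DataFrame whose isStreaming flag is true; as the maintainer comments below indicate, here the offending call was happening inside the connector's sink rather than in the query above. A minimal sketch that reproduces the same exception, using Spark's built-in rate source purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("write-repro").getOrCreate()

// The "rate" source emits trivial (timestamp, value) rows as a streaming DataFrame.
val streamingDf = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

streamingDf.isStreaming // true

// Calling the batch API on it fails immediately with:
// org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
streamingDf.write.parquet("/tmp/write-repro-out")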

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 20 (6 by maintainers)

Top GitHub Comments

3 reactions
fbeltrao commented, Apr 2, 2018

Hi @dennyglee,

Thanks for the update.

I have written up a way to stream to Cosmos DB while the bug fix is not yet available: https://medium.com/@fbeltrao/an-introduction-to-spark-streaming-from-a-net-developer-d773a3275a73
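
The linked post works around the broken sink with Structured Streaming's foreach sink, writing each row directly through the azure-documentdb SDK already listed above. A minimal sketch of that pattern (the CosmosDBForeachWriter class and its wiring are illustrative, not the article's exact code):

import org.apache.spark.sql.ForeachWriter
import org.apache.spark.sql.functions.{col, struct, to_json}
import com.microsoft.azure.documentdb.{ConnectionPolicy, ConsistencyLevel, Document, DocumentClient}

// One DocumentClient per partition/epoch; each incoming JSON string becomes a document.
class CosmosDBForeachWriter(endpoint: String, masterKey: String,
                            database: String, collection: String)
    extends ForeachWriter[String] {

  @transient private var client: DocumentClient = _
  private val collectionLink = s"dbs/$database/colls/$collection"

  override def open(partitionId: Long, version: Long): Boolean = {
    client = new DocumentClient(endpoint, masterKey,
      ConnectionPolicy.GetDefault(), ConsistencyLevel.Session)
    true // accept every partition/epoch; add de-duplication here if you need exactly-once
  }

  override def process(value: String): Unit =
    // Last argument = false lets Cosmos DB generate the document id.
    client.createDocument(collectionLink, new Document(value), null, false)

  override def close(errorOrNull: Throwable): Unit =
    if (client != null) client.close()
}

// Usage: serialize each row to a JSON string, then attach the foreach sink.
import spark.implicits._

val query = ehStream
  .select(to_json(struct(ehStream.columns.map(col): _*)).alias("value"))
  .as[String]
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", "/tmp/streamingCheckpoint-foreach")
  .foreach(new CosmosDBForeachWriter(
    "https://xxx.documents.azure.com:443/", "xxxx", "meetup", "event"))
  .start()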

2 reactions
dennyglee commented, Mar 31, 2018

Apologies for the delayed response on this thread. We think we have identified the issue and resolved it in a recent PR, but before I say we’re done, let us run a few more tests to validate. We did this as part of #187 and #188, where both the Event Hubs and Cosmos DB connectors are using the developer API internalCreateDataframe.
