
PySpark write to Kafka topic with Confluent schema throws "Not a union"

See original GitHub issue

I have a dataframe with key and value columns that I want to write to Kafka with Confluent Avro serialization. Here is the snippet of code; running it fails with the stack trace shown after it.

from pyspark.sql.column import Column  # needed to wrap the JVM column returned by ABRiS

def convert_df_to_avro(self, spark_context, data_frame, schema_registry_url, topic):
    jvm_gateway = spark_context._gateway.jvm
    abris_avro = jvm_gateway.za.co.absa.abris.avro
    # Resolve the TOPIC_NAME naming strategy from the Scala singleton object.
    naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager,
                                      "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
    schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
                                   "schema.registry.topic": topic,
                                   "value.schema.id": "latest",
                                   "value.schema.naming.strategy": naming_strategy}

    # Build an immutable scala.collection.immutable.Map from the Python dict,
    # since the ABRiS API expects a Scala Map.
    conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
    for k, v in schema_registry_config_dict.items():
        conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))

    # Call ABRiS's to_confluent_avro on the JVM column and wrap the result
    # back into a PySpark Column.
    serialized_df = data_frame.select(
        Column(abris_avro.functions.to_confluent_avro(data_frame._jdf.col("value"), conf_map))
        .alias("value"))

    return serialized_df
Caused by: org.apache.avro.AvroRuntimeException: Not a union: {schema from confluent}
	at org.apache.avro.Schema.getTypes(Schema.java:299)
	at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
	at org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:48)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer$lzycompute(CatalystDataToAvro.scala:44)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer(CatalystDataToAvro.scala:43)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.nullSafeEval(CatalystDataToAvro.scala:49)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
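
The trace itself points at the cause: Spark's AvroSerializer resolves a nullable Catalyst field via resolveNullableType, which expects the matching Avro type to be a union such as ["null", "string"], and Schema.getTypes throws "Not a union" when the schema fetched from the registry declares a plain, non-union type for that field. One workaround, assuming the value column never actually contains nulls, is to rebuild the dataframe with a non-nullable schema before serializing. A minimal sketch (as_non_nullable is a hypothetical helper, not part of Spark or ABRiS):

    from pyspark.sql.types import StructField, StructType

    def as_non_nullable(schema):
        # Mark every top-level field as non-nullable; nested structs would
        # need the same treatment applied recursively.
        return StructType([StructField(f.name, f.dataType, nullable=False)
                           for f in schema.fields])

    # Forcing a new schema onto existing rows goes through the RDD;
    # `spark` is assumed to be the active SparkSession.
    non_nullable_df = spark.createDataFrame(data_frame.rdd,
                                            as_non_nullable(data_frame.schema))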

Is this to_confluent_avro function available in PySpark? Even the README only mentions reading from Kafka; what about writing? If it is available, can anyone please provide an example?
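
ABRiS is a Scala library, so to_confluent_avro has no native PySpark wrapper; going through the py4j gateway, as in the snippet above, appears to be the usual route. Once the conversion succeeds, the write side is just Spark's standard Kafka sink. A minimal batch-write sketch (the bootstrap server address is a placeholder):

    # Write the serialized dataframe to Kafka; replace the bootstrap server
    # with the real cluster address.
    (serialized_df
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", topic)
        .save())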

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17

Top GitHub Comments

1 reaction
sivasai-quartic commented, Jul 1, 2020

Yeah @cerveada, it works fine on Spark 3 with nullable fields, but it throws a warning message. I'm using ABRiS 3.2.0 with Spark 3.0. I'm just curious how to_avro and from_avro work internally: does it call the Schema Registry API for every column serialization/deserialization in the data frame?
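
For completeness, the registry-side alternative to making columns non-nullable is to declare the field as a nullable union in the registered schema, which is the shape Spark expects for a nullable column. An illustrative sketch only; the record and field names are placeholders, not the actual registered schema:

    # Illustrative Avro schema with a ["null", ...] union field, registered
    # under the topic's value subject; names here are placeholders.
    value_schema_json = """
    {
      "type": "record",
      "name": "Value",
      "namespace": "example",
      "fields": [
        {"name": "payload", "type": ["null", "string"], "default": null}
      ]
    }
    """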

1 reaction
cerveada commented, Jun 30, 2020

Read more comments on GitHub >

Top Results From Across the Web

  • Integrating Spark Structured Streaming with the Confluent ...
    Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal...
  • Avro Schema Serializer and Deserializer
    Start Confluent Platform using the following command: · Verify registered schema types. · Use the producer to send Avro records in JSON as...
  • Deserialzing Confluent Avro Records in Kafka with Spark
    If you have a Kafka cluster populated with Avro records governed by Confluent Schema Registry, you can't simply add spark-avro dependency to...
  • Kafka, Avro Serialization and the Schema Registry
    Confluent Schema Registry stores Avro Schemas for Kafka producers and... You can change a type to a union that contains original type...
  • Empty schema not supported - Cloudera Documentation
    Writing a dataframe with an empty or nested empty schema using any file format is allowed and will not throw an exception. Spark...
