PySpark write to Kafka topic with Confluent schema throws "Not a union"
I have a dataframe with key and value columns, and I want to write it to Kafka with Confluent Avro serialization. Here is the snippet of code:
from pyspark.sql.column import Column

def convert_df_to_avro(self, spark_context, data_frame, schema_registry_url, topic):
    # Reach the ABRiS JVM classes through the Py4J gateway.
    jvm_gateway = spark_context._gateway.jvm
    abris_avro = jvm_gateway.za.co.absa.abris.avro
    naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager,
                                      "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
    schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
                                   "schema.registry.topic": topic,
                                   "value.schema.id": "latest",
                                   "value.schema.naming.strategy": naming_strategy}
    # Convert the Python dict into an immutable scala.collection.immutable.Map.
    conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
    for k, v in schema_registry_config_dict.items():
        conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))
    # Wrap the Java Column returned by to_confluent_avro back into a PySpark Column.
    serialized_df = data_frame.select(
        Column(abris_avro.functions.to_confluent_avro(data_frame._jdf.col("value"), conf_map))
        .alias("value"))
    return serialized_df
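For context, here is a minimal sketch of how the write step might look, assuming the serialized dataframe is pushed with Spark's standard batch Kafka sink; the bootstrap address is a placeholder and not taken from the original report:

(serialized_df
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker address
 .option("topic", topic)
 .save())

On execution, the job fails with: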
Caused by: org.apache.avro.AvroRuntimeException: Not a union: {schema from confluent}
at org.apache.avro.Schema.getTypes(Schema.java:299)
at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
at org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:48)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer$lzycompute(CatalystDataToAvro.scala:44)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer(CatalystDataToAvro.scala:43)
at za.co.absa.abris.avro.sql.CatalystDataToAvro.nullSafeEval(CatalystDataToAvro.scala:49)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
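The resolveNullableType frame above is the usual origin of this message: when the Spark field being serialized is marked nullable, spark-avro expects the target Avro schema to be a union that includes null, so a plain (non-union) record schema fetched from the registry raises "Not a union". One possible workaround, sketched below under the assumption that a spark session and the data_frame from the snippet above are available, is to rebuild the dataframe with the value field flagged as non-nullable before calling to_confluent_avro (alternatively, the registered schema can be made a union with null):

from pyspark.sql.types import StructType, StructField

# Hedged sketch: copy the existing schema but flag the "value" field as
# non-nullable so the Catalyst type matches a non-union Avro schema.
fields = [StructField(f.name, f.dataType, nullable=False) if f.name == "value" else f
          for f in data_frame.schema.fields]
non_nullable_df = spark.createDataFrame(data_frame.rdd, StructType(fields))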
Is the to_confluent_avro function available in PySpark? The README only mentions reading from Kafka; what about writing? If it is available, can anyone please provide an example?
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yeah @cerveada. It's working fine on Spark 3 with nullable fields, but it throws a warning message. I'm using ABRiS 3.2.0 with Spark 3.0. I'm just curious how to_avro and from_avro work internally: do they call the Schema Registry API for every column serialization/deserialization in the data frame?
@brandon-stanley You can do it the same way as it’s done in the documentation. https://github.com/AbsaOSS/ABRiS/blob/master/documentation/confluent-avro-documentation.md#spark-to-avro