
PySpark write to Kafka topic with Confluent schema throws "Not a union"

See original GitHub issue

I have a dataframe with key and value columns that I want to write to Kafka with Confluent Avro serialization. Here is the snippet of code; running it fails with the stack trace shown after it.

from pyspark.sql.column import Column  # needed to wrap the JVM column returned by ABRiS

def convert_df_to_avro(self, spark_context, data_frame, schema_registry_url, topic):
    jvm_gateway = spark_context._gateway.jvm
    abris_avro = jvm_gateway.za.co.absa.abris.avro
    # Resolve the TOPIC_NAME naming strategy from the Scala singleton object.
    naming_strategy = getattr(getattr(abris_avro.read.confluent.SchemaManager,
                                      "SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
    schema_registry_config_dict = {"schema.registry.url": schema_registry_url,
                                   "schema.registry.topic": topic,
                                   "value.schema.id": "latest",
                                   "value.schema.naming.strategy": naming_strategy}

    # Build an immutable scala.collection.immutable.Map from the Python dict,
    # since the ABRiS API expects a Scala Map.
    conf_map = getattr(getattr(jvm_gateway.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
    for k, v in schema_registry_config_dict.items():
        conf_map = getattr(conf_map, "$plus")(jvm_gateway.scala.Tuple2(k, v))

    # Call ABRiS's to_confluent_avro on the JVM column and wrap the result
    # back into a PySpark Column.
    serialized_df = data_frame.select(
        Column(abris_avro.functions.to_confluent_avro(data_frame._jdf.col("value"), conf_map))
        .alias("value"))

    return serialized_df
Caused by: org.apache.avro.AvroRuntimeException: Not a union: {schema from confluent}
	at org.apache.avro.Schema.getTypes(Schema.java:299)
	at org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
	at org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:48)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer$lzycompute(CatalystDataToAvro.scala:44)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.serializer(CatalystDataToAvro.scala:43)
	at za.co.absa.abris.avro.sql.CatalystDataToAvro.nullSafeEval(CatalystDataToAvro.scala:49)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
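
The trace itself points at the cause: Spark's AvroSerializer resolves a nullable Catalyst field via resolveNullableType, which expects the matching Avro type to be a union such as ["null", "string"], and Schema.getTypes throws "Not a union" when the schema fetched from the registry declares a plain, non-union type for that field. One workaround, assuming the value column never actually contains nulls, is to rebuild the dataframe with a non-nullable schema before serializing. A minimal sketch (as_non_nullable is a hypothetical helper, not part of Spark or ABRiS):

    from pyspark.sql.types import StructField, StructType

    def as_non_nullable(schema):
        # Mark every top-level field as non-nullable; nested structs would
        # need the same treatment applied recursively.
        return StructType([StructField(f.name, f.dataType, nullable=False)
                           for f in schema.fields])

    # Forcing a new schema onto existing rows goes through the RDD;
    # `spark` is assumed to be the active SparkSession.
    non_nullable_df = spark.createDataFrame(data_frame.rdd,
                                            as_non_nullable(data_frame.schema))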

Is this to_confluent_avro function available in PySpark? Even the README only mentions reading from Kafka; what about writing? If it is available, can anyone please provide an example?
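
ABRiS is a Scala library, so to_confluent_avro has no native PySpark wrapper; going through the py4j gateway, as in the snippet above, appears to be the usual route. Once the conversion succeeds, the write side is just Spark's standard Kafka sink. A minimal batch-write sketch (the bootstrap server address is a placeholder):

    # Write the serialized dataframe to Kafka; replace the bootstrap server
    # with the real cluster address.
    (serialized_df
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", topic)
        .save())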

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17

Top GitHub Comments

1 reaction
sivasai-quartic commented, Jul 1, 2020

Yeah @cerveada, it works fine on Spark 3 with nullable fields, but it throws a warning message. I'm using ABRiS 3.2.0 with Spark 3.0. I'm just curious how to_avro and from_avro work internally: does it call the Schema Registry API for every column serialization/deserialization in the data frame?
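
For completeness, the registry-side alternative to making columns non-nullable is to declare the field as a nullable union in the registered schema, which is the shape Spark expects for a nullable column. An illustrative sketch only; the record and field names are placeholders, not the actual registered schema:

    # Illustrative Avro schema with a ["null", ...] union field, registered
    # under the topic's value subject; names here are placeholders.
    value_schema_json = """
    {
      "type": "record",
      "name": "Value",
      "namespace": "example",
      "fields": [
        {"name": "payload", "type": ["null", "string"], "default": null}
      ]
    }
    """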

1 reaction
cerveada commented, Jun 30, 2020

Read more comments on GitHub >

Top Results From Across the Web

  • Integrating Spark Structured Streaming with the Confluent ...
    Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal...
  • Avro Schema Serializer and Deserializer
    Start Confluent Platform using the following command: · Verify registered schema types. · Use the producer to send Avro records in JSON as...
  • Deserialzing Confluent Avro Records in Kafka with Spark
    If you have a Kafka cluster populated with Avro records governed by Confluent Schema Registry, you can't simply add spark-avro dependency to...
  • Kafka, Avro Serialization and the Schema Registry
    Confluent Schema Registry stores Avro Schemas for Kafka producers and... You can change a type to a union that contains original type...
  • Empty schema not supported - Cloudera Documentation
    Writing a dataframe with an empty or nested empty schema using any file format is allowed and will not throw an exception. Spark...
