Get Schema Registry settings from Spark config
Currently the user has to provide Schema Registry settings for every invocation of a transformation function, for example:
dataFrame.select(from_avro(col("value"), schemaRegistryConfig) as 'data)
A better approach could be to take these settings from the Spark configuration:
private val paramPrefix = "abris."

private def getSchemaRegistryParams: Map[String, String] =
  spark.conf
    .getAll
    .filterKeys(_.startsWith(paramPrefix))
    .map { case (k, v) =>
      // Drop the "abris." prefix from the key
      k.substring(paramPrefix.length) -> v
    }
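The prefix-stripping can be exercised without a SparkSession by running the same logic over a plain Map (a minimal sketch; object and method names are illustrative). Note that the key must be cut with substring(paramPrefix.length), not substring(0, paramPrefix.length), which would return the prefix itself:

```scala
object PrefixFilterSketch {
  private val paramPrefix = "abris."

  // Keep only the keys that start with the prefix, then drop the prefix,
  // mirroring what filtering spark.conf.getAll would do.
  def filterParams(all: Map[String, String]): Map[String, String] =
    all.collect {
      case (k, v) if k.startsWith(paramPrefix) =>
        k.substring(paramPrefix.length) -> v
    }
}
```

For example, filterParams(Map("abris.schema.registry.url" -> "http://sr", "spark.app.name" -> "demo")) keeps only the Abris entry, returned under the key "schema.registry.url".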
Then the user could specify these settings once, when creating the SparkSession instance, for example:
SparkSession.builder()
  .config(SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC, "events")
  .config(SchemaManager.PARAM_SCHEMA_REGISTRY_URL, "http://schema.registry.com")
  ...
  .getOrCreate()
The above example assumes that all Abris-specific settings start with a common prefix, abris., which makes it easier to fetch all settings at once.
Benefits:
- More concise code: no need to specify schemaRegistryConfig on every function call.
- Can use Spark-standard ways to configure: SparkSession.builder().config(...) in code, spark.conf, a configuration file, etc. Moreover, Abris functions can be run from spark-shell, thanks to --conf command-line arguments.
- Easier and safer unit tests. Currently each unit test must ensure that SchemaManager.reset() is invoked; otherwise a previous test may spoil a subsequent one. If SchemaManager takes its settings from the SparkSession, tests only need to properly stop the SparkSession after each test suite.
I would suggest reworking SchemaManager into a class:
class SchemaManager(implicit spark: SparkSession) {
  private val schemaRegistryParams: Map[String, String] = getSchemaRegistryParams

  private def getSchemaRegistryParams: Map[String, String] = {
    // Get Abris-specific settings from the 'spark' instance
    ...
  }
}
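As a rough illustration of that rework (hypothetical names; the real class would read spark.conf.getAll, which is replaced here by a plain constructor parameter so the sketch runs standalone):

```scala
// Sketch only: settings are captured at construction instead of living in a
// mutable singleton, so each test gets isolation by creating a fresh
// instance -- no SchemaManager.reset() needed.
class SchemaManagerSketch(conf: Map[String, String]) {
  private val paramPrefix = "abris."

  // Abris-specific settings, with the common prefix stripped
  val schemaRegistryParams: Map[String, String] =
    conf.collect {
      case (k, v) if k.startsWith(paramPrefix) =>
        k.substring(paramPrefix.length) -> v
    }
}
```

With this shape, two suites that construct their own SchemaManagerSketch instances cannot leak state into each other.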
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 1
- Comments: 5
Top Results From Across the Web

schema-registry-integration - Databricks
Schema Registry integration in Spark Structured Streaming. This notebook demonstrates how to use the from_avro / to_avro functions to read/write data ...

Schema Registry Overview - Confluent Documentation
Schema Registry is designed to be distributed, with single-primary architecture, and ZooKeeper/Kafka coordinates primary election (based on the configuration).

schema-registry-integration - Databricks - Microsoft Learn
Schema Registry integration in Spark Structured Streaming. This notebook demonstrates how to use the from_avro / to_avro functions to read/write data ...

Integrating Spark Structured Streaming with the Confluent ...
getSchema ) //key schema is typically just string but you can do the same ... It connects to Confluent Schema Registry through Spark...

Connecting Apache Spark to Apache Kafka Schema Registry ...
For more details, check the "Structured Streaming and Apache Kafka Schema Registry " ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The new version of Abris will allow users to connect to an unlimited number of schema registries, so we want to be able to provide different configurations for each use.
I will implement the config loading, but it will take the config from Spark as a default, and any config you provide in the expression will override the one from Spark.
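The precedence described above (Spark config as the default, per-expression config winning) can be sketched as a simple map merge; the object and parameter names here are hypothetical, not the actual Abris API:

```scala
object ConfigPrecedenceSketch {
  // Map's ++ operator lets right-hand entries win, which matches
  // "config provided in the expression overrides the Spark-level default".
  def effectiveConfig(
      sparkDefaults: Map[String, String],
      expressionConf: Map[String, String]): Map[String, String] =
    sparkDefaults ++ expressionConf
}
```

Keys present only in the Spark defaults survive the merge; any key repeated in the expression config replaces the default value.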
Hello, we had a discussion with Felipe, and we decided not to support this feature. Even though it could be useful in some simple use cases, there are several reasons not to do it:
Overall it would simplify most basic usage of Abris, but in other cases it would do nothing or would complicate things, and that’s why we think it’s not worth it.