Improve Avro canonicalizer
The current implementation misses some situations where schemas are semantically the same. For example, one rendering of a schema may specify a namespace while another specifies the full name:
{"name":"n","namespace":"ns","type":"record","fields":[]}
is essentially the same schema as:
{"name":"ns.n","type":"record","fields":[]}
Another example is when a schema contains a doc attribute, which exists to provide user documentation and is ignored when the schema is used for serialization or resolution. So this schema:
{"name":"n","type":"record","fields":[],"doc":"Hello world"}
can be processed the same as this schema:
{"name":"n","type":"record","fields":[],"doc":"Hi world!"}
Currently the registry treats each pair of schemas above as separate schemas when retrieving an artifact’s metadata by content (e.g. a GET request to /artifacts/{artifactId}/meta). This behaviour can also be observed with Kafka client serdes that fetch a schema with metadata by schema content (e.g. a POST to /ccompat/subjects/{schema}).
I think that the canonicalization used for Avro could be improved by comparing schemas using Parsing Canonical Form, which the Avro spec describes as:
…a transformation of a writer’s schema that let’s us define what it means for two schemas to be “the same” for the purpose of reading data written against the schema.
This transformation is implemented by the SchemaNormalization class which is part of the Java Avro implementation.
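To make the idea concrete, here is a minimal Python sketch of the two rules relevant to the examples above: replacing short names with full names (and dropping namespace), and stripping attributes such as doc that do not affect reading data. It is not the Avro library's implementation and covers only a simplified subset of Parsing Canonical Form; the function names canonicalize and _canonical are hypothetical and chosen purely for illustration.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_TYPES = {"record", "enum", "fixed"}


def _canonical(schema, enclosing_ns):
    """Simplified canonical form of a parsed schema (illustrative subset only)."""
    if isinstance(schema, str):
        # A primitive, or a reference to a named type that may need qualifying.
        if schema in PRIMITIVES or "." in schema or not enclosing_ns:
            return schema
        return f"{enclosing_ns}.{schema}"
    if isinstance(schema, list):
        # A union: canonicalize each branch.
        return [_canonical(branch, enclosing_ns) for branch in schema]

    kind = schema["type"]
    if not isinstance(kind, str):
        # e.g. {"type": {"type": "array", ...}}: unwrap the nested schema.
        return _canonical(kind, enclosing_ns)
    if kind in PRIMITIVES:
        # {"type": "int"} collapses to "int".
        return kind
    if kind == "array":
        return {"type": "array", "items": _canonical(schema["items"], enclosing_ns)}
    if kind == "map":
        return {"type": "map", "values": _canonical(schema["values"], enclosing_ns)}
    if kind not in NAMED_TYPES:
        raise ValueError(f"unsupported type: {kind!r}")

    # Resolve the full name; the namespace attribute itself is dropped.
    name = schema["name"]
    if "." in name:
        fullname, ns = name, name.rsplit(".", 1)[0]
    else:
        ns = schema.get("namespace", enclosing_ns)
        fullname = f"{ns}.{name}" if ns else name

    # Only attributes that matter for reading data are kept; doc, aliases,
    # defaults and so on are never copied into the output.
    out = {"name": fullname, "type": kind}
    if kind == "record":
        out["fields"] = [
            {"name": f["name"], "type": _canonical(f["type"], ns)}
            for f in schema["fields"]
        ]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    else:  # fixed
        out["size"] = schema["size"]
    return out


def canonicalize(schema_json):
    """Return the simplified canonical text, with no insignificant whitespace."""
    return json.dumps(_canonical(json.loads(schema_json), None), separators=(",", ":"))


with_namespace = '{"name":"n","namespace":"ns","type":"record","fields":[]}'
with_fullname = '{"name":"ns.n","type":"record","fields":[]}'
print(canonicalize(with_namespace) == canonicalize(with_fullname))  # True

doc_a = '{"name":"n","type":"record","fields":[],"doc":"Hello world"}'
doc_b = '{"name":"n","type":"record","fields":[],"doc":"Hi world!"}'
print(canonicalize(doc_a) == canonicalize(doc_b))  # True
```

In the Java implementation, the equivalent result would come from SchemaNormalization.toParsingForm(Schema), which applies the full set of rules from the spec rather than the subset sketched here.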
If this sounds like a useful improvement, I’m happy to open a PR that switches over the implementation of AvroContentCanonicalizer.
Issue Analytics
- Created 4 years ago
- Reactions:1
- Comments:12 (8 by maintainers)
Top GitHub Comments
I hit a side-effect of this today in a rather annoying way. Consider the following example schema:
Note: the name was modified to include the namespace, and namespace field removed. This is perfectly fine.
The S3 connector first retrieves the schema ID using the /api/ccompat/schemas/ids endpoint of Apicurio Registry, and then does a POST to /api/ccompat/subjects/<topic>-<key|value> to check which subject-version the schema has. However, as it turns out, the S3 connector reformats the schema into its namespaced form again, i.e. the first representation above.
Yes, exactly that one.