Improve Avro canonicalizer

The current implementation misses some situations where schemas are semantically the same. For example, if one rendering of a schema specifies a namespace and another specifies full names:

{"name":"n","namespace":"ns","type":"record","fields":[]}

is essentially the same schema as:

{"name":"ns.n","type":"record","fields":[]}

Another example is when a schema contains a doc attribute, which exists to provide user documentation and is ignored when the schema is used for serialization or resolution. So this schema:

{"name":"n","type":"record","fields":[],"doc":"Hello world"}

can be processed the same as this schema:

{"name":"n","type":"record","fields":[],"doc":"Hi world!"}

Currently the registry treats the schemas above as distinct when retrieving an artifact’s metadata by content (e.g. a GET request to /artifacts/{artifactId}/meta). The same behaviour can be observed with Kafka client serdes that fetch a schema and its metadata by schema content (e.g. a POST to /ccompat/subjects/{schema}).

I think that the canonicalization used for Avro could be improved by comparing schemas using Parsing Canonical Form, which the Avro spec describes as:

…a transformation of a writer’s schema that lets us define what it means for two schemas to be “the same” for the purpose of reading data written against the schema.

This transformation is implemented by the SchemaNormalization class which is part of the Java Avro implementation.
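To illustrate the idea, here is a minimal Python sketch of a simplified subset of the Parsing Canonical Form rules: fold namespaces into full names, strip non-essential attributes such as doc, and emit attributes in a fixed order. It is not the SchemaNormalization implementation and it is not the registry's code. The function names are hypothetical, and the sketch assumes that the "type" of an object schema is a plain string and that the input is valid Avro JSON.

```python
import json

_PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
_NAMED = {"record", "error", "enum", "fixed"}


def _canon(schema, namespace=""):
    """Return a simplified canonical structure for a parsed Avro schema."""
    if isinstance(schema, str):
        # Primitive name, or a reference to a named type that may need qualifying.
        if schema in _PRIMITIVES or "." in schema or not namespace:
            return schema
        return f"{namespace}.{schema}"
    if isinstance(schema, list):
        # Union: canonicalize each branch.
        return [_canon(branch, namespace) for branch in schema]

    kind = schema["type"]  # assumption: a plain string such as "record" or "int"
    if kind in _PRIMITIVES:
        return kind  # {"type": "int"} collapses to "int"

    out = {}
    if kind in _NAMED:
        name = schema["name"]
        if "." in name:
            # A full name overrides any namespace attribute.
            namespace = name.rsplit(".", 1)[0]
        else:
            namespace = schema.get("namespace", namespace)
            if namespace:
                name = f"{namespace}.{name}"
        out["name"] = name
    out["type"] = kind

    # Keep only the attributes that matter for reading data; doc, aliases,
    # defaults and so on are dropped simply by never being copied.
    if kind in ("record", "error"):
        out["fields"] = [
            {"name": f["name"], "type": _canon(f["type"], namespace)}
            for f in schema["fields"]
        ]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    elif kind == "array":
        out["items"] = _canon(schema["items"], namespace)
    elif kind == "map":
        out["values"] = _canon(schema["values"], namespace)
    elif kind == "fixed":
        out["size"] = int(schema["size"])
    return out


def to_canonical(schema_text):
    """Render schema JSON text in the simplified canonical form, without whitespace."""
    return json.dumps(
        _canon(json.loads(schema_text)), separators=(",", ":"), ensure_ascii=False
    )


namespaced = '{"name":"n","namespace":"ns","type":"record","fields":[]}'
full_name = '{"name":"ns.n","type":"record","fields":[]}'
print(to_canonical(namespaced) == to_canonical(full_name))  # True
```

With this normalization, the two doc variants above also compare equal, because doc is never copied into the output. A real implementation would additionally need the string and integer normalization rules from the spec, which this sketch leaves out.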

If this sounds like a useful improvement, I’m happy to open a PR that switches over the implementation of AvroContentCanonicalizer.

Top GitHub Comments

1 reaction

forsberg commented, Aug 28, 2020

I hit a side-effect of this today in a rather annoying way. Consider the following example schema:

{'type': 'record',
 'name': 'ID',
 'namespace': 'com.example',
 'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}]}

I used the Python Schema Registry Client to register my schema. For some reason, it runs the schema through fastavro.parse_schema first, which transforms it into this:

{'type': 'record',
 'name': 'com.example.ID',
 'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}],
 '__fastavro_parsed': True,
 '__named_schemas': {'com.example.ID': {'type': 'record',
   'name': 'com.example.ID',
   'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}]}}}

Note: the name was modified to include the namespace, and the namespace field was removed. This is perfectly fine.

I publish messages to an Avro topic and try to use the Kafka Connect S3 Connector to read them and save them to S3.

The S3 connector first retrieves the schema ID by use of /api/ccompat/schemas/ids endpoint of Apicurio Registry, and then does a POST to /api/ccompat/subjects/<topic>-<key|value> to check which subject-version the schema has. However, as it turns out, the S3 Connector reformats the schema into its namespaced form again, i.e. the first representation above.

Apicurio Registry, not understanding that both are exactly the same schema, responds with a 404, and the S3 Connector is unhappy and crashes 😦
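The failure mode can be reproduced with a toy model. The sketch below assumes a hypothetical in-memory registry (ToyRegistry, naive_key) that indexes schemas by a hash of key-sorted JSON. This is an assumption made for illustration only, not Apicurio Registry's actual implementation. It shows that a content comparison which ignores Avro naming rules misses the lookup even though key order and whitespace are already handled.

```python
import hashlib
import json


def naive_key(schema_text):
    """Hash of key-sorted, whitespace-free JSON. Knows nothing about Avro names."""
    normalized = json.dumps(json.loads(schema_text), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ToyRegistry:
    """Hypothetical stand-in for a registry that looks schemas up by content."""

    def __init__(self):
        self._versions = {}

    def register(self, subject, schema_text):
        # Return the existing version for known content, otherwise assign the next one.
        key = (subject, naive_key(schema_text))
        return self._versions.setdefault(key, len(self._versions) + 1)

    def lookup(self, subject, schema_text):
        # None plays the role of the 404 response.
        return self._versions.get((subject, naive_key(schema_text)))


# What the producer registered (full name, as produced by fastavro.parse_schema).
registered = '{"type":"record","name":"com.example.ID","fields":[{"doc":"ID","name":"id","type":"int"}]}'
# What the connector sends (separate namespace attribute).
looked_up = '{"type":"record","name":"ID","namespace":"com.example","fields":[{"doc":"ID","name":"id","type":"int"}]}'

registry = ToyRegistry()
registry.register("topic-value", registered)
print(registry.lookup("topic-value", registered))  # 1
print(registry.lookup("topic-value", looked_up))   # None
```

Sorting keys is enough to absorb differences in attribute order, but the two renderings still differ in their name and namespace attributes, so only a comparison that resolves full names would treat them as the same schema.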

0 reactions

carlesarnal commented, Oct 14, 2022

Hi @carlesarnal

As per contributing your canonicalizer, please, go ahead, contributions are always more than welcome!

Good to hear. Will do.

the canonicalizer is exactly the same

I didn’t quite understand this. Which canonicalizer are you referring to? This one?

Yes, exactly that one.
