Improve Avro canonicalizer

The current implementation misses some situations where schemas are semantically the same. For example, if one rendering of a schema specifies a namespace and another specifies full names:

{"name":"n","namespace":"ns","type":"record","fields":[]}

is essentially the same schema as:

{"name":"ns.n","type":"record","fields":[]}

Another example is when a schema contains a doc attribute, which exists to provide user documentation and is ignored when the schema is used for serialization or resolution. So this schema:

{"name":"n","type":"record","fields":[],"doc":"Hello world"}

can be processed the same as this schema:

{"name":"n","type":"record","fields":[],"doc":"Hi world!"}

Currently the registry treats the above as separate schemas when retrieving an artifact’s metadata by content (e.g. a GET request to /artifacts/{artifactId}/meta). The same behaviour can be observed with Kafka client serdes that fetch a schema with metadata by schema content (e.g. a POST to /ccompat/subjects/{schema}).

I think that the canonicalization used for Avro could be improved by comparing schemas using Parsing Canonical Form, which the Avro spec describes as:

…a transformation of a writer’s schema that lets us define what it means for two schemas to be “the same” for the purpose of reading data written against the schema.

This transformation is implemented by the SchemaNormalization class which is part of the Java Avro implementation.
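To make the idea concrete, here is a minimal pure-Python sketch of a few of the Parsing Canonical Form rules (primitive simplification, full-name resolution, attribute stripping, attribute ordering). It is not the Java SchemaNormalization class and not a complete implementation of the spec: the function names `canonical_form` and `_canonicalize` are made up for illustration, and details such as string escaping and integer normalization are left out.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_TYPES = {"record", "error", "enum", "fixed"}


def _canonicalize(schema, enclosing_ns=""):
    """Simplified canonicalization of a parsed Avro schema (illustrative only)."""
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        if schema in PRIMITIVES or "." in schema or not enclosing_ns:
            return schema
        return f"{enclosing_ns}.{schema}"
    if isinstance(schema, list):
        # Union: canonicalize every branch.
        return [_canonicalize(branch, enclosing_ns) for branch in schema]

    type_ = schema["type"]
    if not isinstance(type_, str):
        return _canonicalize(type_, enclosing_ns)
    if type_ in PRIMITIVES:
        # {"type": "int"} collapses to "int".
        return type_

    out = {}  # insertion order gives: name, type, then the type-specific attribute
    ns = enclosing_ns
    if type_ in NAMED_TYPES:
        name = schema["name"]
        if "." in name:
            # Already a full name; its prefix becomes the namespace for children.
            fullname, ns = name, name.rsplit(".", 1)[0]
        else:
            ns = schema.get("namespace", enclosing_ns)
            fullname = f"{ns}.{name}" if ns else name
        out["name"] = fullname
    out["type"] = type_
    if type_ in ("record", "error"):
        # Only name and type survive on fields; doc, default, aliases are dropped.
        out["fields"] = [
            {"name": f["name"], "type": _canonicalize(f["type"], ns)}
            for f in schema["fields"]
        ]
    elif type_ == "enum":
        out["symbols"] = list(schema["symbols"])
    elif type_ == "array":
        out["items"] = _canonicalize(schema["items"], ns)
    elif type_ == "map":
        out["values"] = _canonicalize(schema["values"], ns)
    elif type_ == "fixed":
        out["size"] = schema["size"]
    return out


def canonical_form(schema_json: str) -> str:
    """Return a whitespace-free canonical JSON string for a schema."""
    parsed = json.loads(schema_json)
    return json.dumps(_canonicalize(parsed), separators=(",", ":"), ensure_ascii=False)


with_namespace = '{"name":"n","namespace":"ns","type":"record","fields":[]}'
with_fullname = '{"name":"ns.n","type":"record","fields":[]}'
print(canonical_form(with_namespace))  # {"name":"ns.n","type":"record","fields":[]}
print(canonical_form(with_namespace) == canonical_form(with_fullname))  # True
```

With this sketch, the two `doc` examples above also collapse to the same string, because `doc` is not among the attributes that are kept.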

If this sounds like a useful improvement, I’m happy to open a PR that switches over the implementation of AvroContentCanonicalizer.

forsberg commented, Aug 28, 2020

I hit a side-effect of this today in a rather annoying way. Consider the following example schema:

{'type': 'record',
 'name': 'ID',
 'namespace': 'com.example',
 'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}]}

  1. I used the Python Schema Registry Client to register my schema. For some reason, it runs the schema through fastavro.parse_schema first, which transforms it into this:

{'type': 'record',
 'name': 'com.example.ID',
 'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}],
 '__fastavro_parsed': True,
 '__named_schemas': {'com.example.ID': {'type': 'record',
   'name': 'com.example.ID',
   'fields': [{'doc': 'ID', 'name': 'id', 'type': 'int'}]}}}

Note: the name was modified to include the namespace, and the namespace field was removed. This is perfectly fine.

  2. I publish messages to an Avro topic and try to use the Kafka Connect S3 Connector to read them and save them to S3.

The S3 connector first retrieves the schema ID by use of the /api/ccompat/schemas/ids endpoint of Apicurio Registry, and then does a POST to /api/ccompat/subjects/<topic>-<key|value> to check which subject-version the schema has. However, as it turns out, the S3 Connector reformats the schema into its namespaced form again, i.e. the first representation above.

  3. Apicurio Registry, not understanding that both are exactly the same schema, responds with a 404, and the S3 Connector is unhappy and crashes 😦
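
The failure mode in these steps can be reproduced in miniature. The sketch below assumes a hypothetical content lookup keyed on a hash of key-sorted JSON; `naive_key` and `store` are invented for illustration and are not the registry's actual implementation. fastavro's bookkeeping keys (`__fastavro_parsed`, `__named_schemas`) are left out for brevity.

```python
import hashlib
import json

# The form that was registered (full name, no namespace attribute).
registered = {
    "type": "record",
    "name": "com.example.ID",
    "fields": [{"doc": "ID", "name": "id", "type": "int"}],
}

# The form the consumer sends when looking the schema up again.
looked_up = {
    "type": "record",
    "name": "ID",
    "namespace": "com.example",
    "fields": [{"doc": "ID", "name": "id", "type": "int"}],
}


def naive_key(schema: dict) -> str:
    """Hypothetical content key: SHA-256 of key-sorted, whitespace-free JSON."""
    text = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical store mapping content key -> global ID.
store = {naive_key(registered): 1}

# Sorting keys does not reconcile name/namespace, so the lookup misses.
print(store.get(naive_key(looked_up)))  # None
```

Any lookup that only normalizes JSON formatting, rather than Avro semantics, will miss in the same way, which is the kind of miss that surfaces as the 404 described above.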

carlesarnal commented, Oct 14, 2022

Hi @carlesarnal

As per contributing your canonicalizer, please, go ahead, contributions are always more than welcome!

Good to hear. Will do.

the canonicalizer is exactly the same

I didn’t quite understand this. Which canonicalizer are you referring to? This one?

Yes, exactly that one.
