SNIP 42: Kafka Avro Serializer and Deserializer
Motivation
KoP supports producing and consuming via Kafka clients. The Kafka API is different from the Pulsar API. Let's focus on the producer first. Using a Kafka producer to produce messages requires configuring the key serializer and the value serializer.
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

final Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
final KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Here we can just use ProducerRecord<>, I added the generic types here
// to indicate it accepts two generic parameters.
producer.send(new ProducerRecord<String, String>("my-topic", "key", "value")).get();
```
For the code example above, two `StringSerializer` objects will be created inside the `KafkaProducer`; they are responsible for serializing the key and value into bytes. See the `serialize` method in the `Serializer` interface:
```java
public interface Serializer<T> extends Closeable {
    byte[] serialize(String topic, T data);
    /* ... */
}
```
The `StringSerializer` implements `Serializer<String>` so that the `String` values in a `ProducerRecord` can be serialized to bytes.
Similarly, the Kafka consumer requires a `Deserializer` implementation to decode the bytes into the generic type.
```java
public interface Deserializer<T> extends Closeable {
    T deserialize(String topic, byte[] data);
    /* ... */
}
```
As we can see, Kafka already provides default implementations of serializers and deserializers for the primitive types. But for generic types like `Object` or Avro's `GenericRecord`/`SpecificRecord`, Kafka users have to use Confluent's Avro serializer and deserializer (SerDes for short). Confluent's Avro SerDes leverages a default method in both interfaces,
```java
default void configure(Map<String, ?> configs, boolean isKey) {
    // intentionally left blank
}
```
which allows users to configure the URL of the Schema Registry that is also provided by Confluent. The SerDes is then responsible for registering or fetching schemas on the Confluent Schema Registry.
To support users who need Avro schemas, there are two options:
- Implement the same REST APIs of Confluent Schema Registry.
- Implement our own Avro SerDes.
We have discussed these two solutions before. As a summary, here is the comparison from my perspective.
For the 1st solution:
- PROs
- No changes needed for Kafka clients of all languages that support communication with Confluent Schema Registry.
- CONs
- We must use the Confluent REST API to manage these schemas, which means we need to implement or reuse the admin tools and learn those admin APIs.
- In Confluent's model, a schema has a unique integer ID and can be shared by multiple topics, which cannot be mapped directly to a Pulsar schema. So we would need to store the additional Confluent schema metadata somewhere (such as system topics or ZooKeeper).
For the 2nd solution:
- PROs
- We can reuse the Pulsar schema.
- It becomes possible to access the same topic from both Pulsar clients and Kafka clients.
- It's much easier than the 1st solution.
- CONs
- We need to develop the SerDes for Kafka clients in every language.
- Applications that already use Confluent's Avro SerDes would need to switch their dependency to our own SerDes.
This proposal focuses on the 2nd solution and the SerDes for Java clients.
Approach
API Design
Just use `Object` as the generic parameter to be consistent with the Confluent Avro SerDes.
```java
package io.streamnative.kafka.serializers;

import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

public class KafkaAvroSerializer implements Serializer<Object> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        /* ... */
    }

    @Override
    public byte[] serialize(String topic, Object value) {
        /* ... */
    }
}
```
```java
package io.streamnative.kafka.serializers;

import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

public class KafkaAvroDeserializer implements Deserializer<Object> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        /* ... */
    }

    @Override
    public Object deserialize(String topic, byte[] bytes) {
        /* ... */
    }
}
```
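As a rough sketch (not the final API), `configure` could read the `schema.registry.url` config described later in this proposal and build the Schema Registry Client. The `isKey` and `schemaRegistryClient` fields and the `SchemaRegistryClient.create` factory are assumptions for illustration:

```java
// A sketch only: the field names and SchemaRegistryClient.create are illustrative,
// not a committed API.
@Override
public void configure(Map<String, ?> configs, boolean isKey) {
    final Object url = configs.get("schema.registry.url");
    if (url == null) {
        throw new IllegalArgumentException("schema.registry.url must be configured");
    }
    this.isKey = isKey;
    this.schemaRegistryClient = SchemaRegistryClient.create(url.toString());
}
```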
Schema Registry Client
We need to register the schema when serializing and get the schema when deserializing. We could use Pulsar's REST API for this, but for simplicity, we can leverage the `Schemas` class from the `pulsar-client-admin` dependency for the initial implementation.
A `SchemaRegistryClient` is then responsible for these admin operations.
Register schema (in serializer):
```java
// It should be noted that the side effects of this method are counter-intuitive:
// 1. It tries to update the schema if the schema already exists.
// 2. No exception is thrown if it fails.
// 3. It doesn't return a schema version.
void createSchema(String topic, SchemaInfo schemaInfo) throws PulsarAdminException;

// Get the schema version after `createSchema` is called.
// NOTE: We could also compare the schema JSON to check whether `createSchema`
// actually updated the schema. However, to be consistent with the Pulsar
// client's behavior, even if the producer's schema is incompatible with the
// topic's schema, the creation won't fail and messages of older schemas can
// still be sent.
SchemaInfoWithVersion getSchemaInfoWithVersion(String topic) throws PulsarAdminException;
```
Get schema (in deserializer):
```java
// It should be noted that if there is no schema associated with the version,
// this method returns null. In that case, fall back to the other overload to
// get the latest schema.
SchemaInfo getSchemaInfo(String topic, long version) throws PulsarAdminException;

// Use this method if there is no schema version (*) or as a fallback.
// (*) See the next section for when to call this method.
SchemaInfo getSchemaInfo(String topic) throws PulsarAdminException;
```
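As a sketch, such a client could simply delegate to `pulsar-client-admin`. The wrapper class below is an assumption; the `Schemas` methods it calls are the existing admin API:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.admin.PulsarAdminException;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.common.schema.SchemaInfo;
import org.apache.pulsar.common.schema.SchemaInfoWithVersion;

public class SchemaRegistryClientImpl implements SchemaRegistryClient {

    private final PulsarAdmin admin;

    public SchemaRegistryClientImpl(String schemaRegistryUrl) throws PulsarClientException {
        // The "schema.registry.url" config points to Pulsar's HTTP service URL.
        this.admin = PulsarAdmin.builder().serviceHttpUrl(schemaRegistryUrl).build();
    }

    @Override
    public void createSchema(String topic, SchemaInfo schemaInfo) throws PulsarAdminException {
        admin.schemas().createSchema(topic, schemaInfo);
    }

    @Override
    public SchemaInfoWithVersion getSchemaInfoWithVersion(String topic) throws PulsarAdminException {
        return admin.schemas().getSchemaInfoWithVersion(topic);
    }

    @Override
    public SchemaInfo getSchemaInfo(String topic, long version) throws PulsarAdminException {
        return admin.schemas().getSchemaInfo(topic, version);
    }

    @Override
    public SchemaInfo getSchemaInfo(String topic) throws PulsarAdminException {
        return admin.schemas().getSchemaInfo(topic);
    }
}
```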
However, if these operations were performed each time a message arrived, the performance would decrease sharply. We could maintain a cache in the Schema Registry Client.
```java
// topic -> { schema -> schema version }
private final Map<String, Map<Schema, Long>> versionCache = new ConcurrentHashMap<>();
// topic -> { schema version -> schema }
private final Map<String, Map<Long, Schema>> schemaCache = new ConcurrentHashMap<>();
```
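For example, the deserializer path could resolve a schema roughly like this. This is a sketch: `Schema` here means Avro's `org.apache.avro.Schema`, and `parseAvroSchema` is a hypothetical helper that parses `SchemaInfo#getSchemaDefinition()`:

```java
// Resolve the schema for (topic, version), hitting the registry only on a cache miss.
private Schema getSchema(String topic, long version) throws PulsarAdminException {
    final Map<Long, Schema> byVersion =
            schemaCache.computeIfAbsent(topic, __ -> new ConcurrentHashMap<>());
    final Schema cached = byVersion.get(version);
    if (cached != null) {
        return cached;
    }
    SchemaInfo schemaInfo = schemaRegistryClient.getSchemaInfo(topic, version);
    if (schemaInfo == null) {
        // Fall back to the latest schema when nothing is associated with the version.
        schemaInfo = schemaRegistryClient.getSchemaInfo(topic);
    }
    final Schema schema = parseAvroSchema(schemaInfo); // hypothetical helper
    byVersion.put(version, schema);
    return schema;
}
```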
Schema Version Header
In Pulsar, producers set the schema version in the message metadata.
```protobuf
message MessageMetadata {
    /* ... */
    optional bytes schema_version = 16;
}
```
Then, in `Message#getValue`, the consumer can read the schema version and fetch the corresponding schema from the broker.
However, for Kafka clients, the message metadata is added on the server side (KoP). Without additional information, there is no way for KoP to know which schema version a Kafka client used.
In this proposal, the producer prepends 10 bytes to each message value.
| MARKER (2 bytes) | Schema Version, for example (8 bytes)    |
| :--------------- | :--------------------------------------- |
| 0x03 0x04        | 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00  |
The byte array `[0x03 0x04]` is a fixed marker indicating that the following 8 bytes represent the schema version. We can then use `Schema.INT64` to encode or decode it. (To avoid introducing a dependency on `pulsar-client`, we can port that schema's encoding manually.)
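A minimal sketch of encoding and decoding this header with plain NIO follows. The class and method names are illustrative, and the byte order is assumed here to be big-endian to match Pulsar's INT64 encoding:

```java
import java.nio.ByteBuffer;

public final class SchemaVersionHeader {
    // Fixed 2-byte marker followed by the 8-byte schema version.
    private static final byte[] MARKER = new byte[] {0x03, 0x04};
    public static final int HEADER_SIZE = 10;

    // Prepend the header to an already-serialized payload.
    public static byte[] prepend(long schemaVersion, byte[] payload) {
        final ByteBuffer buffer = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        buffer.put(MARKER);
        buffer.putLong(schemaVersion); // big-endian, assumed to match Schema.INT64
        buffer.put(payload);
        return buffer.array();
    }

    // Return the schema version, or null if the marker is absent.
    public static Long parseVersion(byte[] bytes) {
        if (bytes == null || bytes.length < HEADER_SIZE
                || bytes[0] != MARKER[0] || bytes[1] != MARKER[1]) {
            return null;
        }
        return ByteBuffer.wrap(bytes, 2, 8).getLong();
    }
}
```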
Serializer
- Parse the schema from the class of the user-provided object.
- Register the schema via the Schema Registry Client.
- Serialize the Schema Version Header.
- Serialize the object to bytes and append them after the header (a sketch follows).
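A sketch of these steps, assuming POJOs serialized via Avro reflection. `registerAndGetVersion` and `SchemaVersionHeader` are the hypothetical helpers described above, and `GenericRecord`/`SpecificRecord` values would need their own branches to obtain the schema:

```java
@Override
public byte[] serialize(String topic, Object value) {
    if (value == null) {
        return null;
    }
    try {
        // 1. Parse the Avro schema from the object's class (reflection-based here).
        final org.apache.avro.Schema schema =
                org.apache.avro.reflect.ReflectData.get().getSchema(value.getClass());
        // 2. Register the schema and look up its version via the Schema Registry Client.
        final long version = registerAndGetVersion(topic, schema); // hypothetical helper
        // 3 + 4. Serialize the object and prepend the Schema Version Header.
        final java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        final org.apache.avro.io.BinaryEncoder encoder =
                org.apache.avro.io.EncoderFactory.get().binaryEncoder(out, null);
        new org.apache.avro.reflect.ReflectDatumWriter<>(schema).write(value, encoder);
        encoder.flush();
        return SchemaVersionHeader.prepend(version, out.toByteArray());
    } catch (Exception e) {
        throw new org.apache.kafka.common.errors.SerializationException(
                "Failed to serialize value for topic " + topic, e);
    }
}
```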
Deserializer
- Parse the schema version from the Schema Version Header.
- If the schema version exists, get the schema associated with that version via the Schema Registry Client.
- Otherwise, get the latest schema via the Schema Registry Client.
- Deserialize the bytes with that schema (a sketch follows).
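A sketch of these steps; `getSchema` is the cached lookup shown earlier and `getLatestSchema` is a hypothetical helper that fetches and caches the latest schema:

```java
@Override
public Object deserialize(String topic, byte[] bytes) {
    if (bytes == null) {
        return null;
    }
    try {
        // 1. Parse the schema version from the Schema Version Header (null if absent).
        final Long version = SchemaVersionHeader.parseVersion(bytes);
        // 2 + 3. Get the schema for that version, or the latest schema as a fallback.
        final org.apache.avro.Schema schema = (version != null)
                ? getSchema(topic, version)
                : getLatestSchema(topic); // hypothetical helper
        final int offset = (version != null) ? SchemaVersionHeader.HEADER_SIZE : 0;
        // 4. Deserialize the remaining bytes with the resolved schema.
        final org.apache.avro.io.BinaryDecoder decoder = org.apache.avro.io.DecoderFactory.get()
                .binaryDecoder(bytes, offset, bytes.length - offset, null);
        return new org.apache.avro.generic.GenericDatumReader<>(schema).read(null, decoder);
    } catch (Exception e) {
        throw new org.apache.kafka.common.errors.SerializationException(
                "Failed to deserialize value for topic " + topic, e);
    }
}
```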
Interaction with Pulsar clients
Here we only discuss how to deal with the schema version when `entryFormat` is not `pulsar`, because when `entryFormat=pulsar`, the message format conversion must be performed each time anyway.
Kafka Producer to Pulsar Consumer
`KafkaPayloadProcessor` is a plugin configured in the Pulsar client to convert messages from the Kafka format to the Pulsar format (`Message`); see PIP 96 for details.
When the processor converts Kafka records (`MemoryRecords`) to Pulsar messages (`Message`):
- Deserialize the bytes and try to get the schema version from the Schema Version Header.
- If the schema version exists, set the schema version in message metadata.
Pulsar Producer to Kafka Consumer
When KoP reads entries from the managed cursor:
- Get the schema version from the message metadata.
- If the schema version exists, serialize it into the Schema Version Header and prepend it to the head of each entry (a sketch follows).
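A rough sketch of this step, assuming KoP already has the message metadata and payload buffer at hand (how they are obtained is out of scope here, and the helper name is illustrative):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import org.apache.pulsar.common.api.proto.MessageMetadata;

// Prepend the Schema Version Header to a payload on the KoP fetch path.
static ByteBuf maybePrependSchemaVersion(MessageMetadata metadata, ByteBuf payload) {
    if (!metadata.hasSchemaVersion()) {
        return payload;
    }
    // Pulsar stores the schema version as bytes; an 8-byte value is assumed here.
    final byte[] schemaVersion = metadata.getSchemaVersion();
    final ByteBuf header = Unpooled.buffer(2 + schemaVersion.length);
    header.writeByte(0x03).writeByte(0x04);
    header.writeBytes(schemaVersion);
    return Unpooled.wrappedBuffer(header, payload);
}
```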
Overhead Analysis
Since `KafkaPayloadProcessor` only runs in the Pulsar consumer, it doesn't affect the performance of KoP itself. It could bring some performance loss when a Pulsar consumer consumes messages produced by a Kafka producer.
However, since `KafkaPayloadProcessor` already needs to convert each single message (in the batch) from the Kafka format to the Pulsar format, the extra overhead is only parsing the first 10 bytes of each record. Here is the current code:
```java
for (Record record : records.records()) {
    // TODO: parse the 10 bytes at the head of record.value(). In
    // `newByteBufFromRecord`, a MessageMetadata will be created, so we can
    // set the schema version in this method.
    final MessagePayload singlePayload = newByteBufFromRecord(record);
    /* ... */
}
```
The overhead in KoP happens when KoP handles the FETCH request from a Kafka consumer. Let's look at the current workflow:
- Read some entries from the managed ledger.
- Set the offset field in each entry.
- Merge these entries into the buffer sent to the client.
As we can see, copying bytes cannot be avoided. If the entries were sent by a Pulsar producer, the only extra overhead is copying an additional 10 bytes per message.
Documentation Changes
- Describe how to configure the SerDes.
- Explain the possible configurations related to SerDes.
For producers, configure the serializer:
```java
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
```
For consumers, configure the deserializer:
```java
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
```
Currently, we only aim at the serializer for the value, not the key.
In addition, we must add a configuration to specify Pulsar's HTTP server URL for both serializer and deserializer.
props.put("schema.registry.url", "http://localhost:8080");
We can also add some optional configurations for SerDes:
```java
// Whether to allow the field to be null
props.put("allow.null", true);
// Some configs specific to the implementation, like the cache size limit...
```
Test Plan
Add tests for the various Pulsar schema compatibility strategies (Forward/Backward/Full).
Since `serialize` is only called when `KafkaProducer#send` is called, and `deserialize` is only called when the `KafkaConsumer` receives a message, we must send and receive at least 1 message.
Take the Forward compatibility strategy as an example:
- Create Producer-A with Schema-1 and send at least 1 message.
- Validate that the topic's schema is Schema-1.
- Create a consumer with Schema-1 and receive the message.
- Create Producer-B with a compatible Schema-2 and send some messages.
- Validate that the topic's schema is Schema-2.
- Send messages via Producer-A; sending messages with older schemas should still be allowed.
- Receive all these messages and validate them (a sketch follows).
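A rough outline of such a test in Java, assuming the SerDes classes above plus hypothetical `User1`/`User2` POJOs where `User2` adds a field to `User1`:

```java
final Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
producerProps.put("schema.registry.url", "http://localhost:8080");

try (KafkaProducer<String, Object> producerA = new KafkaProducer<>(producerProps)) {
    // Producer-A sends a message with Schema-1 (User1).
    producerA.send(new ProducerRecord<>("my-topic", "key", new User1("alice"))).get();
}
// Then: validate the registered schema via pulsar-admin, repeat with Producer-B
// and a forward-compatible Schema-2 (User2), send again via Producer-A, and
// finally consume everything with KafkaAvroDeserializer and verify the values.
```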
The other schema compatibility strategies are similar, though there will be some differences here and there.
We should also test the interaction between Pulsar clients and Kafka clients.
Top GitHub Comments
Not a big problem, but some logic on the application side might rely on the count of the headers.
They must know the encoding format in advance. Even if there were no extra bytes before the Avro-serialized bytes, they would still need to know the details. For example, in Pulsar, fields are allowed to be null with the default Avro schema; if users don't know that, deserialization might fail.
After an internal discussion at StreamNative, this task might be delayed for a while.