protobuf schema doesn't parse protobuf wire protocol correctly.

There are a whole slew of problems. First and foremost, the implementation isn’t honoring the wire protocol definition, because the docs are structured ridiculously. They document the wire format for Avro in a table, which makes it look like the table is the whole wire format. But after the table there’s a paragraph that goes on to describe how Protobuf adds an extra array of message descriptor indexes to the header, before the protobuf data. Look after the green box on this page: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format

The protobuf schema implementation completely ignores the array of message descriptor indexes and instead just searches the buffer for the first non-zero byte. But even the first message in a .proto file can be encoded as the array 1,0 (an array of length 1 holding message index 0) instead of just a single 0 (an array of length 0 with an assumed message index of 0). Note also that the integers in the array, including the length, are variable-length zigzag encoded, not fixed-width network byte order, so you have to parse them as varints and cannot just assume 4 bytes per integer.
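To make the parsing concrete, here is a minimal sketch of reading the index array by hand. It assumes the array starts right after the 5-byte header (magic byte plus 4-byte schema id). The names `readZigZagVarint` and `decodeMessageIndexes` are made up for illustration and are not part of any library; with protobufjs, `Reader.sint32()` performs the same zigzag varint decoding.

```typescript
// Result of decoding the message-index array that follows the 5-byte header.
interface MessageIndexes {
  indexes: number[]
  bytesRead: number // bytes consumed by the index array only
}

// Reads one zigzag-encoded varint starting at `pos`.
// Returns the decoded value and the position of the next unread byte.
function readZigZagVarint(buf: Uint8Array, pos: number): [number, number] {
  let raw = 0
  let shift = 0
  for (;;) {
    if (pos >= buf.length) throw new Error('truncated varint')
    const byte = buf[pos++]
    raw |= (byte & 0x7f) << shift
    if ((byte & 0x80) === 0) break
    shift += 7
    if (shift > 28) throw new Error('varint too long for a 32-bit value')
  }
  // Zigzag decode: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
  return [(raw >>> 1) ^ -(raw & 1), pos]
}

// Decodes the index array at `offset` (5 = after magic byte + schema id).
function decodeMessageIndexes(buf: Uint8Array, offset = 5): MessageIndexes {
  const [count, start] = readZigZagVarint(buf, offset)
  if (count < 0) throw new Error(`invalid index count ${count}`)
  // A count of 0 is the short form: message index 0 is assumed.
  if (count === 0) return { indexes: [0], bytesRead: start - offset }
  const indexes: number[] = []
  let pos = start
  for (let i = 0; i < count; i++) {
    const [value, next] = readZigZagVarint(buf, pos)
    indexes.push(value)
    pos = next
  }
  return { indexes, bytesRead: pos - offset }
}
```

Both encodings of the first message decode to the same index list: the single byte `0x00` (short form) and the two bytes `0x02 0x00` (length 1, index 0, each zigzag encoded).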

Then there’s the problem that things other than messages (Types) can be declared within a .proto file. Even if the indexes were being honored in the recursive loop implemented by getNestedTypeName(), the exit condition only checks for Type and Namespace, but it is entirely possible to encounter an Enum or a Service or a Method. So it is necessary to iterate over the keys in parent until you find either a Namespace or a Type, and then continue traversing the descriptor hierarchy from there, rather than always assuming the first key is the one to use.

This is relatively simple to implement: just iterate over the keys until you find one whose value is an instanceof Type or Namespace.

Finally, the code assumes that there is a parsedMessage.package field, which may be there when parsing a .proto string, but is definitely NOT there when parsing a JSON protobuf descriptor. When the root is created via a JSON descriptor, you have to derive the package name by iterating through the Namespace declarations in the parent hierarchy. It’d be great if the package made it into the JSON descriptor, but until it does, it is probably safer to determine the package name dynamically rather than reading it from the parser, since it might disappear from parsed protos as easily as it could be added to JSON descriptors.

    if (reflection instanceof Namespace && !(reflection instanceof Type) && reflection.nested)
        return reflection.name + '.' + this.getNestedTypeName(reflection.nested)
    return keys[0]

This returns the fully qualified name without relying on the package string.
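Putting the two points together (skip declarations that are not messages, and build the qualified name from the namespaces themselves), a rough sketch might look like the following. The classes here are tiny stand-ins written for this example, not the real protobufjs classes; they only mirror the relationship the snippet above relies on, namely that a Type is also a Namespace. `firstTypeOrNamespace` is a hypothetical helper name.

```typescript
// Minimal stand-ins for protobufjs reflection classes (illustration only).
class ReflectionObject {
  name: string
  constructor(name: string) {
    this.name = name
  }
}
class Namespace extends ReflectionObject {
  nested: { [k: string]: ReflectionObject } = {}
  add(child: ReflectionObject): this {
    this.nested[child.name] = child
    return this
  }
}
// A message type; in this sketch it is modeled as a kind of Namespace.
class Type extends Namespace {
  fields: string[] = []
}
// One example of a declaration that is not a message.
class Enum extends ReflectionObject {
  values: { [k: string]: number } = {}
}

// Returns the first nested entry that is a Namespace or a Type,
// skipping anything else instead of assuming the first key is usable.
function firstTypeOrNamespace(nested: { [k: string]: ReflectionObject }): Namespace | undefined {
  for (const key of Object.keys(nested)) {
    const reflection = nested[key]
    if (reflection instanceof Namespace) return reflection
  }
  return undefined
}

// Follows namespaces down to the first message and returns its
// fully qualified name, without consulting any package string.
function getNestedTypeName(nested: { [k: string]: ReflectionObject }): string {
  const found = firstTypeOrNamespace(nested)
  if (!found) throw new Error('no message type found')
  if (found instanceof Type) return found.name
  return found.name + '.' + getNestedTypeName(found.nested)
}
```

With an Enum declared ahead of the message at any level, the lookup still lands on the message, which is the case the first-key assumption gets wrong.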

I’m going to take a stab at fixing the code in ProtoSchema.ts and submitting a PR to fix all of that (and the corresponding changes in the serializer which will generate the correct array by walking the descriptor hierarchy).
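For the serializer side, here is a sketch of what writing that array might look like. It assumes the list of message indexes has already been computed by walking the descriptor hierarchy, which is not shown. `writeZigZagVarint` and `encodeMessageIndexes` are hypothetical names, and emitting the single-byte short form for the first message is a choice made for this example.

```typescript
// Appends one zigzag-encoded varint for a 32-bit signed integer to `out`.
function writeZigZagVarint(value: number, out: number[]): void {
  // Zigzag encode: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
  let raw = ((value << 1) ^ (value >> 31)) >>> 0
  while (raw > 0x7f) {
    out.push((raw & 0x7f) | 0x80) // low 7 bits plus continuation bit
    raw >>>= 7
  }
  out.push(raw)
}

// Encodes the message-index array that follows the magic byte and schema id.
function encodeMessageIndexes(indexes: number[]): Uint8Array {
  // Short form: the first message in the file is written as a single 0 byte.
  if (indexes.length === 1 && indexes[0] === 0) return Uint8Array.of(0)
  const out: number[] = []
  writeZigZagVarint(indexes.length, out)
  for (const index of indexes) writeZigZagVarint(index, out)
  return Uint8Array.from(out)
}
```

Because every value is zigzag encoded, a length of 2 goes out as the byte `0x04`, not `0x02`, which is easy to get wrong when reading a hex dump.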

Top GitHub Comments

ideasculptor commented, Oct 1, 2021

On the decode side, it looks like this:

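    // Parses the wire-format header: magic byte, schema id, then the message-index array.
    private decodeHeader(topic: string, buffer: Buffer): ProtoInfo {
        const bufferReader = Reader.create(buffer)
        // The magic byte is a single 0 byte, so reading it as a varint consumes exactly one byte.
        const magicByte = bufferReader.uint32()
        // fixed32() reads little-endian, but the schema id is big-endian on the wire;
        // HostOrder (a helper not shown here) presumably swaps the byte order.
        const schemaId = HostOrder(bufferReader.fixed32())
        // The array length and every index are zigzag-encoded varints.
        const arrayLen = bufferReader.sint32()
        const msgIndexes = new Array<number>(arrayLen)
        for (let i = 0; i < arrayLen; i++) {
            msgIndexes[i] = bufferReader.sint32()
        }
        return {
            magicByte,
            schemaId,
            msgIndexes,
            // Offset at which the protobuf payload starts.
            bytesRead: bufferReader.pos,
        }
    }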

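    public async deserialize(topic: string, buffer: Buffer): Promise<Message<{}>> {
        // Minimum size: 1 magic byte + 4-byte schema id + at least 1 byte of message indexes.
        if (buffer.length < 6) {
            throw new Error(`buffer with length ${buffer.length} is not long enough to contain a protobuf`)
        }
        const protoInfo = this.decodeHeader(topic, buffer)

        // Let the resolver pick the Type from topic, schema id, and message indexes.
        const type = await this.protobufResolver.ResolveProtobuf(topic, protoInfo.schemaId, protoInfo.msgIndexes)
        // Skip the header and decode the remaining bytes as the protobuf payload.
        const bufferReader = Reader.create(buffer)
        bufferReader.skip(protoInfo.bytesRead)
        return type.decode(bufferReader)
    }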

protobufResolver uses the registry client, the topic name, and the info parsed from the wire protocol to resolve a protobuf Type instance, which is then used to decode the protobuf. That lets me inject whatever logic I want into the deserializer, via protobufResolver, for figuring out the type that is encoded in the payload, since correctly computing the message type from message indexes isn’t really possible with protobufjs as it is currently implemented. At least not if you also have imported references to other protobufs in your .proto files, since the schema parsed out of the registry won’t include the references. By delegating to a resolver, I can resort to quick hacks like hardcoding the type name based on indexes and topic name, for example.
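As an illustration of the "quick hack" kind of resolver, here is a sketch that hardcodes the mapping from topic and message indexes to a type. Only the `ResolveProtobuf(topic, schemaId, msgIndexes)` call shape comes from the code above; the `ProtobufResolver` interface, the `HardcodedResolver` class, and its `register` and `lookup` methods are assumptions made for this example. The type parameter `T` stands in for the protobufjs Type so the sketch does not depend on the library.

```typescript
// Hypothetical resolver contract; T stands in for a protobufjs Type.
interface ProtobufResolver<T> {
  ResolveProtobuf(topic: string, schemaId: number, msgIndexes: number[]): Promise<T>
}

// Resolves types from a hand-maintained table keyed by topic and message indexes.
class HardcodedResolver<T> implements ProtobufResolver<T> {
  private readonly table = new Map<string, T>()

  private static key(topic: string, msgIndexes: number[]): string {
    return `${topic}:${msgIndexes.join('.')}`
  }

  // Registers the type to use for a given topic and index path.
  register(topic: string, msgIndexes: number[], type: T): this {
    this.table.set(HardcodedResolver.key(topic, msgIndexes), type)
    return this
  }

  // Synchronous lookup; returns undefined when nothing is registered.
  lookup(topic: string, msgIndexes: number[]): T | undefined {
    return this.table.get(HardcodedResolver.key(topic, msgIndexes))
  }

  // The schema id is ignored for resolution here and only used in the error message.
  async ResolveProtobuf(topic: string, schemaId: number, msgIndexes: number[]): Promise<T> {
    const type = this.lookup(topic, msgIndexes)
    if (type === undefined) {
      throw new Error(`no type registered for topic ${topic}, schema ${schemaId}, indexes [${msgIndexes}]`)
    }
    return type
  }
}
```

A registry-backed resolver could implement the same interface later without touching the deserializer.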

ideasculptor commented, Oct 1, 2021

I’ll probably open-source the serializer and deserializer I built, but it’s not likely to happen for a week or two.
