[FEATURE] An integrated solution for Pulsar Schema and Flink
Is your feature request related to a problem? Please describe.
Because the Pulsar connector does not adhere to the Flink format standard, we have to maintain our own schema conversions, data structure conversions, and so on. This part of the code is very difficult for me to maintain, and it is hard for new contributors to get involved in it. FlinkPulsarRowSink and FlinkPulsarRowSource are the main implementations affected.
Another important reason is the lack of a Sink for the Format, which may affect the progress of merging the connector back into the community.
Describe the solution you’d like
- plan 2: Flink's Avro and JSON formats are standard formats that can be easily integrated with Pulsar. Let the user freely choose a Flink format such as Avro or JSON, and generate the corresponding Pulsar Schema information from that choice (see the sketch below). When Pulsar supports a new serialization method, we can easily add a new Flink format or reuse an existing one.
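A minimal sketch of the Pulsar Schema generation described above, assuming all we need is to map the user's format identifier to a Pulsar SchemaType; the class and method names are illustrative, not part of either project's API:

```java
// Illustrative sketch: derive the Pulsar schema type from the Flink format the user
// configured. The payload itself is always produced by the Flink format, so the
// wire representation stays a byte array either way.
import org.apache.pulsar.common.schema.SchemaType;

public final class PulsarSchemaFromFlinkFormat {

    // Hypothetical helper, not an existing connector API.
    public static SchemaType schemaTypeFor(String flinkFormat) {
        switch (flinkFormat) {
            case "json":
                return SchemaType.JSON;
            case "avro":
                return SchemaType.AVRO;
            default:
                // Unknown formats fall back to raw bytes with no schema enforcement.
                return SchemaType.BYTES;
        }
    }
}
```

Because this mapping is per format rather than per data type, supporting a new serialization method stays a small, localized change.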
Describe alternatives you’ve considered
- plan 1: Create a Pulsar format and give the user the option to use it. We would need to maintain the conversion relationship between Flink data types and Pulsar data types, which is not easy (see the sketch below).
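For contrast, a rough sketch of the kind of per-type mapping plan 1 would force us to maintain; the class is hypothetical and the cases shown are only a small subset of what a real conversion would need:

```java
// Illustrative only: plan 1 requires a mapping like this for every Flink logical type
// and every Pulsar schema type, and it has to be kept in sync with both projects.
import org.apache.flink.table.types.logical.LogicalTypeRoot;
import org.apache.pulsar.common.schema.SchemaType;

final class FlinkToPulsarTypeMapping {

    static SchemaType toPulsarType(LogicalTypeRoot flinkType) {
        switch (flinkType) {
            case BOOLEAN: return SchemaType.BOOLEAN;
            case INTEGER: return SchemaType.INT32;
            case BIGINT:  return SchemaType.INT64;
            case DOUBLE:  return SchemaType.DOUBLE;
            case VARCHAR: return SchemaType.STRING;
            // ... many more cases: DECIMAL, TIMESTAMP, ROW, ARRAY, MAP, ...
            default:
                throw new UnsupportedOperationException("Unmapped Flink type: " + flinkType);
        }
    }
}
```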
Additional context
Pulsar has its own Schema abstraction, which supports serialization and deserialization of POJO data. Flink likewise has a Format abstraction that supports serialization and deserialization of data.
When using a Flink format, the data written to or read from Pulsar is plain bytes. On the producer side we can use AutoProduceBytesSchema, which carries the real Schema information alongside the bytes and lets the broker verify that the bytes are compatible with the schema registered on the topic.
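A minimal sketch of the producer side under this approach, assuming a local broker at pulsar://localhost:6650 and a Flink SerializationSchema obtained from the configured format; the helper methods are placeholders for the surrounding connector code:

```java
// Sketch only: publish bytes serialized by a Flink format through a producer that
// uses AUTO_PRODUCE_BYTES, so the broker validates the bytes against the topic schema.
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.table.data.RowData;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class AutoProduceExample {

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed local broker
                .build();

        // The Flink format (e.g. flink-json, flink-avro) produces the payload bytes.
        SerializationSchema<RowData> flinkFormat = createFlinkFormat(); // hypothetical factory

        // AUTO_PRODUCE_BYTES defers schema validation to the broker: the bytes must be
        // compatible with the schema already registered on the topic.
        try (Producer<byte[]> producer = client
                .newProducer(Schema.AUTO_PRODUCE_BYTES())
                .topic("persistent://public/default/example-topic")
                .create()) {

            RowData row = nextRow(); // hypothetical source of rows
            producer.send(flinkFormat.serialize(row));
        }

        client.close();
    }

    // Placeholders for the parts provided by the surrounding connector code.
    private static SerializationSchema<RowData> createFlinkFormat() { throw new UnsupportedOperationException(); }
    private static RowData nextRow() { throw new UnsupportedOperationException(); }
}
```

With AUTO_PRODUCE_BYTES the broker, not the connector, checks that the published bytes match the topic's registered schema.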
Top GitHub Comments
Referring to FLIP-107 and the Kafka connector implementation, we will implement an adaptive serialization and deserialization interface that contains the serialization components for external inputs and the Pulsar Schema generation. The responsibilities of the Sink interface will be further simplified, and redundant implementation classes will be removed.
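A hypothetical sketch of such an adaptive interface, loosely mirroring the Kafka connector's KafkaSerializationSchema; the name PulsarSerializationSchema and its methods are illustrative, not the final API:

```java
// Illustrative only: an adaptive serialization interface in the spirit of FLIP-107.
import java.io.Serializable;
import org.apache.pulsar.common.schema.SchemaInfo;

public interface PulsarSerializationSchema<T> extends Serializable {

    /** Serialize one element into the message payload using the configured Flink format. */
    byte[] serialize(T element);

    /** Expose the Pulsar schema information generated from the chosen Flink format. */
    SchemaInfo getSchemaInfo();

    /** Optionally route the element, e.g. derive the target topic per record. */
    default String getTargetTopic(T element) {
        return null; // null = use the sink's default topic
    }
}
```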
The responsibilities of the components in the current Pulsar connector are not clear: there are many overlapping implementations such as FlinkPulsarRowSource, ReaderThread, RowReaderThread, PulsarFetcher, and so on. Flink's Row type has also been replaced by the RowData type. The unreasonable split of the Pulsar serialization and deserialization logic is the root cause of this confusion.
Final Solution: Wrap the Flink format instance as an implementation of the Pulsar Schema interface. This satisfies the Flink format specification and meets Pulsar’s expectations.
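A minimal sketch of that wrapping, assuming a recent Pulsar client where Schema<T> exposes encode, decode, getSchemaInfo, and clone; the class name and the simplified SchemaInfo handling are illustrative:

```java
// Sketch only: wrap a Flink format (SerializationSchema / DeserializationSchema)
// as an implementation of Pulsar's Schema interface.
import java.io.IOException;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.table.data.RowData;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.common.schema.SchemaInfo;

public class FlinkFormatPulsarSchema implements Schema<RowData> {

    private final SerializationSchema<RowData> serializer;     // e.g. the JSON/Avro format's serializer
    private final DeserializationSchema<RowData> deserializer; // e.g. the JSON/Avro format's deserializer

    public FlinkFormatPulsarSchema(
            SerializationSchema<RowData> serializer,
            DeserializationSchema<RowData> deserializer) {
        this.serializer = serializer;
        this.deserializer = deserializer;
    }

    @Override
    public byte[] encode(RowData row) {
        // The Flink format does the actual serialization; Pulsar only sees bytes.
        return serializer.serialize(row);
    }

    @Override
    public RowData decode(byte[] bytes) {
        try {
            return deserializer.deserialize(bytes);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to deserialize message with the Flink format", e);
        }
    }

    @Override
    public SchemaInfo getSchemaInfo() {
        // Simplification: report raw bytes. A full implementation would generate the
        // JSON/Avro SchemaInfo from the configured Flink format, as described above.
        return Schema.BYTES.getSchemaInfo();
    }

    public Schema<RowData> clone() {
        return new FlinkFormatPulsarSchema(serializer, deserializer);
    }
}
```

Keeping the Flink format as the single source of truth for the byte representation is what removes the duplicated conversion code called out above.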