[FEATURE] An integrated solution for Pulsar Schema and Flink
Is your feature request related to a problem? Please describe.
Because the Pulsar connector does not adhere to the Flink format standard, we have to maintain our own schema conversions, data structure conversions, and so on. This part of the code is very difficult for me to maintain, and it is hard for new contributors to get involved in it. FlinkPulsarRowSink and FlinkPulsarRowSource are the main implementations affected.
Another important reason is the lack of a Sink for the Format, which may affect the progress of merging the connector back into the community.
Describe the solution you’d like
- plan 2: Flink's Avro and JSON formats are standard formats that can be easily integrated with Pulsar. Let the user freely choose a Flink format such as Avro or JSON, and generate the corresponding Pulsar Schema information from that choice (see the sketch below). When Pulsar supports a new serialization method, we can easily add a new Flink format or reuse an existing one.
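A minimal sketch of the Pulsar Schema generation described above, assuming all we need is to map the user's format identifier to a Pulsar SchemaType; the class and method names are illustrative, not part of either project's API:

```java
// Illustrative sketch: derive the Pulsar schema type from the Flink format the user
// configured. The payload itself is always produced by the Flink format, so the
// wire representation stays a byte array either way.
import org.apache.pulsar.common.schema.SchemaType;

public final class PulsarSchemaFromFlinkFormat {

    // Hypothetical helper, not an existing connector API.
    public static SchemaType schemaTypeFor(String flinkFormat) {
        switch (flinkFormat) {
            case "json":
                return SchemaType.JSON;
            case "avro":
                return SchemaType.AVRO;
            default:
                // Unknown formats fall back to raw bytes with no schema enforcement.
                return SchemaType.BYTES;
        }
    }
}
```

Because this mapping is per format rather than per data type, supporting a new serialization method stays a small, localized change.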
Describe alternatives you’ve considered
- plan 1: Create a Pulsar format and give the user the option to use it. We would need to maintain the conversion relationship between Flink data types and Pulsar data types, which is not easy (see the sketch below).
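For contrast, a rough sketch of the kind of per-type mapping plan 1 would force us to maintain; the class is hypothetical and the cases shown are only a small subset of what a real conversion would need:

```java
// Illustrative only: plan 1 requires a mapping like this for every Flink logical type
// and every Pulsar schema type, and it has to be kept in sync with both projects.
import org.apache.flink.table.types.logical.LogicalTypeRoot;
import org.apache.pulsar.common.schema.SchemaType;

final class FlinkToPulsarTypeMapping {

    static SchemaType toPulsarType(LogicalTypeRoot flinkType) {
        switch (flinkType) {
            case BOOLEAN: return SchemaType.BOOLEAN;
            case INTEGER: return SchemaType.INT32;
            case BIGINT:  return SchemaType.INT64;
            case DOUBLE:  return SchemaType.DOUBLE;
            case VARCHAR: return SchemaType.STRING;
            // ... many more cases: DECIMAL, TIMESTAMP, ROW, ARRAY, MAP, ...
            default:
                throw new UnsupportedOperationException("Unmapped Flink type: " + flinkType);
        }
    }
}
```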
Additional context
Pulsar has its own Schema abstraction, which supports serialization and deserialization of POJO data. Flink likewise has a Format abstraction that supports serialization and deserialization of data.
When using a Flink format, the data written to or read from Pulsar is plain bytes. On the producer side we can use AutoProduceBytesSchema, which carries the real Schema information alongside the bytes and lets the broker verify that the bytes are compatible with the schema registered on the topic.
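A minimal sketch of the producer side under this approach, assuming a local broker at pulsar://localhost:6650 and a Flink SerializationSchema obtained from the configured format; the helper methods are placeholders for the surrounding connector code:

```java
// Sketch only: publish bytes serialized by a Flink format through a producer that
// uses AUTO_PRODUCE_BYTES, so the broker validates the bytes against the topic schema.
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.table.data.RowData;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class AutoProduceExample {

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // assumed local broker
                .build();

        // The Flink format (e.g. flink-json, flink-avro) produces the payload bytes.
        SerializationSchema<RowData> flinkFormat = createFlinkFormat(); // hypothetical factory

        // AUTO_PRODUCE_BYTES defers schema validation to the broker: the bytes must be
        // compatible with the schema already registered on the topic.
        try (Producer<byte[]> producer = client
                .newProducer(Schema.AUTO_PRODUCE_BYTES())
                .topic("persistent://public/default/example-topic")
                .create()) {

            RowData row = nextRow(); // hypothetical source of rows
            producer.send(flinkFormat.serialize(row));
        }

        client.close();
    }

    // Placeholders for the parts provided by the surrounding connector code.
    private static SerializationSchema<RowData> createFlinkFormat() { throw new UnsupportedOperationException(); }
    private static RowData nextRow() { throw new UnsupportedOperationException(); }
}
```

With AUTO_PRODUCE_BYTES the broker, not the connector, checks that the published bytes match the topic's registered schema.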
Top GitHub Comments
Referring to FLIP-107 and the Kafka connector implementation, we will implement an adaptive serialization and deserialization interface that contains the serialization components for external inputs and the Pulsar Schema generation. The responsibilities of the Sink interface will be further simplified, and redundant implementation classes will be removed.
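A hypothetical sketch of such an adaptive interface, loosely mirroring the Kafka connector's KafkaSerializationSchema; the name PulsarSerializationSchema and its methods are illustrative, not the final API:

```java
// Illustrative only: an adaptive serialization interface in the spirit of FLIP-107.
import java.io.Serializable;
import org.apache.pulsar.common.schema.SchemaInfo;

public interface PulsarSerializationSchema<T> extends Serializable {

    /** Serialize one element into the message payload using the configured Flink format. */
    byte[] serialize(T element);

    /** Expose the Pulsar schema information generated from the chosen Flink format. */
    SchemaInfo getSchemaInfo();

    /** Optionally route the element, e.g. derive the target topic per record. */
    default String getTargetTopic(T element) {
        return null; // null = use the sink's default topic
    }
}
```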
The responsibilities of the components in the current Pulsar connector are not clear: there are many overlapping implementations such as FlinkPulsarRowSource, ReaderThread, RowReaderThread, PulsarFetcher, and so on. Flink's Row type has also been replaced by the RowData type. The unreasonable split of the Pulsar serialization and deserialization logic is the root cause of this confusion.
Final Solution: Wrap the Flink format instance as an implementation of the Pulsar Schema interface. This satisfies the Flink format specification and meets Pulsar’s expectations.
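A minimal sketch of that wrapping, assuming a recent Pulsar client where Schema<T> exposes encode, decode, getSchemaInfo, and clone; the class name and the simplified SchemaInfo handling are illustrative:

```java
// Sketch only: wrap a Flink format (SerializationSchema / DeserializationSchema)
// as an implementation of Pulsar's Schema interface.
import java.io.IOException;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.table.data.RowData;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.common.schema.SchemaInfo;

public class FlinkFormatPulsarSchema implements Schema<RowData> {

    private final SerializationSchema<RowData> serializer;     // e.g. the JSON/Avro format's serializer
    private final DeserializationSchema<RowData> deserializer; // e.g. the JSON/Avro format's deserializer

    public FlinkFormatPulsarSchema(
            SerializationSchema<RowData> serializer,
            DeserializationSchema<RowData> deserializer) {
        this.serializer = serializer;
        this.deserializer = deserializer;
    }

    @Override
    public byte[] encode(RowData row) {
        // The Flink format does the actual serialization; Pulsar only sees bytes.
        return serializer.serialize(row);
    }

    @Override
    public RowData decode(byte[] bytes) {
        try {
            return deserializer.deserialize(bytes);
        } catch (IOException e) {
            throw new IllegalStateException("Failed to deserialize message with the Flink format", e);
        }
    }

    @Override
    public SchemaInfo getSchemaInfo() {
        // Simplification: report raw bytes. A full implementation would generate the
        // JSON/Avro SchemaInfo from the configured Flink format, as described above.
        return Schema.BYTES.getSchemaInfo();
    }

    public Schema<RowData> clone() {
        return new FlinkFormatPulsarSchema(serializer, deserializer);
    }
}
```

Keeping the Flink format as the single source of truth for the byte representation is what removes the duplicated conversion code called out above.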