Pulsar SQL: support user defined indexes
Is your feature request related to a problem? Please describe.
Currently, Pulsar SQL has no index to use when querying a topic through Presto. __publish_time__
can be treated as an index because of the way ledgers store data, but it is not a real one.
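To make the distinction concrete, here is a toy model in Python, not the Presto connector's code. It assumes entries are appended in publish-time order, so a time-range filter can be narrowed with a binary search, while a filter on any other field has to read every entry. The names `entries`, `publish_times`, `range_by_publish_time` and `scan_by_field` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy topic: (publish_time_ms, record) tuples, appended in publish-time order.
entries = [
    (1000, {"user": "a", "amount": 5}),
    (2000, {"user": "b", "amount": 7}),
    (3000, {"user": "a", "amount": 9}),
    (4000, {"user": "c", "amount": 1}),
]

# Precomputed, sorted publish times (assumption: a real store would keep
# comparable metadata instead of rebuilding it per query).
publish_times = [t for t, _ in entries]


def range_by_publish_time(entries, publish_times, start_ms, end_ms):
    """Return entries with start_ms <= publish time <= end_ms via binary search."""
    lo = bisect_left(publish_times, start_ms)
    hi = bisect_right(publish_times, end_ms)
    return entries[lo:hi]


def scan_by_field(entries, field, value):
    """No ordering to exploit on other fields, so every entry is inspected."""
    return [e for e in entries if e[1].get(field) == value]
```

The time-range lookup touches only the matching slice, whereas `scan_by_field` is a full scan. That full scan is the cost a user-defined index would aim to remove.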
Describe the solution you’d like
The AvroSchema used to publish to a topic should come with an index definition. With that in place, we could keep a managed ledger for the indexes that references either the regular managed ledgers or the message IDs, and then configure the Pulsar Presto implementation to use the user-defined indexes from the schema. (This is a suggestion to start the discussion; as @jerrypeng and I discussed, it is a large topic.)
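As a rough illustration of the idea, the sketch below keeps a secondary index that maps a field value to the message IDs where it occurs. It is an in-memory stand-in for what an index managed ledger might hold, under the assumption that a message ID is a (ledger id, entry id) pair. `MessageId`, `FieldIndex`, `on_publish` and `lookup` are hypothetical names, not Pulsar APIs.

```python
from collections import defaultdict
from typing import NamedTuple


class MessageId(NamedTuple):
    """Simplified message id: which ledger, and which entry inside it."""
    ledger_id: int
    entry_id: int


class FieldIndex:
    """Hypothetical secondary index: field value -> list of message ids."""

    def __init__(self, field):
        self.field = field
        self._postings = defaultdict(list)

    def on_publish(self, message_id, record):
        # Record the position of every message that carries the indexed field.
        if self.field in record:
            self._postings[record[self.field]].append(message_id)

    def lookup(self, value):
        # Return a copy so callers cannot mutate the index.
        return list(self._postings.get(value, []))


# Usage: index the "user" field while messages are published.
index = FieldIndex("user")
index.on_publish(MessageId(1, 0), {"user": "a", "amount": 5})
index.on_publish(MessageId(1, 1), {"user": "b", "amount": 7})
index.on_publish(MessageId(2, 0), {"user": "a", "amount": 9})
```

A query such as `user = 'a'` could then read only the entries returned by `index.lookup("a")` instead of scanning the whole topic.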
Describe alternatives you’ve considered
There are probably multiple ways to do this; feel free to suggest your point of view.
Additional context
The goal is to reduce query runtime.
Issue Analytics
- Created 3 years ago
- Reactions:1
- Comments:12 (12 by maintainers)
I don’t think it is a good idea to add an index definition to the schema definition. The schema definition defines the structure of the original data. The index definition depends on the schema definition but it is different from the original data. So the index definition should be associated with the storage that is used for storing the index data. For example, if we are using another managed ledger for storing the index, then the index definition should be the schema definition of the managed ledger. Does that make sense?
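One way to picture this separation is the minimal Python sketch below. The data schema describes only the original records, and the index definition lives in its own object that refers to fields of that schema. `ORDER_SCHEMA`, `IndexDefinition` and `validate` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass

# Data schema: describes only the structure of the original records.
ORDER_SCHEMA = {"user": "string", "amount": "int"}


@dataclass(frozen=True)
class IndexDefinition:
    """Kept apart from the data schema; it would travel with the index storage."""
    source_topic: str
    indexed_fields: tuple

    def validate(self, data_schema):
        # The index depends on the data schema, so each indexed field must exist in it.
        missing = [f for f in self.indexed_fields if f not in data_schema]
        if missing:
            raise ValueError(f"unknown fields: {missing}")
        return True


# Usage: the index definition references the schema but is not part of it.
user_index = IndexDefinition(source_topic="orders", indexed_fields=("user",))
```

Under this layout the index can be added, changed or dropped without touching the schema of the original data.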
Is there any progress on this issue? Supporting indexes in Pulsar SQL would be a very meaningful feature.
One way is to support it natively; another, I think, is to achieve it through tiered storage, for example by combining Pulsar with a data lake using Apache Hudi or a similar tool.
I saw some articles about combining Hudi and Pulsar. Is there any progress? @sijie