[Feature Request] Integrate Iceberg with Pulsar; support reading Iceberg tables sequentially via the Java API
See original GitHub issue
Motivation
Apache Pulsar is working on an integration with Iceberg, using Iceberg as tiered storage to offload cold topic data. When consumers fetch cold data from a topic, the Pulsar broker determines whether the target data is stored in Pulsar or in tiered storage. If it is stored in tiered storage (Iceberg), the broker fetches the data from Iceberg via the Java API, packages it into the Pulsar format, and dispatches it to the consumer.
For the Pulsar-Iceberg integration, we first use an Iceberg writer to stream topic messages into an Iceberg table in one thread, and then use an Iceberg reader to read records back from the table as a stream. In Pulsar's read path, we must ensure that records are read in the same order they were written. However, we found that the current Iceberg Java reader implementation neither guarantees read order nor supports an order-by operation on read.
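The gap described above can be sketched in plain Java, with no Iceberg API involved. The idea (an assumption on our part, not an Iceberg feature) is that the writer stamps every record with a monotonically increasing sequence field, here hypothetically named `seq`; since a table scan may return data files in any order, the reader must explicitly sort on that field to recover write order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OrderedReadSketch {
    // "seq" is a hypothetical column the writer stamps on every record at write time.
    public record Message(long seq, String payload) {}

    // Restore write order after a scan that returns data files in arbitrary order.
    public static List<Message> sortBySeq(List<Message> scanned) {
        List<Message> ordered = new ArrayList<>(scanned);
        ordered.sort(Comparator.comparingLong(Message::seq));
        return ordered;
    }

    public static void main(String[] args) {
        // Two "data files"; the reader happens to emit fileB before fileA.
        List<Message> fileB = List.of(new Message(2, "b"), new Message(3, "c"));
        List<Message> fileA = List.of(new Message(0, "x"), new Message(1, "a"));
        List<Message> scanned = new ArrayList<>(fileB);
        scanned.addAll(fileA);

        for (Message m : sortBySeq(scanned)) {
            System.out.println(m.seq() + ":" + m.payload());
        }
    }
}
```

This only restores order within what has already been read into memory; it does not give the streaming, incremental ordered read the request is asking for.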
What we need
- We need Iceberg to support reading records back in write order, or to support ordering by specific fields.
- We need to read the change log of an Iceberg table as a stream.
- Does the Iceberg community have plans to support this feature?
Issue Analytics
- State:
- Created a year ago
- Reactions: 2
- Comments: 11 (9 by maintainers)
Top GitHub Comments
In terms of order granularity, here are 3 of them
We can discuss it in the community sync; feel free to add it to the agenda. I'd suggest writing a proposal before that. It may present a rough idea, or multiple ideas, and should address concerns such as how this works with the current data file compaction, since compaction may break the order between multiple data files.
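The compaction concern raised above can be illustrated with a small sketch (again not Iceberg code): a compaction job rewrites several small files into one, and if the rewrite simply concatenates files in an arbitrary order, any file-order-based ordering is lost. Under the same assumption of a hypothetical writer-assigned `seq` field, sorting on it during compaction keeps the result order-safe:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactionOrderSketch {
    // "seq" is a hypothetical writer-assigned sequence column.
    public record Row(long seq, String value) {}

    // Hypothetical compaction: rewrite several small files into one.
    // Plain concatenation preserves order only if the files happen to be
    // concatenated in sequence order; sorting by "seq" is order-safe
    // regardless of the input file order.
    public static List<Row> compact(List<List<Row>> files) {
        List<Row> merged = new ArrayList<>();
        files.forEach(merged::addAll);
        merged.sort(Comparator.comparingLong(Row::seq));
        return merged;
    }

    public static void main(String[] args) {
        List<Row> f1 = List.of(new Row(0, "a"), new Row(2, "c"));
        List<Row> f2 = List.of(new Row(1, "b"), new Row(3, "d"));
        // Compaction input order is arbitrary; output is still in write order.
        compact(List.of(f2, f1)).forEach(r -> System.out.println(r.seq() + ":" + r.value()));
    }
}
```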
It sounds like the Pulsar reader expects the Apache Iceberg reader to guarantee record-level write-order semantics. As far as I know, it is hard for a table format based on DFS files to maintain record-level order semantics. Currently, the Iceberg table format is optimized for batch-analysis readers, not for message-queue consumers.
I think @rdblue and @RussellSpitzer have discussed an approach of dual-writing to the Iceberg table format and the message queue, so that downstream readers can consume records in write order, but I don't see a public discussion or design document describing the details. Maybe they can provide more input on this issue.