[Feature Request] Iceberg integrates with Pulsar, supports java to read iceberg tables sequentially

Motivation

Apache Pulsar is integrating with Iceberg, using Iceberg as tiered storage to offload cold topic data. When a consumer fetches cold data from a topic, the Pulsar broker determines whether the target data is stored in Pulsar or in tiered storage. If it is in tiered storage (Iceberg), the broker fetches the data from Iceberg through the Java API, packages it into Pulsar’s format, and dispatches it to the consumer.

For the Pulsar-Iceberg integration, we first use an Iceberg writer to stream topic messages into an Iceberg table from a single thread. We then use an Iceberg reader to stream records back out of the table. In Pulsar’s read path, records must be read in the same order in which they were written. However, we found that the current Iceberg Java reader implementation supports neither reading records in write order nor an ORDER BY operation on read.

What we need

We need Iceberg to support reading records in the order they were written, or ordering by specific fields.
We need to read the change log of the Iceberg table as a stream.
Does the Iceberg community have plans to support this feature?
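
To make the "order by specific fields" request concrete, here is a minimal sketch of what a reader could do on its own side if every record carried a writer-assigned sequence number. Everything in it is an assumption for illustration: `Message`, its `sequence` field, and `mergeBySequence` are hypothetical names, not part of the Iceberg or Pulsar APIs, and the sketch assumes each data file is already sorted by `sequence` internally. It performs a k-way merge over the files returned by a scan, whatever order the scan returns them in.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these types belong to the Iceberg or Pulsar APIs.
final class OrderedReadSketch {

    // A message plus the sequence number the single writer thread assigned to it (assumed field).
    static final class Message {
        final long sequence;
        final String payload;

        Message(long sequence, String payload) {
            this.sequence = sequence;
            this.payload = payload;
        }
    }

    // Cursor over one data file; assumes the file is internally sorted by sequence.
    private static final class Cursor {
        Message head;
        final Iterator<Message> rest;

        Cursor(Iterator<Message> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // K-way merge of per-file record lists back into writer order.
    static List<Message> mergeBySequence(List<List<Message>> files) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head.sequence));
        for (List<Message> file : files) {
            if (!file.isEmpty()) {
                heap.add(new Cursor(file.iterator()));
            }
        }
        List<Message> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();          // file whose next record has the smallest sequence
            ordered.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();      // advance the cursor and put it back on the heap
                heap.add(c);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Two "data files" handed back by a scan in arbitrary order.
        List<Message> fileB = List.of(new Message(2, "b"), new Message(5, "e"));
        List<Message> fileA = List.of(new Message(1, "a"), new Message(3, "c"), new Message(4, "d"));
        for (Message m : mergeBySequence(List.of(fileB, fileA))) {
            System.out.println(m.sequence + " " + m.payload);
        }
    }
}
```

A merge like this keeps memory bounded to one record per open file, but it still has to open every candidate file up front, which is part of why ordering support inside the reader itself is being requested.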

Issue Analytics

Created a year ago
Reactions: 2
Comments: 11 (9 by maintainers)

Top GitHub Comments

3 reactions

flyrain commented, Apr 25, 2022

In terms of order granularity, there are three levels:

snapshot level: already supported.
file level: not in the current table spec. I assume this is what you want for this feature request; please confirm.
row level: not in the current table spec, and I’m not even sure it is worth pursuing.

We can discuss it in the community sync. Feel free to add it to the agenda. I’d suggest writing a proposal before then. It can present a rough idea, or multiple ideas, and address concerns such as how this works with the current data file compaction, since compaction may break the order between multiple data files.
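
The difference between snapshot-level order and file-level order, and the compaction concern, can be illustrated with a toy model. This is only a sketch under stated assumptions: `Snapshot`, `readByCommitOrder`, and `compact` are hypothetical names, not Iceberg classes, and the "larger files first" compaction policy is an arbitrary choice made purely to show that a rewrite need not preserve write order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model for illustration; these are not Iceberg classes.
final class GranularitySketch {

    // One commit: a sequence number plus the rows of each data file it added.
    static final class Snapshot {
        final long sequenceNumber;
        final List<List<String>> addedFiles;

        Snapshot(long sequenceNumber, List<List<String>> addedFiles) {
            this.sequenceNumber = sequenceNumber;
            this.addedFiles = addedFiles;
        }
    }

    // Snapshot-level order: replay commits by sequence number.
    // Files inside one snapshot are read in whatever order the list holds them.
    static List<String> readByCommitOrder(List<Snapshot> snapshots) {
        List<Snapshot> sorted = new ArrayList<>(snapshots);
        sorted.sort(Comparator.comparingLong((Snapshot s) -> s.sequenceNumber));
        List<String> rows = new ArrayList<>();
        for (Snapshot s : sorted) {
            for (List<String> file : s.addedFiles) {
                rows.addAll(file);
            }
        }
        return rows;
    }

    // Toy compaction: rewrites every file into one file in a new snapshot,
    // placing larger files first (an arbitrary, assumed policy).
    static Snapshot compact(List<Snapshot> snapshots, long newSequenceNumber) {
        List<List<String>> files = new ArrayList<>();
        for (Snapshot s : snapshots) {
            files.addAll(s.addedFiles);
        }
        files.sort(Comparator.comparingInt((List<String> f) -> f.size()).reversed());
        List<String> merged = new ArrayList<>();
        for (List<String> file : files) {
            merged.addAll(file);
        }
        return new Snapshot(newSequenceNumber, List.of(merged));
    }

    public static void main(String[] args) {
        Snapshot s1 = new Snapshot(1, List.of(List.of("m1")));
        Snapshot s2 = new Snapshot(2, List.of(List.of("m2", "m3")));

        // One file per commit: commit order matches write order.
        System.out.println(readByCommitOrder(List.of(s2, s1)));

        // After the toy compaction the original file boundaries and their order are gone.
        System.out.println(readByCommitOrder(List.of(compact(List.of(s1, s2), 3))));
    }
}
```

With one data file per commit, replaying snapshots is enough to recover write order. Once several files share a snapshot, or a rewrite merges them, the reader needs extra information such as a per-file or per-row sequence, which is what the file-level and row-level options would have to add.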

2 reactions

openinx commented, Apr 24, 2022

However, we found that the current Iceberg Java reader implementation supports neither reading records in write order nor an ORDER BY operation on read.

It sounds like the Pulsar reader expects the Apache Iceberg reader to guarantee record-level write-order semantics. As far as I know, it is hard for a table format based on DFS files to maintain record-level ordering. Currently, the Iceberg table format is optimized for batch analysis readers, not for message queue consumers.

I think @rdblue and @RussellSpitzer have discussed an approach of dual writing to the Iceberg table and a message queue, so that downstream readers can consume records in write order, but I haven’t seen a public discussion or design document that describes the details. Maybe they can provide more input on this issue.
