[Feature Request] Iceberg integrates with Pulsar, supports java to read iceberg tables sequentially

Motivation

Apache Pulsar is integrating with Iceberg, using Iceberg as tiered storage to offload cold topic data. When a consumer fetches cold data from a topic, the Pulsar broker determines whether the target data is still stored in Pulsar. If it has been offloaded to tiered storage (Iceberg), the broker fetches the data from Iceberg through the Java API, repackages it into Pulsar's format, and dispatches it to the consumer.

For the Pulsar Iceberg integration, we first use an Iceberg writer to stream a topic's messages into an Iceberg table from a single thread. We then use an Iceberg reader to stream records back out of the table. In Pulsar's read path, records must be read in the same order in which they were written. However, we found that the current Iceberg Java reader implementation does not support reading records in order, nor does it support an order-by operation on read.

What we need

  • We need Iceberg to support reading records in the order they were written, or ordering by specific fields.
  • We need to read the change log of an Iceberg table as a stream.
  • Does the Iceberg community have a plan to support this feature?
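
The "order by specific fields" option can be pictured as a reader-side merge. The following is a minimal sketch, not Iceberg's reader: it assumes the writer stamps every message with a monotonically increasing `sequence` field and that each data file is internally sorted by that field. `OrderedMergeSketch`, `Message`, and `mergeBySequence` are hypothetical names, and data files are modeled as in-memory lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model (assumption): restore writer order across data files by merging on a sequence field. */
final class OrderedMergeSketch {

    /** Hypothetical record shape: the writer assigns each message an increasing sequence number. */
    static final class Message {
        final long sequence;
        final String payload;

        Message(long sequence, String payload) {
            this.sequence = sequence;
            this.payload = payload;
        }
    }

    /** Pairs the next unread message of one file with the iterator it came from. */
    private static final class Cursor {
        final Message head;
        final Iterator<Message> rest;

        Cursor(Message head, Iterator<Message> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    /**
     * Merges files that are each sorted by sequence into one globally sorted list.
     * The files themselves may be passed in any order, as an unordered scan would return them.
     */
    static List<Message> mergeBySequence(List<List<Message>> files) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head.sequence));
        for (List<Message> file : files) {
            Iterator<Message> it = file.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        List<Message> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest pending sequence across all files is next in writer order.
            Cursor c = heap.poll();
            ordered.add(c.head);
            if (c.rest.hasNext()) {
                heap.add(new Cursor(c.rest.next(), c.rest));
            }
        }
        return ordered;
    }
}
```

The merge keeps only one pending record per file in memory, at the cost of keeping every file in the scanned range open at once.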

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

3 reactions
flyrain commented, Apr 25, 2022

In terms of order granularity, there are three levels:

  1. Snapshot level: this already exists.
  2. File level: not in the current table spec. I assume this is what you want for this feature request; please confirm.
  3. Row level: not in the current table spec, and I’m not sure it is worth pursuing.

We can discuss this in the community sync; feel free to add it to the agenda. I’d suggest writing a proposal beforehand. It can present a rough idea, or several ideas, and it could address concerns such as how the feature works with the current data file compaction, since compacting data files may break the order between multiple data files.
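
To illustrate what snapshot-level order gives a reader, here is a minimal sketch that walks a snapshot lineage from the current snapshot back through its parents and returns the IDs oldest first. It is a toy model rather than the Iceberg API: `SnapshotLineageSketch`, `Snap`, and `oldestToNewest` are hypothetical names, loosely modeled on the `snapshotId()` and `parentId()` accessors of Iceberg's `Snapshot` interface.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (assumption): order snapshots by following parent links from the current snapshot. */
final class SnapshotLineageSketch {

    /** Hypothetical stand-in for a table snapshot; parentId is null for the first snapshot. */
    static final class Snap {
        final long id;
        final Long parentId;

        Snap(long id, Long parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    /** Returns the IDs on the current snapshot's lineage, oldest first. */
    static List<Long> oldestToNewest(List<Snap> snapshots, long currentId) {
        Map<Long, Snap> byId = new HashMap<>();
        for (Snap s : snapshots) {
            byId.put(s.id, s);
        }
        Deque<Long> lineage = new ArrayDeque<>();
        Snap cursor = byId.get(currentId);
        while (cursor != null) {
            // Prepend so that the walk from newest to oldest ends up oldest first.
            lineage.addFirst(cursor.id);
            cursor = cursor.parentId == null ? null : byId.get(cursor.parentId);
        }
        return new ArrayList<>(lineage);
    }
}
```

Reading each snapshot's newly added data in this order preserves commit order, but it says nothing about the order of files or rows within a single snapshot, which is the gap the file-level and row-level items describe.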

2 reactions
openinx commented, Apr 24, 2022

"However, we found current Iceberg Java reader implementation doesn’t support read records by order or doesn’t support order by operation on reading."

It sounds like the Pulsar reader expects the Apache Iceberg reader to guarantee record-level write-order semantics. As far as I know, it is hard for a table format based on DFS files to maintain record-level ordering. Currently, the Iceberg table format is optimized for batch analytics readers, not for message queue consumers.

I think @rdblue and @RussellSpitzer have discussed an approach that dual-writes to the Iceberg table and a message queue, so that downstream readers can consume records in write order, but I haven’t seen a public discussion or design document describing the details. Maybe they can provide more input on this issue.
