
[BUG] KOP failed to get partition offset

See original GitHub issue

@hangc0276 @BewareMyPower @codelipenghui Urgent help needed!

https://github.com/apache/pulsar/pull/11912

This PR added a check in asyncReadEntry that validates the requested position's ledger id against the ledger map:

    public void asyncReadEntry(PositionImpl position, ReadEntryCallback callback, Object ctx) {
        LedgerHandle currentLedger = this.currentLedger;
        if (log.isDebugEnabled()) {
            log.debug("[{}] Reading entry ledger {}: {}", name, position.getLedgerId(), position.getEntryId());
        }
        if (!ledgers.containsKey(position.getLedgerId())) {
            log.error("[{}] Failed to get message with ledger {}:{} the ledgerId does not belong to this topic "
                    + "or has been deleted.", name, position.getLedgerId(), position.getEntryId());
            callback.readEntryFailed(new ManagedLedgerException.NonRecoverableLedgerException("Message not found, "
                + "the ledgerId does not belong to this topic or has been deleted"), ctx);
            return;
        }
        // ... rest of method elided
    }

But the existing getFirstPosition method does not perform that check:

    public PositionImpl getFirstPosition() {
        Long ledgerId = ledgers.firstKey();
        if (ledgerId == null) {
            return null;
        }
        if (ledgerId > lastConfirmedEntry.getLedgerId()) {
            checkState(ledgers.get(ledgerId).getEntries() == 0);
            ledgerId = lastConfirmedEntry.getLedgerId();
        }
        return new PositionImpl(ledgerId, -1);
    }

Relevant logs:

2021-11-03 20:10:04.378 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [XXX/XXX/persistent/__consumer_offsets-partition-39] Start checking if current ledger is full

2021-11-03 20:10:04.381 [main-EventThread] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [XXXX/XXX/persistent/__consumer_offsets-partition-39] Creating a new ledger

2021-11-03 20:10:04.381 [main-EventThread] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [XXX/XXX/persistent/__consumer_offsets-partition-39] Creating ledger, metadata: {component=[B@685c61b2, pulsar/managed-ledger=[B@764fe8e9, application=[B@6d075d7a} - metadata ops timeout : 60 seconds

2021-11-03 20:10:04.382 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - : [XXX/XXX/persistent/__consumer_offsets-partition-39] Start TrimConsumedLedgers. ledgers=[33] totalSize=65

2021-11-03 20:10:04.382 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - : [XXX/XXX/persistent/__consumer_offsets-partition-39] Slowest consumer ledger id: 34

2021-11-03 20:10:04.386 [main-EventThread] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - : [XXX/XXX/persistent/__consumer_offsets-partition-39] Created new ledger 205

2021-11-03 21:11:15.991 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [XXXX/XXXX/persistent/__consumer_offsets-partition-39] Checking ledger 33 – time-clock: 1635945075991 ms – time-ledger: 1635941404381 ms – expired: true – over-quota: false – current-ledger: 205 – current-retention: 3600000

2021-11-04 10:51:02.032 [pulsar-io-4-8] ERROR i.s.pulsar.handlers.kop.KafkaRequestHandler - [PersistentTopic{topic=persistent://XXXX/XXXX/__consumer_offsets-partition-39}] Failed to get offset for position 33:0 org.apache.bookkeeper.mledger.ManagedLedgerException$NonRecoverableLedgerException: Message not found, the ledgerId does not belong to this topic or has been deleted

2021-11-04 03:08:14.994 [pulsar-io-4-8] WARN o.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - : getFirstPosition ledgerId 205 and lastConfirmedEntryLedgerId is 33 , entryid 0

The failing scenario: a partition creates a new ledger, but no new messages are produced, so lastConfirmedEntry still points at the partition's previous ledger. Once that previous ledger expires and is trimmed, KOP's lookup of the partition's first offset (used when computing the partition's message count) fails every time.
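The mismatch can be reproduced in a minimal, self-contained sketch (plain Java, with a TreeMap standing in for ManagedLedgerImpl's ledgers map; class and method names here are illustrative, not Pulsar's):

```java
import java.util.TreeMap;

public class StaleFirstPositionDemo {
    // Mirrors getFirstPosition(): take ledgers.firstKey(), but fall back to
    // the lastConfirmedEntry ledger when firstKey is beyond it.
    static long firstPositionLedgerId(TreeMap<Long, Integer> ledgers, long lceLedgerId) {
        long firstLedgerId = ledgers.firstKey();
        return firstLedgerId > lceLedgerId ? lceLedgerId : firstLedgerId;
    }

    public static void main(String[] args) {
        // Ledger 33 held the last confirmed entry; ledger 205 was then
        // created empty, and ledger 33 was trimmed for retention.
        TreeMap<Long, Integer> ledgers = new TreeMap<>(); // ledgerId -> entry count
        ledgers.put(205L, 0);
        long lceLedgerId = 33L; // stale: still points at the trimmed ledger

        long first = firstPositionLedgerId(ledgers, lceLedgerId); // falls back to 33
        // asyncReadEntry() then rejects the read: ledger 33 is gone from the
        // map, so the NonRecoverableLedgerException in the PR above fires.
        System.out.println("first position ledger: " + first);    // prints 33
        System.out.println("readable: " + ledgers.containsKey(first)); // prints false
    }
}
```

Because lastConfirmedEntry is never advanced past the trimmed ledger, every retry takes the same fallback branch and fails the same way.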

io.streamnative.pulsar.handlers.kop.utils.OffsetFinder

    public static PositionImpl getFirstValidPosition(ManagedLedgerImpl managedLedger) {
        PositionImpl firstPosition = managedLedger.getFirstPosition();
        if (firstPosition == null) {
            return null;
        } else {
            return managedLedger.getNextValidPosition(firstPosition);
        }
    }

Should we modify KOP's OffsetFinder so that, after obtaining firstPosition, it checks whether that ledger has already expired? Or should we change the getFirstPosition logic in ManagedLedgerImpl directly? I am not sure whether either change would introduce other problems, so I would appreciate the experts taking a look. Thanks.
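The KOP-side option could take roughly this shape, sketched here on a plain NavigableMap rather than on ManagedLedgerImpl (the helper name `firstValidLedgerId` and the fallback policy are assumptions for illustration, not existing KOP code; the real fix would consult the managed ledger's live ledger map and still run the result through getNextValidPosition):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class GuardedOffsetFinder {
    /**
     * Returns a ledger id that still exists in the ledger map. If the
     * candidate from getFirstPosition() was trimmed, fall back to the
     * smallest surviving ledger id instead of a dangling one.
     */
    static long firstValidLedgerId(NavigableMap<Long, Integer> ledgers,
                                   long candidateLedgerId) {
        if (ledgers.containsKey(candidateLedgerId)) {
            return candidateLedgerId;
        }
        // Candidate was trimmed: the smallest remaining ledger holds the
        // earliest data (or is the empty current ledger) we can still read.
        return ledgers.firstKey();
    }

    public static void main(String[] args) {
        NavigableMap<Long, Integer> ledgers = new TreeMap<>();
        ledgers.put(205L, 0);            // only the empty current ledger survives
        long candidate = 33L;            // stale value from getFirstPosition()
        System.out.println(firstValidLedgerId(ledgers, candidate)); // prints 205
    }
}
```

This keeps the fix local to KOP and leaves ManagedLedgerImpl's semantics untouched, at the cost of every caller of getFirstPosition needing the same guard.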

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

1 reaction
BewareMyPower commented, Nov 12, 2021

We should evaluate the impact of updating lastConfirmedEntry in internalTrimLedgers, because some methods might rely on the current semantics, in which lastConfirmedEntry can point to a deleted entry.

Unless we can add a unit test to prove the current semantics might cause some problems, we should not change it.

1 reaction
Jason918 commented, Nov 9, 2021

Reopening this issue since it might also be a bug on the Pulsar side.

+1, it seems that lastConfirmedEntry is not updated in ManagedLedgerImpl#internalTrimLedgers.
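The Pulsar-side alternative the comments discuss can be sketched on the same toy model. Note the big assumption flagged here, which is exactly what BewareMyPower cautions must be evaluated first: that it is safe to advance lastConfirmedEntry once its ledger has been trimmed. The method below is hypothetical, not internalTrimLedgers itself:

```java
import java.util.TreeMap;

public class TrimAdvancesLceDemo {
    // Sketch: after trimming a ledger, advance a stale lastConfirmedEntry
    // ledger id to the newest surviving ledger so it can never dangle.
    static long trimAndAdvanceLce(TreeMap<Long, Integer> ledgers,
                                  long trimmedLedgerId, long lceLedgerId) {
        ledgers.remove(trimmedLedgerId);
        if (!ledgers.containsKey(lceLedgerId)) {
            lceLedgerId = ledgers.lastKey(); // the (possibly empty) current ledger
        }
        return lceLedgerId;
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> ledgers = new TreeMap<>();
        ledgers.put(33L, 1);   // old ledger holding the last confirmed entry
        ledgers.put(205L, 0);  // new, empty current ledger
        // Trim expires ledger 33; the sketch also repairs the stale
        // lastConfirmedEntry, so getFirstPosition can no longer dangle.
        long lce = trimAndAdvanceLce(ledgers, 33L, 33L);
        System.out.println("lastConfirmedEntry ledger after trim: " + lce); // prints 205
    }
}
```

With lastConfirmedEntry repaired this way, getFirstPosition's fallback branch would resolve to a ledger that still exists, fixing all callers at once, but any code relying on the old "LCE may point to a deleted entry" semantics would need a unit test first, as noted above.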

