How to deal with error code `Local_State` with reason `Erroneous state` after upgrading to version 1.9.0
Hi @mhowlett,

recently we observed offsets being reset to earliest on random (partition, consumer group) pairs in our production environment. This was a huge issue for us, as we were flooded with millions of messages, causing lag on our consumers. Since we guarantee ordered processing without gaps, this was a challenge for us.

We believe this was an outcome of the bug fixed here: https://github.com/edenhill/librdkafka/pull/3774#issuecomment-1177475167. I've tried to ask there whether under such conditions we are now guaranteed to observe an exception, but perhaps @edenhill does not watch comments on merge requests that are already merged.
Description
Judging by your comments we think 1.9.0 addresses that case, so we performed an upgrade of Confluent.Kafka on our development environment and started testing.

After the update we observe a KafkaException with code Local_State and reason Erroneous state. From what I checked, this correlates with RD_KAFKA_RESP_ERR__STATE, which was involved in the merge request addressing the issue: https://github.com/edenhill/librdkafka/pull/3774
How to reproduce
We use auto-commit. After the upgrade to 1.9.0 we observe KafkaException(Local_State) when calling StoreOffset during batch processing. We already described our approach to batch processing here: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1164#issuecomment-610308425.

We consume a batch of messages for a given time, let's say a minute. Once the batch is aggregated we start processing it; after all messages are processed (which can take minutes) we store offsets for all consumed messages, preserving the order of consumption.

In the meantime we apparently lose partition assignments, because we observe this KafkaException with code Local_State. We see that the exception is not raised during the consume phase while the batch is being aggregated, which makes us think it is handled silently underneath and subsequent Consume calls simply return messages from partitions that are currently assigned.
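To make this concrete, here is a simplified sketch of our batch pattern (placeholder broker, group and topic names; the real code additionally guarantees ordered, gap-free processing):

```csharp
using System;
using System.Collections.Generic;
using Confluent.Kafka;

// Simplified sketch of the batch pattern described above (not our production code).
// Auto-commit is enabled, but offsets are stored manually via StoreOffset.
var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",   // placeholder
    GroupId = "batch-consumer-group",      // placeholder
    EnableAutoCommit = true,               // background commit of *stored* offsets
    EnableAutoOffsetStore = false          // we call StoreOffset ourselves
};

using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
consumer.Subscribe("my-topic");            // placeholder topic

var batch = new List<ConsumeResult<Ignore, string>>();
var deadline = DateTime.UtcNow.AddMinutes(1);

// 1. Aggregate a batch for roughly one minute.
while (DateTime.UtcNow < deadline)
{
    var result = consumer.Consume(TimeSpan.FromMilliseconds(100));
    if (result != null) batch.Add(result);
}

// 2. Process the whole batch (this can take minutes).
// ProcessBatch(batch);

// 3. Store offsets in consumption order; this is where Local_State surfaces
//    if a rebalance revoked some of these partitions in the meantime.
foreach (var result in batch)
{
    consumer.StoreOffset(result);          // may throw KafkaException(Local_State)
}
```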
Questions:
How should we deal with this exception:

1. Is it safe to log it and resume StoreOffset for the other messages that were aggregated in the batch, and after that start aggregating a new batch by calling Consume again?
2. What will the consumer instance do under the hood once it observes this exception during offset store (will it empty the locally fetched queues for unassigned partitions)?
3. Are we guaranteed to reprocess messages in order after this exception, without unsubscribing the consumer instance that observed it?
4. Could our approach lead to an error state where, during batch aggregation and processing, 2 rebalances happen and end with the partition reassigned to us, while the processed offset is no longer valid from the perspective of the current assignment?
5. Assuming question 4 can happen, we could have duplicates in the aggregated batch. Would the last stored offset for a given topic-partition then always win when it comes to committed offsets?
6. What was happening in such a case before 1.9.0? We never observed errors on StoreOffset before (even though un-assignments must have happened then too); could it result in gaps, with an instance that no longer owned a partition storing offsets that were never consumed/processed by the current owner?
7. Does the commit happen during Consume calls, or on a completely separate thread that can run independently of Consume, so that we could end up committing only part of the stored offsets of a given aggregated batch?
8. There are callbacks like SetPartitionsLostHandler; can they be used along with auto-commit just to notify ourselves that partitions were unassigned (see the sketch after this list)?
9. Are all callbacks always triggered by calling Consume, or can they be triggered from a different thread outside of Consume calls?
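For reference, this is roughly how we would wire up those handlers purely for notification (sketch with placeholder config values, relating to questions 8 and 9):

```csharp
using System;
using Confluent.Kafka;

// Sketch: register revoked/lost handlers only for visibility,
// while leaving auto-commit and manual offset store as before.
var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",   // placeholder
    GroupId = "batch-consumer-group",      // placeholder
    EnableAutoCommit = true,
    EnableAutoOffsetStore = false
};

using var consumer = new ConsumerBuilder<Ignore, string>(config)
    .SetPartitionsRevokedHandler((c, revoked) =>
    {
        // Partitions taken away as part of a normal rebalance.
        Console.WriteLine($"Revoked: {string.Join(", ", revoked)}");
    })
    .SetPartitionsLostHandler((c, lost) =>
    {
        // Partitions lost without a clean revocation (e.g. session timeout).
        Console.WriteLine($"Lost: {string.Join(", ", lost)}");
    })
    .Build();

consumer.Subscribe("my-topic");            // placeholder topic
```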
Checklist
Please provide the following information:
- [+] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
- [1.9.0] Confluent.Kafka nuget version.
- [2.6.0] Apache Kafka version.
- [-] Client configuration.
- [Linux containers] Operating system.
- [-] Provide logs (with “debug” : “…” as necessary in configuration).
- [-] Provide broker log excerpts.
- [-] Critical issue.
Top GitHub Comments
In the librdkafka PR linked in the description, librdkafka was updated to cause this error if an attempt is made to commit offsets for partitions that aren't currently assigned. You could be seeing the error because you are doing exactly that, or due to a bug in librdkafka that allows this to happen (or some other reason, but the former seems likely given it was a change to librdkafka in 1.9.0). Either way, the only negative thing that will have happened (in the first two cases) is that the offsets aren't committed and another consumer in the group will re-process the messages, so you can safely just ignore the exception.
If you provide a small test application that demonstrates the issue, we’d likely get to looking at it sooner.
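A sketch of what "ignore and continue" could look like for the batch offset store described in the issue (the helper name and shape are illustrative, not a specific recommendation):

```csharp
using System.Collections.Generic;
using Confluent.Kafka;

// Sketch: store offsets for an aggregated batch, skipping partitions
// that are no longer assigned to this consumer instance.
static class OffsetStoreHelper
{
    public static void StoreBatchOffsets<TKey, TValue>(
        IConsumer<TKey, TValue> consumer,
        IEnumerable<ConsumeResult<TKey, TValue>> batch)
    {
        foreach (var result in batch)
        {
            try
            {
                consumer.StoreOffset(result);
            }
            catch (KafkaException ex) when (ex.Error.Code == ErrorCode.Local_State)
            {
                // The partition was revoked/reassigned; the new owner re-processes
                // from its last committed offset, so skipping here is safe.
            }
        }
    }
}
```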
After trying to deploy at scale we still see this error even after Consume, including when calling other APIs. It does not look like this can be handled by any amount of synchronization. It may be related to paused and resumed partitions.

Scenarios I've discovered are:

- Synchronization doesn't appear to always work: we only ignore the error during Consume and reset the flag after Consume exits, yet we still occasionally get Erroneous state when trying to store an offset. I can see in the logs that this happens after Consume has already exited, rebalancing is done and the client still has the same partition assigned.
- Seek also fails on the consume thread, so no race with Consume is possible whatsoever. We pause partition 1; at some point we call Consume; rebalancing happens and we get a callback that partition 1 is revoked, then that partition 1 is assigned; Consume exits; we call Resume on the same thread (no parallel Consume is possible at this point) and Resume does not fail; we then call Seek on the same thread, and Seek fails with Erroneous state. In this case we log whether the partition is still in Assignment when the error occurs, and it still is.
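Roughly, the failing sequence looks like this (simplified sketch with placeholder names and offset; not a complete reproduction):

```csharp
using System;
using Confluent.Kafka;

// Sketch of the pause/consume/resume/seek sequence described above,
// all executed on the consume thread (placeholder config and names).
var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "batch-consumer-group",
    EnableAutoCommit = true,
    EnableAutoOffsetStore = false
};

using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
consumer.Subscribe("my-topic");

var partition = new TopicPartition("my-topic", new Partition(1));
consumer.Pause(new[] { partition });

// During this call a rebalance happens: the revoked handler fires for
// partition 1, then the assigned handler fires for partition 1 again.
consumer.Consume(TimeSpan.FromSeconds(1));

// Still on the consume thread, so no parallel Consume is possible here.
consumer.Resume(new[] { partition });                                    // does not throw

try
{
    consumer.Seek(new TopicPartitionOffset(partition, new Offset(42)));  // placeholder offset
}
catch (KafkaException ex) when (ex.Error.Code == ErrorCode.Local_State)
{
    // Seek fails with "Erroneous state" even though the partition is
    // still present in the local Assignment at this point.
    Console.WriteLine(consumer.Assignment.Contains(partition));          // logs True
}
```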