
How to deal with error code `Local_State` with reason `Erroneous state` after upgrading to version 1.9.0

See original GitHub issue

Hi @mhowlett,

Recently we observed offsets being reset to earliest on random (partition, consumer-group) pairs in our production environment. This was a huge issue for us, as we were flooded with millions of messages, causing lag on our consumers. Since we guarantee ordered processing without gaps, this was a real challenge for us.

We believe this was an outcome of the bug fixed here: https://github.com/edenhill/librdkafka/pull/3774#issuecomment-1177475167. I asked there whether we are now guaranteed to observe an exception under such conditions, but perhaps @edenhill does not follow comments on pull requests that have already been merged.

Description

Judging by your comments, we think 1.9.0 addresses that case, so we upgraded Confluent.Kafka in our development environment and started testing.

After the update we observe a KafkaException with code Local_State and reason Erroneous state. From what I checked, it corresponds to RD_KAFKA_RESP_ERR__STATE, which is involved in the pull request addressing this: https://github.com/edenhill/librdkafka/pull/3774

How to reproduce

We use auto-commit.

After upgrading to 1.9.0 we observe KafkaException(Local_State) when calling StoreOffset during batch processing. We already described our approach to batch processing here: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1164#issuecomment-610308425.

We consume a batch of messages for a fixed window, let's say a minute. Once we have aggregated the batch we start processing, and after all messages are processed (which can take minutes) we store offsets for all consumed messages, preserving the order of consumption.

In the meantime we lose partition assignments, which is when we observe this KafkaException with code Local_State.

We see that this exception is not observed while aggregating the batch in the consume phase, which makes us think it is probably handled silently underneath, and subsequent Consume calls simply return messages from partitions that are currently assigned.
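For reference, here is a minimal sketch of that flow. The broker address, topic, and group id are illustrative (not our real configuration), and we set EnableAutoOffsetStore = false so that offsets are only stored via StoreOffset after the whole batch has been processed:

```csharp
using System;
using System.Collections.Generic;
using Confluent.Kafka;

// Illustrative configuration, not our production settings.
var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "batch-consumer-group",
    EnableAutoCommit = true,        // auto-commit of offsets stored via StoreOffset
    EnableAutoOffsetStore = false,  // we store offsets ourselves, after processing
    AutoOffsetReset = AutoOffsetReset.Earliest
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("my-topic");     // illustrative topic

var batch = new List<ConsumeResult<string, string>>();
var deadline = DateTime.UtcNow + TimeSpan.FromMinutes(1);

// 1) Aggregate a batch for roughly a minute.
while (DateTime.UtcNow < deadline)
{
    var cr = consumer.Consume(TimeSpan.FromMilliseconds(100));
    if (cr != null) batch.Add(cr);
}

// 2) Process the whole batch; this can take minutes.
ProcessBatch(batch);

// 3) Store offsets in consumption order. If a rebalance revoked one of these
//    partitions in the meantime, this is where 1.9.0 surfaces it: StoreOffset
//    throws a KafkaException with ErrorCode.Local_State ("Erroneous state").
foreach (var cr in batch)
{
    consumer.StoreOffset(cr);
}

static void ProcessBatch(List<ConsumeResult<string, string>> messages)
{
    // Placeholder for our ordered, gap-free processing.
}
```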

Questions:

How should we deal with this exception:

  1. Is it safe to log it, continue calling StoreOffset for the other messages aggregated in the batch, and afterwards start aggregating a new batch by calling Consume again?
  2. What does the consumer instance do under the hood once this exception is observed during offset store (will it empty the locally fetched queues for unassigned partitions)?
  3. Are we guaranteed to reprocess messages in order after this exception, without unsubscribing the consumer instance that observed it?
  4. Could our approach lead us into an error state if, during batch aggregation and processing, two rebalances happen that end with the partition reassigned to us (but the processed offset is not valid from the perspective of the current assignment)?
  5. Assuming point 4 is possible, we could have duplicates in the aggregated batch. Would the last stored offset for a given topic-partition then always win when it comes to committed offsets?
  6. What was happening in such a case before 1.9.0? We never observed errors on StoreOffset before (even though un-assignments must have happened then too). Could it result in gaps, with an instance that no longer owned a partition storing offsets that were not consumed/processed by the current owner?
  7. Does the commit happen during Consume calls, or does it run on a completely separate thread, so that it can occur independently of Consume and we could end up committing only partially stored offsets of a given aggregated batch?
  8. There are callbacks like SetPartitionsLostHandler; can they be used along with auto-commit just to notify ourselves that partitions were unassigned (see the sketch after this list)?
  9. Are all callbacks always triggered by calling Consume, or can they be triggered from a different thread, outside of Consume calls?
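Regarding question 8, this is roughly how we imagine wiring the revoked/lost handlers purely for notification while keeping auto-commit enabled; a sketch with illustrative configuration, not something we have verified:

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",   // illustrative
    GroupId = "batch-consumer-group",      // illustrative
    EnableAutoCommit = true,
    EnableAutoOffsetStore = false
};

using var consumer = new ConsumerBuilder<string, string>(config)
    .SetPartitionsRevokedHandler((c, revoked) =>
    {
        // Partitions handed back as part of a rebalance (graceful hand-over).
        Console.WriteLine($"Revoked: {string.Join(", ", revoked)}");
    })
    .SetPartitionsLostHandler((c, lost) =>
    {
        // Partitions lost without a graceful hand-over (e.g. session timeout);
        // calling StoreOffset for these partitions would now raise Local_State.
        Console.WriteLine($"Lost: {string.Join(", ", lost)}");
    })
    .Build();

consumer.Subscribe("my-topic");            // illustrative topic

// In our understanding these handlers fire from within the Consume call
// (relevant to question 9), so this loop is where the notifications appear.
while (true)
{
    consumer.Consume(TimeSpan.FromMilliseconds(100));
}
```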

Checklist

Please provide the following information:

  • [+] A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • [1.9.0] Confluent.Kafka nuget version.
  • [2.6.0] Apache Kafka version.
  • [-] Client configuration.
  • [Linux containers] Operating system.
  • [-] Provide logs (with “debug” : “…” as necessary in configuration).
  • [-] Provide broker log excerpts.
  • [-] Critical issue.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 8
  • Comments: 13 (3 by maintainers)

Top GitHub Comments

1 reaction
mhowlett commented, Aug 7, 2022

In the librdkafka PR linked in the description, librdkafka was updated to raise this error if an attempt is made to commit offsets for partitions that aren't currently assigned. You could be seeing the error because you are doing that, or due to a bug in librdkafka that allows this to happen (or some other reason, but the former seems likely given it was an update to librdkafka in 1.9.0). Either way, the only negative thing that will have happened (in the first two cases) is that the offsets aren't committed and another consumer in the group will re-process the messages, so you can safely just ignore the exception.

If you provide a small test application that demonstrates the issue, we’d likely get to looking at it sooner.
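Following that advice, catching and ignoring only this specific error around StoreOffset could look roughly like this (the extension method name is illustrative, not part of Confluent.Kafka):

```csharp
using System;
using Confluent.Kafka;

static class OffsetStoreExtensions
{
    // Store the offset for a consumed message, ignoring the Local_State error
    // that 1.9.0 raises when the partition is no longer assigned to this
    // consumer. Helper name is illustrative, not part of Confluent.Kafka.
    public static void StoreOffsetIgnoringLostPartitions<TKey, TValue>(
        this IConsumer<TKey, TValue> consumer,
        ConsumeResult<TKey, TValue> result)
    {
        try
        {
            consumer.StoreOffset(result);
        }
        catch (KafkaException ex) when (ex.Error.Code == ErrorCode.Local_State)
        {
            // The offset is simply not stored or committed; the consumer that
            // now owns the partition re-processes from its own committed position.
            Console.WriteLine(
                $"Skipping offset store for {result.TopicPartition}: {ex.Error.Reason}");
        }
    }
}
```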

0 reactions
sershe-ms commented, Nov 23, 2022

After trying to deploy at scale we still see this error even after Consume, including when calling other APIs. This seems like something that no amount of synchronization on our side would handle. It may be related to paused and resumed partitions.

Scenarios I’ve discovered are:

  1. Synchronization doesn't always appear to work: we only ignore the error during Consume and reset the flag after Consume exits, yet we still occasionally get an erroneous-state error when trying to store an offset. I can see in the logs that this happens after Consume has already exited, rebalancing is done, and the client still has the same partition assigned.

  2. Seek also fails on the consume thread, so no race with Consume is possible at all. We pause partition 1; at some point we call Consume; rebalancing happens and we get callbacks that partition 1 is revoked and then assigned again; Consume exits; we call Resume on the same thread (no parallel Consume is possible at this point) and Resume does not fail; we then call Seek on the same thread and it fails with erroneous state. On error we log whether the partition is still in Assignment, and it still is (see the sketch below).
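For clarity, the single-threaded sequence looks roughly like this (topic, group id, and partition number are illustrative, and partition 1 is assumed to already be assigned to this consumer):

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",     // illustrative
    GroupId = "pause-resume-seek-repro"      // illustrative
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("my-topic");

// Assumes partition 1 of my-topic has been assigned to this consumer.
var tp = new TopicPartition("my-topic", 1);

consumer.Pause(new[] { tp });

// A rebalance completes inside this Consume call: the revoked callback fires
// for partition 1, and then it is assigned back to this same consumer.
consumer.Consume(TimeSpan.FromSeconds(5));

// Everything below runs on the same thread, so no parallel Consume can race.
consumer.Resume(new[] { tp });                              // does not fail

// Reported to throw KafkaException with ErrorCode.Local_State ("Erroneous
// state") here, even though tp is still present in consumer.Assignment.
consumer.Seek(new TopicPartitionOffset(tp, Offset.Beginning));

Console.WriteLine(consumer.Assignment.Contains(tp));        // still true
```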
