AWS SDK API call failure metrics

Describe the issue

Hello, I was setting up monitoring for one of our dependencies, AWS AppConfig, specifically for Availability, Latency and Retries. I am using AWS SDK metrics to do this: ApiCallDuration as the latency metric and RetryCount for retries. Since there is no straightforward metric for Availability, I am using the ApiCallSuccessful metric. I take the SampleCount of ApiCallSuccessful and subtract the Sum of the same metric from it; ApiCallSuccessful is a boolean metric, so a value of 0 means the API call failed. See attached screenshot.
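The arithmetic described above can be sketched in a few lines. This is only an illustration of the SampleCount minus Sum calculation, not code from the issue: the class and method names are hypothetical, and the sample values (1000 calls, 968 successes) are made up so that the result matches the 32 failures reported below.

```java
public final class AvailabilityFromBooleanMetric {

    /**
     * For a metric that records 1 on success and 0 on failure,
     * SampleCount is the number of calls and Sum is the number of successes,
     * so the difference is the number of failed calls in the period.
     */
    static long failedCalls(long sampleCount, double sum) {
        return sampleCount - Math.round(sum);
    }

    /** Fraction of successful calls; undefined (NaN) when no calls were recorded. */
    static double availability(long sampleCount, double sum) {
        if (sampleCount == 0) {
            return Double.NaN;
        }
        return sum / sampleCount;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for one 2-minute period.
        long sampleCount = 1000;
        double sum = 968.0;

        System.out.println("failed calls: " + failedCalls(sampleCount, sum));
        System.out.println("availability: " + availability(sampleCount, sum));
    }
}
```

A period with no datapoints is reported as NaN rather than 100% so that a silent client is not mistaken for a healthy one.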

When I do this, I see several failed calls regularly and consistently (about 32 every 2 minutes). I checked all our service logs and there are no signs of any failed calls.

Does this mean there is a bug in how these metrics are collected? Or is this not how these metrics are supposed to be used? Please let me know how I can get Availability metrics reliably using this.

(Screenshot: Screen Shot 2022-03-28 at 12 21 41 PM)

Steps to Reproduce

Not completely sure if this is a bug.

Current behavior

32 “errors” every 2 minutes when inverting the ApiCallSuccessful metric to use it as a failed-calls metric, but no corresponding errors anywhere in the logs.

AWS Java SDK version used

2

JDK version used

1.8

Operating System and version

Amazon Linux 2

Issue Analytics

  • State:closed
  • Created a year ago
  • Comments:7 (3 by maintainers)

Top GitHub Comments

HimalayPatel commented, Apr 22, 2022

@debora-ito, I found the issue. We have turned on Background Polling configuration in the AWS AppConfig client for our service. This means the entries in the AppConfig cache will refresh periodically in the background. We also use a file based cache for cache persistence across host-bounces and deployments. Finally, we have some E2E tests for our service which first create some test configuration profiles on AWS AppConfig, make some requests to our service that use those test profiles and finally clean up those profiles from AWS AppConfig.

The catch here is that no entries are ever deleted / evicted from the AppConfig client cache. This means there are some zombie entries in the AppConfig cache which it tries to refresh periodically. But since there is no corresponding profile for those entries on the remote AWS AppConfig repository, those calls always fail. This is causing the regular and periodic failures in the Availability metrics (the issue in the question).
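The failure mode described above can be reproduced with a toy model. The sketch below is not the AppConfig client: `RefreshingCache`, its methods, and the `evictOnMissing` flag are all hypothetical, and the remote repository is stood in for by a plain `Map`. It shows that a cache which never evicts keeps producing one failed refresh per deleted profile on every polling cycle, while a hypothetical evict-on-missing policy would fail only once.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public final class ZombieCacheSketch {

    /** Toy cache that refreshes every entry it has ever loaded. */
    static final class RefreshingCache {
        private final Map<String, String> entries = new LinkedHashMap<>();
        private final boolean evictOnMissing; // hypothetical policy, not an AppConfig setting

        RefreshingCache(boolean evictOnMissing) {
            this.evictOnMissing = evictOnMissing;
        }

        /** Caches the current remote value for a profile. */
        void load(String profile, Map<String, String> remote) {
            entries.put(profile, remote.get(profile));
        }

        /** One background polling cycle; returns how many refresh calls failed. */
        int refreshAll(Map<String, String> remote) {
            int failures = 0;
            Iterator<Map.Entry<String, String>> it = entries.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, String> entry = it.next();
                String fresh = remote.get(entry.getKey());
                if (fresh == null) {
                    failures++; // the profile no longer exists remotely
                    if (evictOnMissing) {
                        it.remove();
                    }
                } else {
                    entry.setValue(fresh);
                }
            }
            return failures;
        }
    }

    public static void main(String[] args) {
        Map<String, String> remote = new HashMap<>();
        remote.put("prod-profile", "v1");
        remote.put("e2e-test-profile", "v1");

        RefreshingCache cache = new RefreshingCache(false);
        cache.load("prod-profile", remote);
        cache.load("e2e-test-profile", remote);

        // The E2E test cleans up its profile, but the cache entry stays behind.
        remote.remove("e2e-test-profile");

        System.out.println("cycle 1 failures: " + cache.refreshAll(remote));
        System.out.println("cycle 2 failures: " + cache.refreshAll(remote));
    }
}
```

With a persisted file-based cache, as in the setup described here, such stale entries would also survive restarts, which is consistent with the failures appearing at a steady rate.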

TL;DR: the issue is not with the AWS SDK metrics per se but with how the AWS AppConfig cache is designed. Thanks for the help on this.

debora-ito commented, Apr 22, 2022

Thank you for the follow-up.
