AWS SDK API call failure metrics
Describe the issue
Hello, I am setting up monitoring for one of our dependencies, AWS AppConfig, specifically for availability, latency, and retries. I am using the AWS SDK metrics to do this.
I am using ApiCallDuration as the latency metric and RetryCount for retries. Since there is no straightforward metric for availability, I am using the ApiCallSuccessful metric: I take the SampleCount of ApiCallSuccessful and subtract the Sum of the same metric from it. ApiCallSuccessful is a boolean metric, and a value of 0 means the API call failed. See attached screenshot.
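To make the arithmetic above concrete, here is a minimal sketch of the SampleCount minus Sum calculation. The class and method names are hypothetical and not part of the AWS SDK; the sketch assumes each ApiCallSuccessful data point is recorded as 1 for success and 0 for failure.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the AWS SDK. It assumes every
// ApiCallSuccessful data point is 1 (success) or 0 (failure).
final class AvailabilityFromSamples {

    // SampleCount statistic: the number of recorded API calls.
    static long sampleCount(List<Integer> samples) {
        return samples.size();
    }

    // Sum statistic: with 0/1 values this is the number of successful calls.
    static long sum(List<Integer> samples) {
        long total = 0;
        for (int s : samples) {
            total += s;
        }
        return total;
    }

    // Failed calls = SampleCount - Sum.
    static long failedCalls(List<Integer> samples) {
        return sampleCount(samples) - sum(samples);
    }

    // Availability as a percentage. Returning 100.0 when nothing was
    // recorded is an assumption made for this sketch.
    static double availabilityPercent(List<Integer> samples) {
        long n = sampleCount(samples);
        return n == 0 ? 100.0 : 100.0 * sum(samples) / n;
    }

    public static void main(String[] args) {
        // Eight calls in one period, two of which failed.
        List<Integer> samples = Arrays.asList(1, 1, 0, 1, 0, 1, 1, 1);
        System.out.println("failed calls = " + failedCalls(samples));                 // failed calls = 2
        System.out.println("availability = " + availabilityPercent(samples) + "%");   // availability = 75.0%
    }
}
```

Note that this counts every call the SDK client makes, including calls issued by background threads, not only the calls that show up in request-path logs.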
When I do this, I see several failed calls regularly and consistently (about 32 every 2 minutes). I checked all our service logs and there are no signs of any failed calls.
Does this mean there is a bug in how these metrics are collected, or is this not how these metrics are supposed to be used? Please let me know how I can get availability metrics reliably using this approach.
Steps to Reproduce
Not completely sure if this is a bug.
Current behavior
32 “errors” every 2 minutes when inverting the ApiCallSuccessful metric and using it as a failed-calls metric, but no corresponding errors anywhere in the logs.
AWS Java SDK version used
2
JDK version used
1.8
Operating System and version
Amazon Linux 2
Issue Analytics
- Created a year ago
- Comments:7 (3 by maintainers)
@debora-ito, I found the issue. We have turned on the Background Polling configuration in the AWS AppConfig client for our service, which means the entries in the AppConfig cache are refreshed periodically in the background. We also use a file-based cache to persist the cache across host bounces and deployments. Finally, we have some E2E tests for our service that first create some test configuration profiles on AWS AppConfig, then make some requests to our service that use those test profiles, and finally clean up those profiles from AWS AppConfig.
The catch here is that no entries are ever deleted / evicted from the AppConfig client cache. This means there are some zombie entries in the AppConfig cache which it tries to refresh periodically. But since there is no corresponding profile for those entries on the remote AWS AppConfig repository, those calls always fail. This is causing the regular and periodic failures in the Availability metrics (the issue in the question).
TL;DR: the issue is not with the AWS SDK metrics per se, but with how the AWS AppConfig cache is designed. Thanks for the help on this.
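For illustration, one way a cache could avoid such zombie entries is to evict an entry when a background refresh finds that the remote profile no longer exists. The sketch below is hypothetical and is not the AppConfig client's actual implementation or API; the class name, the fetcher function, and the profile names are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache, NOT the AppConfig client's real implementation.
final class EvictingConfigCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    // Hypothetical fetcher: returns Optional.empty() when the profile
    // no longer exists in the remote repository.
    private final Function<String, Optional<String>> fetcher;

    EvictingConfigCache(Function<String, Optional<String>> fetcher) {
        this.fetcher = fetcher;
    }

    void put(String profile, String value) {
        entries.put(profile, value);
    }

    boolean contains(String profile) {
        return entries.containsKey(profile);
    }

    // One background refresh pass. Entries whose remote profile is gone
    // are evicted instead of being retried forever.
    // Returns the number of evicted entries.
    int refreshAll() {
        int evicted = 0;
        // ConcurrentHashMap tolerates modification during iteration.
        for (String profile : entries.keySet()) {
            Optional<String> fresh = fetcher.apply(profile);
            if (fresh.isPresent()) {
                entries.put(profile, fresh.get());
            } else {
                entries.remove(profile);
                evicted++;
            }
        }
        return evicted;
    }

    public static void main(String[] args) {
        // Stand-in for the remote repository; the E2E test profile
        // has already been cleaned up there.
        Map<String, String> remote = new HashMap<>();
        remote.put("prod-profile", "v2");

        EvictingConfigCache cache =
                new EvictingConfigCache(p -> Optional.ofNullable(remote.get(p)));
        cache.put("prod-profile", "v1");
        cache.put("e2e-test-profile", "stale");

        System.out.println("evicted on first pass = " + cache.refreshAll());   // evicted on first pass = 1
        System.out.println("evicted on second pass = " + cache.refreshAll());  // evicted on second pass = 0
    }
}
```

With eviction in place, the failed refresh happens once per deleted profile rather than on every polling interval, so it would no longer show up as a steady stream of failures in the availability metric.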
Thank you for the follow-up.