AWS RDS failover triggers "the database system is shutting down" flood

Hi 👋🏻

Thanks for this awesomely maintained & documented library 🙏🏻

Problem

We're using AWS RDS for PostgreSQL with Multi-AZ and found the failover process taking a little too long for our liking (~5 minutes). During that time, we saw a flood of these errors from our applications:

{
    "code": "57P03",
    "file": "postmaster.c",
    "length": 106,
    "line": "2344",
    "message": "the database system is shutting down",
    "name": "error",
    "routine": "ProcessStartupPacket",
    "severity": "FATAL",
    "stack": "error: the database system is shutting down\n    at Parser.parseErrorMessage (/app/node_modules/pg-protocol/dist/parser.js:287:98)\n    at Parser.handlePacket (/app/node_modules/pg-protocol/dist/parser.js:126:29)\n    at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:39:38)\n    at TLSSocket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)\n    at TLSSocket.emit (events.js:400:28)\n    at addChunk (internal/streams/readable.js:290:12)\n    at readableAddChunk (internal/streams/readable.js:265:9)\n    at TLSSocket.Readable.push (internal/streams/readable.js:204:10)\n    at TLSWrap.onStreamRead (internal/stream_base_commons.js:188:23)",
    "type": "DatabaseError"
}

Digging

From what I understand, the only real option is to force DNS to refresh ASAP. I found that pg appears to perform some DNS lookups itself, though I'm not sure whether that's the case for just the native component or for all connections.

I've also come to learn that node has its own DNS troubles. The DNS TTL for RDS is 5 seconds, which is quite a bit less than the 60-120 seconds a failover is supposed to take, and both are much less than the 5-minute outage we saw.

I honestly can't be sure DNS is the issue here, but it's the only angle I've got so far. pg doesn't expose a way to take direct control over DNS resolution of the host name, which is where I'm stuck.
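
For illustration, the kind of manual DNS control I have in mind looks something like the sketch below, which just queries the record directly with node's built-in dns module to see the current address and advertised TTL (RDS_HOST is a placeholder for our endpoint). Actually feeding a resolved IP to pg is trickier with TLS in the mix because of certificate host name checks, so treat this as diagnostics only.

import { promises as dns } from 'dns';

// Placeholder for the real RDS instance endpoint.
const RDS_HOST = 'mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com';

async function checkDns(): Promise<void> {
  // A dedicated Resolver bypasses dns.lookup()/getaddrinfo and asks the
  // configured name servers directly, returning the answer's TTL as well.
  const resolver = new dns.Resolver();
  const records = await resolver.resolve4(RDS_HOST, { ttl: true });
  console.log(records); // e.g. [{ address: '10.0.1.23', ttl: 5 }]
}

checkDns().catch(console.error);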

Question

Is this something pg could manage more gracefully or expose more control over? I'd appreciate any advice to cut down the outage window. However, we are using pg via knex, which has its own limitations…

Versions

node: v14.18
pg: v8.7.1

Related

Possibly related (old) issue: https://github.com/brianc/node-postgres/issues/1075


Top GitHub Comments

3 reactions
shousper commented, Jan 25, 2022

Okay, I've done about as much thorough testing as I can. I'm honestly not sure any problem necessarily lies with either pg or knex, but I can say that RDS Proxy didn't solve anything for us. It actually made matters worse.

tl;dr. If you want fast failover, make sure your DNS resolution respects the TTL and use a heartbeat loop to determine if a connection should be discarded.

knex uses prepared statements exclusively, which results in a 100% session-pinning rate via RDS Proxy. We've found some folks hacking on the tarn config along with some event hooks to aggressively close connections (sketched below), but I didn't find that desirable in our environment. Thrashing the knex connection pool while RDS Proxy essentially does the same with every query didn't seem like the right path to me: each connection gets pinned, the query is executed, then the connection is released, and on top of that the overhead of establishing a database connection for each query execution isn't zero. I understand the benefits RDS Proxy is offering, but we need to be able to pool connections in our applications for rapid reuse. (Side note: I'm fairly sure RDS Proxy subscribes to events from the RDS API to know when a failover occurs, which is how it can redirect traffic so quickly.)
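
For reference, the kind of tarn tweaking I mean looks roughly like the sketch below; the numbers are purely illustrative and not a recommendation.

import knex from 'knex';

// Illustrative only: recycle idle connections very aggressively so the pool
// doesn't hold on to stale connections for long after a failover.
const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST },
  pool: {
    min: 0,                   // allow the pool to drain completely when idle
    max: 10,
    idleTimeoutMillis: 1_000, // evict idle connections almost immediately
    reapIntervalMillis: 500,  // check for idle connections frequently
  },
});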

After much digging and many simulated failovers & disconnections, I've concluded that neither knex nor pg is built to handle the network being ripped out from under your feet. I'm no network engineer, but I believe that's exactly what RDS does during a failover: existing connections aren't closed gracefully, they're left dangling. On the plus side, this is a great way to test for such a network event. How do you automatically recover when your services are happily running, but the database (or another dependency) simply stops responding and your connections are left trying to send data into the ether? Should a database client be smart enough to tackle this? I don't know.

My approach to the problem is two-fold: timeouts & probes.

We'd previously neglected to set global query/statement timeouts on the pg client (example below). Setting these at least yields timely errors for your API calls, etc.; however, neither pg nor knex is clever enough to infer that a series of timeout errors might mean the connection is dead and should be discarded. In knex, the connection that encountered the timeout isn't provided with the error, so you cannot close it to have it removed from the pool. The timeout error originates from pg but isn't emitted via its 'error' event; I believe it's thrown & captured by knex, which is where the relationship is lost.
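
For clarity, the timeouts in question are the ones pg accepts on its connection config, which knex passes straight through; the values below are illustrative.

import knex from 'knex';

const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    statement_timeout: 10_000,      // server-side: cancel statements after 10s
    query_timeout: 15_000,          // client-side: reject the query promise after 15s
    connectionTimeoutMillis: 5_000, // fail fast if a new connection can't be established
  },
});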

With this in mind, I created a probe or "heartbeat" loop to attach to each pg.Client (generated by knex). It essentially looks like this:

import type * as pg from 'pg';

// `sleep` is a promise-based delay; `withRetry` and `logger` are our own
// retry wrapper and structured logger (any equivalents will do).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function probe(c: pg.Client): Promise<void> {
  let connected = true;
  // Stop probing once the connection ends for any reason.
  c.on('end', () => (connected = false));

  try {
    while (connected) {
      await sleep(5_000);
      // A trivial query that should come back in <10ms on a healthy
      // connection; three rapid failures are treated as a dead connection.
      await withRetry(() => c.query('SELECT 1'), {
        retries: 3,
        minTimeout: 0,
        maxTimeout: 10,
      });
    }
  } catch (error) {
    logger.warn({ error }, 'Probe failed, closing connection');
    // Ending the client marks it dead so the pool can evict it.
    await c.end();
  }
}

The query above should return in <10ms, and more than a few failures indicate a connection issue, so we can end it. A connection ended this way can then be identified and removed from the pool. knex exposes an afterCreate option for its pool (tarn) that can be used to hook into new pg.Client instances like this (sketched below).
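
Wiring that up looks roughly like the sketch below, where probe is the function above and errors from the fire-and-forget call are simply swallowed.

import knex from 'knex';
import type * as pg from 'pg';

const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST },
  pool: {
    // Called by tarn for every raw pg.Client it creates.
    afterCreate: (conn: pg.Client, done: (err: Error | null, conn: pg.Client) => void) => {
      // Fire and forget: the probe ends the client itself when it detects a
      // dead connection, which is what lets the pool evict it.
      probe(conn).catch(() => undefined);
      done(null, conn);
    },
  },
});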

This completely eliminates the need for RDS Proxy for us. We can fail over the database and expect our services to recover automatically in roughly 15-30s (speculation, based on testing) and manage their own pools of connections.

What surprised me about this is that I might have expected PostgreSQL to have a heartbeat built into its client protocol… but apparently not? I'm unaware of a better way to deal with this kind of connection-loss scenario, and the overhead seems negligible given that queries are still serviced quickly and failovers remain quite brief.
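
For completeness: the closest built-in thing I know of is TCP keepalive, which pg can enable via its client config. That's an OS-level mechanism rather than a protocol heartbeat, and with default kernel settings it can take a long time to flag a dead peer, so it complements rather than replaces the probe. A minimal sketch:

import { Client } from 'pg';

// OS-level TCP keepalive on the pg socket (not a protocol-level heartbeat).
const client = new Client({
  host: process.env.DB_HOST,
  keepAlive: true,
  keepAliveInitialDelayMillis: 10_000, // start keepalive probes after 10s of idle
});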

The "database system is shutting down" error is actually harder (rarer?) to encounter than before. I'll be doing further testing across all our services later this week to further validate the approach.

3 reactions
FagnerMartinsBrack commented, Jan 9, 2022

Can you let us know if RDS Proxy worked for you after you try it?

Ok, I did test it. I used RDS Proxy, changing the "DB_HOST" field without any other change, including no change to the Pool code. Result: less than 5s downtime after a reboot with failover.

I'll keep paying AWS for that stability and keep RDS Proxy + Multi-AZ active.

