AWS RDS failover triggers "the database system is shutting down" flood

Hi 👋🏻

Thanks for this awesomely maintained & documented library 🙏🏻

Problem

We're using AWS RDS for PostgreSQL with Multi-AZ and found the failover process taking a little too long for our liking (~5 minutes). During that time, we saw a flood of these errors from our applications:

{
    "code": "57P03",
    "file": "postmaster.c",
    "length": 106,
    "line": "2344",
    "message": "the database system is shutting down",
    "name": "error",
    "routine": "ProcessStartupPacket",
    "severity": "FATAL",
    "stack": "error: the database system is shutting down\n    at Parser.parseErrorMessage (/app/node_modules/pg-protocol/dist/parser.js:287:98)\n    at Parser.handlePacket (/app/node_modules/pg-protocol/dist/parser.js:126:29)\n    at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:39:38)\n    at TLSSocket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)\n    at TLSSocket.emit (events.js:400:28)\n    at addChunk (internal/streams/readable.js:290:12)\n    at readableAddChunk (internal/streams/readable.js:265:9)\n    at TLSSocket.Readable.push (internal/streams/readable.js:204:10)\n    at TLSWrap.onStreamRead (internal/stream_base_commons.js:188:23)",
    "type": "DatabaseError"
}

Digging

From what I understand, the only real option is to force DNS to refresh ASAP. I found that pg appears to perform some DNS lookups itself, though I'm not sure whether that's the case for just the native component or for all connections.

I've also come to learn that node has its own DNS troubles. The DNS TTL for RDS is 5 seconds, which is quite a bit less than the 60-120 seconds a failover is supposed to take, and both are much less than the 5-minute outage we saw.

I honestly can't be sure DNS is the issue here, but it's the only angle I've got so far. pg doesn't expose a way to take direct control over DNS resolution of the host name, which is where I'm stuck.
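
For illustration, the kind of manual DNS control I have in mind looks something like the sketch below, which just queries the record directly with node's built-in dns module to see the current address and advertised TTL (RDS_HOST is a placeholder for our endpoint). Actually feeding a resolved IP to pg is trickier with TLS in the mix because of certificate host name checks, so treat this as diagnostics only.

import { promises as dns } from 'dns';

// Placeholder for the real RDS instance endpoint.
const RDS_HOST = 'mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com';

async function checkDns(): Promise<void> {
  // A dedicated Resolver bypasses dns.lookup()/getaddrinfo and asks the
  // configured name servers directly, returning the answer's TTL as well.
  const resolver = new dns.Resolver();
  const records = await resolver.resolve4(RDS_HOST, { ttl: true });
  console.log(records); // e.g. [{ address: '10.0.1.23', ttl: 5 }]
}

checkDns().catch(console.error);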

Question

Is this something pg could manage more gracefully or expose more control over? I'd appreciate any advice to cut down the outage window. However, we are using pg via knex, which has its own limitations…

Versions

node: v14.18
pg: v8.7.1

Related

Possibly related (old) issue: https://github.com/brianc/node-postgres/issues/1075


Top GitHub Comments

3 reactions
shousper commented, Jan 25, 2022

Okay, I've done about as much thorough testing as I can. I'm honestly not sure any problem necessarily lies with either pg or knex, but I can say that RDS Proxy didn't solve anything for us. It actually made matters worse.

tl;dr. If you want fast failover, make sure your DNS resolution respects the TTL and use a heartbeat loop to determine if a connection should be discarded.

knex uses prepared statements exclusively, which results in a 100% session-pinning rate via RDS Proxy. We've found some folks hacking on the tarn config along with some event hooks to aggressively close connections (sketched below), but I didn't find that desirable in our environment. Thrashing the knex connection pool while RDS Proxy essentially does the same with every query didn't seem like the right path to me: each connection gets pinned, the query is executed, then the connection is released, and on top of that the overhead of establishing a database connection for each query execution isn't zero. I understand the benefits RDS Proxy is offering, but we need to be able to pool connections in our applications for rapid reuse. (Side note: I'm fairly sure RDS Proxy subscribes to events from the RDS API to know when a failover occurs, which is how it can redirect traffic so quickly.)
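
For reference, the kind of tarn tweaking I mean looks roughly like the sketch below; the numbers are purely illustrative and not a recommendation.

import knex from 'knex';

// Illustrative only: recycle idle connections very aggressively so the pool
// doesn't hold on to stale connections for long after a failover.
const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST },
  pool: {
    min: 0,                   // allow the pool to drain completely when idle
    max: 10,
    idleTimeoutMillis: 1_000, // evict idle connections almost immediately
    reapIntervalMillis: 500,  // check for idle connections frequently
  },
});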

After much digging and many simulated failovers & disconnections, I've concluded that neither knex nor pg is built to handle the network being ripped out from under your feet. I'm no network engineer, but I believe that's exactly what RDS does during a failover: existing connections aren't closed gracefully, they're left dangling. On the plus side, this is a great way to test for such a network event. How do you automatically recover when your services are happily running, but the database (or another dependency) simply stops responding and your connections are left trying to send data into the ether? Should a database client be smart enough to tackle this? I don't know.

My approach to the problem is two-fold: timeouts & probes.

We'd previously neglected to set global query/statement timeouts on the pg client (example below). Setting these at least yields timely errors for your API calls, etc.; however, neither pg nor knex is clever enough to infer that a series of timeout errors might mean the connection is dead and should be discarded. In knex, the connection that encountered the timeout isn't provided with the error, so you cannot close it to have it removed from the pool. The timeout error originates from pg but isn't emitted via its 'error' event; I believe it's thrown & captured by knex, which is where the relationship is lost.
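
For clarity, the timeouts in question are the ones pg accepts on its connection config, which knex passes straight through; the values below are illustrative.

import knex from 'knex';

const db = knex({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    statement_timeout: 10_000,      // server-side: cancel statements after 10s
    query_timeout: 15_000,          // client-side: reject the query promise after 15s
    connectionTimeoutMillis: 5_000, // fail fast if a new connection can't be established
  },
});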

With this in mind, I created a probe or "heartbeat" loop to attach to each pg.Client (generated by knex). It essentially looks like this:

import type * as pg from 'pg';

// `sleep` is a promise-based delay; `withRetry` and `logger` are our own
// retry wrapper and structured logger (any equivalents will do).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function probe(c: pg.Client): Promise<void> {
  let connected = true;
  // Stop probing once the connection ends for any reason.
  c.on('end', () => (connected = false));

  try {
    while (connected) {
      await sleep(5_000);
      // A trivial query that should come back in <10ms on a healthy
      // connection; three rapid failures are treated as a dead connection.
      await withRetry(() => c.query('SELECT 1'), {
        retries: 3,
        minTimeout: 0,
        maxTimeout: 10,
      });
    }
  } catch (error) {
    logger.warn({ error }, 'Probe failed, closing connection');
    // Ending the client marks it dead so the pool can evict it.
    await c.end();
  }
}

The query above should return in <10ms, and more than a few failures indicate a connection issue, so we can end it. A connection ended this way can then be identified and removed from the pool. knex exposes an afterCreate option for its pool (tarn) that can be used to hook into new pg.Client instances like this (sketched below).
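
Wiring that up looks roughly like the sketch below, where probe is the function above and errors from the fire-and-forget call are simply swallowed.

import knex from 'knex';
import type * as pg from 'pg';

const db = knex({
  client: 'pg',
  connection: { host: process.env.DB_HOST },
  pool: {
    // Called by tarn for every raw pg.Client it creates.
    afterCreate: (conn: pg.Client, done: (err: Error | null, conn: pg.Client) => void) => {
      // Fire and forget: the probe ends the client itself when it detects a
      // dead connection, which is what lets the pool evict it.
      probe(conn).catch(() => undefined);
      done(null, conn);
    },
  },
});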

This completely eliminates the need for RDS Proxy for us. We can fail over the database and expect our services to recover automatically in roughly 15-30s (speculation, based on testing) and manage their own pools of connections.

What surprised me about this is that I might have expected PostgreSQL to have a heartbeat built into its client protocol… but apparently not? I'm unaware of a better way to deal with this kind of connection-loss scenario, and the overhead seems negligible given that queries are still serviced quickly and failovers remain quite brief.
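
For completeness: the closest built-in thing I know of is TCP keepalive, which pg can enable via its client config. That's an OS-level mechanism rather than a protocol heartbeat, and with default kernel settings it can take a long time to flag a dead peer, so it complements rather than replaces the probe. A minimal sketch:

import { Client } from 'pg';

// OS-level TCP keepalive on the pg socket (not a protocol-level heartbeat).
const client = new Client({
  host: process.env.DB_HOST,
  keepAlive: true,
  keepAliveInitialDelayMillis: 10_000, // start keepalive probes after 10s of idle
});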

The "database system is shutting down" error is actually harder (rarer?) to encounter than before. I'll be doing further testing across all our services later this week to further validate the approach.

3 reactions
FagnerMartinsBrack commented, Jan 9, 2022

Can you let us know if RDS Proxy worked for you after you try it?

Ok, I did test it. I used RDS Proxy, changing the "DB_HOST" field without any other change, including no change to the Pool code. Result: less than 5s downtime after a reboot with failover.

I'll keep paying AWS for that stability and keep RDS Proxy + Multi-AZ active.

