AWS RDS failover triggers "the database system is shutting down" flood
Hi,
Thanks for this awesomely maintained & documented library!
Problem
We're using AWS RDS for PostgreSQL with Multi-AZ and found the failover process taking a little too long for our liking (~5 minutes). During that time, we saw a flood of these errors from our applications:
{
"code": "57P03",
"file": "postmaster.c",
"length": 106,
"line": "2344",
"message": "the database system is shutting down",
"name": "error",
"routine": "ProcessStartupPacket",
"severity": "FATAL",
"stack": "error: the database system is shutting down\n at Parser.parseErrorMessage (/app/node_modules/pg-protocol/dist/parser.js:287:98)\n at Parser.handlePacket (/app/node_modules/pg-protocol/dist/parser.js:126:29)\n at Parser.parse (/app/node_modules/pg-protocol/dist/parser.js:39:38)\n at TLSSocket.<anonymous> (/app/node_modules/pg-protocol/dist/index.js:11:42)\n at TLSSocket.emit (events.js:400:28)\n at addChunk (internal/streams/readable.js:290:12)\n at readableAddChunk (internal/streams/readable.js:265:9)\n at TLSSocket.Readable.push (internal/streams/readable.js:204:10)\n at TLSWrap.onStreamRead (internal/stream_base_commons.js:188:23)",
"type": "DatabaseError"
}
Digging
From what I understand, the only real option is to force DNS to refresh ASAP. I found that pg is, in some part, performing DNS lookups itself? I'm not sure whether that's the case just for the native component or for all connections, though.
I've also come to learn Node has its own DNS troubles. The DNS TTL for RDS is 5 seconds, which is quite a bit less than the 60-120 seconds a failover is supposed to take, and both are much less than the 5-minute outage we saw.
I can't honestly be sure DNS is the issue here, but it's the only angle I've got so far. pg doesn't expose a way to have direct control over DNS resolution of the host name, which is where I'm at.
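One way to sanity-check the DNS angle is to poll the endpoint during a failover and watch when the answer flips. A sketch (the host below is a placeholder): dns.lookup goes through the OS resolver, which is what Node sockets, and therefore pg, use by default, while dns.resolve4 asks the configured nameservers directly and so should track the ~5s RDS TTL more closely.

// dns-watch.js (sketch) - poll the RDS endpoint during a failover to see
// when each resolution path starts returning the standby's address.
const dns = require('dns');

const HOST = 'mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com'; // placeholder endpoint

setInterval(() => {
  // getaddrinfo / OS resolver path (what net/tls sockets use by default)
  dns.lookup(HOST, (err, address) => {
    console.log(new Date().toISOString(), 'lookup  ', err ? err.code : address);
  });

  // direct query to the nameservers
  dns.resolve4(HOST, (err, addresses) => {
    console.log(new Date().toISOString(), 'resolve4', err ? err.code : addresses[0]);
  });
}, 1000);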
Question
Is this something pg could manage more gracefully or expose more control over? I'd appreciate any advice to cut down the outage window. However, we are using pg via knex, which has its own limitations…
Versions
node: v14.18
pg: v8.7.1
Related
Possibly related (old) issue: https://github.com/brianc/node-postgres/issues/1075
Top GitHub Comments
Okay, I've done about as much thorough testing as I can. I'm honestly not sure any problem necessarily lies with either pg or knex, but I can say that RDS Proxy didn't solve anything for us. It actually made matters worse. knex uses 100% prepared statements, which triggers a 100% session-pinning rate via RDS Proxy. We've found some folks hacking on the tarn config along with some event hooks to aggressively close connections; I didn't find this desirable in our environment. Thrashing the knex connection pool while having RDS Proxy essentially do the same with every query didn't seem like the right path to me. Each connection gets pinned, the query is executed, then the connection is released, and on top of that, establishing a database connection for each query execution isn't free. I understand the benefits of what RDS Proxy is offering, but we need to be able to pool connections in our applications for rapid reuse. (Side note: I'm fairly sure RDS Proxy subscribes to events from the RDS API to know when a failover occurs, which is how it can redirect traffic in such a short period of time.)

After much digging and many simulated failovers & disconnections, I've concluded that neither knex nor pg is built to handle the network being ripped out from under your feet. I'm no network engineer, but I believe that's exactly what RDS does during a failover: existing connections aren't closed gracefully, they're left dangling. On the plus side, this is a great way to test for such a network event. How does one automatically recover when your services are happily running, but the database (or another dependency) simply stops responding and your connections are left trying to send data into the ether? Should a database client be smart enough to tackle this? I don't know.

My approach to the problem is twofold: timeouts & probes.
We'd previously neglected to set global query/statement timeouts on the pg client. Setting these at least yields timely errors for your API calls, etc.; however, neither pg nor knex is clever enough to infer that a series of timeout errors might mean the connection is dead and should be discarded. In knex, the connection that encountered the timeout isn't provided alongside the error, so you cannot close it to have it removed from the pool. The timeout error originates from pg but isn't emitted via its 'error' event; I believe it's thrown & captured by knex, which is where the relationship is lost.
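For reference, those global timeouts live on the connection config that knex passes straight through to the pg client. A sketch (the values and env var names here are just illustrative):

const knex = require('knex')({
  client: 'pg',
  connection: {
    host: process.env.DB_HOST,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    database: process.env.DB_NAME,
    // pg options, passed through by knex:
    statement_timeout: 10000,        // server-side cap per statement (ms)
    query_timeout: 15000,            // client-side cap per query call (ms)
    connectionTimeoutMillis: 5000,   // fail fast if a new connection can't be established
  },
  pool: { min: 2, max: 10 },
});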
With this in mind, I created a probe or “heartbeat” loop to attach to each pg.Client (generated by knex). It essentially looks like this:
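(Sketch only; the SELECT 1 probe, the interval and the failure threshold are illustrative choices.)

const PROBE_INTERVAL_MS = 5000; // illustrative
const MAX_FAILURES = 3;         // illustrative

function attachHeartbeat(client) {
  let failures = 0;

  const timer = setInterval(() => {
    // A trivial probe; it should return in <10ms on a healthy connection.
    // Relies on the client-level query_timeout (set above) so a probe
    // against a dead connection rejects instead of hanging forever.
    client.query('SELECT 1')
      .then(() => {
        failures = 0;
      })
      .catch(() => {
        failures += 1;
        if (failures >= MAX_FAILURES) {
          clearInterval(timer);
          // Force the connection closed; the pool can then spot the dead
          // client and remove it.
          client.end().catch(() => {});
        }
      });
  }, PROBE_INTERVAL_MS);

  // Stop probing once the connection ends for any reason.
  client.on('end', () => clearInterval(timer));
}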
The query above should be able to return in <10ms, and more than a few failures would indicate a connection issue, so we can end it. Ending the connection in this way can then be identified, and the client removed from the pool. knex exposes an afterCreate option for its pool (tarn) that can be used to manipulate new pg.Client instances like this:
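(Again a sketch; attachHeartbeat is the helper from the snippet above.)

const knex = require('knex')({
  client: 'pg',
  connection: { /* host, credentials & timeouts as above */ },
  pool: {
    min: 2,
    max: 10,
    // tarn calls afterCreate for every new raw pg.Client it creates.
    afterCreate: (client, done) => {
      attachHeartbeat(client); // probe loop from the sketch above
      done(null, client);      // hand the connection back to the pool
    },
  },
});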
This completely eliminates the need for RDS Proxy for us. We can fail over the database and expect our services to automatically recover in roughly 15-30s (speculation, based on testing) and manage their own pools of connections.
What surprised me about this is that I might've imagined PostgreSQL to have a heartbeat built into its client protocol… but apparently not? I'm unaware of a better way to deal with similar loss-of-connection scenarios, and the overhead seems negligible if queries can be serviced faster and failovers remain quite brief.
The “database system is shutting down” error is actually harder (rarer?) to encounter than before. I'll be doing further testing across all our services later this week to further validate the approach.
Ok, I did test it. I used RDS Proxy, changing the DB_HOST field without any other change, including no change to the Pool code. Result: less than 5s of downtime after a reboot with failover. I'll keep paying AWS for that stability and keep RDS Proxy + Multi-AZ active.