
Handling connection timeouts / dropped network / db failover


Hello,

I’m working on setting up a health check within our system and have noticed that our Postgres implementation doesn’t seem to handle connection timeouts, dropped network connections, or a DB failover event very gracefully (at least as failover works in a Multi-AZ deployment on Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back: https://github.com/mysqljs/mysql/issues/821.

We effectively have a class that looks similar to this:

const Promise = require('bluebird');
const _ = require('lodash');
const pg = require('pg');

// Service and Errors are application-specific modules (not shown here).
class PostgresSQLInterface extends Service {
    constructor(dbConnectionOptions, logger) {
        super(); // a derived class must call super() before using `this`

        const defaults = {
            host: 'localhost',
            port: '5432',

            user: '',
            password: null,

            max: 10, // pool size
            min: 10,

            keepAlive: true,
            idleTimeoutMillis: 10000
        };

        this.state = {
            config: _.merge({}, defaults, dbConnectionOptions)
        };

        this.createDatabasePool();
    }

    createDatabasePool() {
        this.state.pool = new pg.Pool(this.state.config);

        // Fired each time the pool opens a new client connection.
        this.state.pool.on('connect', () => {
            this.state.healthy = true;
            console.log('HEALTHY');
        });

        // Fired when an idle client in the pool encounters an error.
        this.state.pool.on('error', (error) => {
            this.state.healthy = false;
            console.log('UNHEALTHY', error.message);
        });
    }

    // Generator intended to be driven by a coroutine runner such as
    // bluebird's Promise.coroutine (Node 6 has no async/await).
    *getDatabaseConnection() {
        try {
            const connection = yield this.state.pool.connect();

            return connection;
        }
        catch (error) {
            throw new Errors.DatabaseError('Could not connect to Postgres', {
                originalError: error
            });
        }
    }
}

In a normal environment, when our application connects, I see it print “HEALTHY” once for each connection it makes, as expected. However, there are a few issues:

1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised, even though the pg-pool documentation states that an error is “Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients.” I would expect to see a string of “UNHEALTHY” messages.

2.) Similarly, if I initiate a failover in RDS, no error events are raised, but I am unable to run queries. This may be due to how AWS handles DNS entries for Multi-AZ deployments, but I cannot test that while issue #1 remains unresolved.

3.) Calling getDatabaseConnection when Postgres sends no replies (simulated with sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP, and sudo iptables -F afterwards to restore) hangs while acquiring a connection, because there is no timeout setting.
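For the hang in item 3, one possible workaround on the versions listed here is to race the pool checkout against a timer. This is a minimal sketch, assuming a hypothetical helper named connectWithTimeout (not part of pg) and a pool whose connect() returns a promise for a client with a release() method, which is the shape pg.Pool#connect() resolves to. Newer pg-pool releases also document a connectionTimeoutMillis pool option, which is worth checking before hand-rolling this.

```javascript
'use strict';

// Hypothetical helper (not part of pg): acquire a client from `pool`, but give
// up after `ms` milliseconds instead of waiting forever.
function connectWithTimeout(pool, ms) {
  return new Promise((resolve, reject) => {
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      reject(new Error(`Timed out acquiring a connection after ${ms} ms`));
    }, ms);

    pool.connect().then(
      (client) => {
        clearTimeout(timer);
        if (timedOut) {
          // The caller already gave up; hand the late client back to the
          // pool so it is not leaked.
          client.release();
        } else {
          resolve(client);
        }
      },
      (error) => {
        clearTimeout(timer);
        if (!timedOut) reject(error);
      }
    );
  });
}

module.exports = { connectWithTimeout };
```

Inside getDatabaseConnection, the yield would then become yield connectWithTimeout(this.state.pool, 5000), with the timeout value chosen to suit the application.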

Am I missing a setting in my configuration, or is this a situation that simply hasn’t come up for anyone else yet? If it’s the latter, I’d be more than happy to create a fix for the issue, but I want to make sure I’m not overlooking something first.

Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x

Thanks!

Issue Analytics

Created 7 years ago
Comments: 33 (14 by maintainers)

Top GitHub Comments

brianc commented, Jun 9, 2017 (10 reactions)

I think I’ve found the fix for this. When the backend server disconnected or the socket closed unexpectedly, the client would emit an end event but not an error event. This resulted in silent disconnections, pools full of disconnected clients, inability to handle DB failover, etc. I’ve put a patch out in pg@6.2.4; please check it out and let me know if it fixes your issue!
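The change described in this comment can be illustrated with a small sketch. It uses a hypothetical wrapper class, GuardedConnection, which is not a pg API and not the code from the actual patch. It wraps any EventEmitter-style connection that has an end() method, and re-emits an 'end' that arrives without end() having been called first as an 'error', so listeners that only watch for errors (such as a pool's idle-client handler) still notice the disconnect.

```javascript
'use strict';

const EventEmitter = require('events');

// Hypothetical illustration, not the actual pg patch: promote an unexpected
// 'end' from the wrapped connection into an 'error'.
class GuardedConnection extends EventEmitter {
  constructor(connection) {
    super();
    this.connection = connection;
    this.ending = false; // true once we close the connection on purpose

    connection.on('end', () => {
      if (!this.ending) {
        // Nobody asked for this disconnect: surface it as an error.
        this.emit('error', new Error('Connection terminated unexpectedly'));
      }
      this.emit('end');
    });
  }

  // Intentional shutdown: the 'end' that follows is expected, so no error.
  end() {
    this.ending = true;
    this.connection.end();
  }
}

module.exports = { GuardedConnection };
```

As with any EventEmitter, emitting 'error' with no listener attached throws, so callers of a wrapper like this must register an 'error' handler.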

je-al commented, Oct 7, 2019 (5 reactions)

Hey, sorry for reviving such an old issue, but from my own testing of recovery from Multi-AZ RDS failover scenarios (which I’m simulating by issuing a docker pause on a DB instance and then updating a DNS record to point to another one), I don’t think this is actually solved. I believe what @jessedpate claims here is right (I’m currently trying to get confirmation from AWS), which would mean we actually need a way to hook in a timeout at the socket level.

Our current workaround amounts to:

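        // Idle timeout on the underlying socket (20 s without traffic).
        stream.setTimeout(20000);
        stream.on('timeout', () => {
          // Node only emits 'timeout'; the socket must be closed explicitly.
          stream.end();
          // Signal the disconnect so the client/pool stops using this connection.
          connection.emit('end');
        });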

We added this to lib/connection.js.

This is already possible with the current implementation, since the library allows passing an existing socket object in the Client configuration. However, some libraries (knex) don’t seem to expose hook points for customizing how the connection is created (at least from what I could see), so we had to jump through some hoops to achieve the same behaviour (afterCreate in tarn’s config).
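Building the socket outside the library, as this comment suggests, could look roughly like the sketch below. It assumes a hypothetical factory, createTimedSocket (not part of pg), built on Node's net.Socket#setTimeout. Handing the socket to pg through a Client stream option is an assumption to verify against your pg version, so that step appears only in a comment. Unlike the lib/connection.js workaround, the sketch calls destroy() with an error rather than end(). The reason is that end() only half-closes the socket and may leave it waiting on a peer that never answers, which is the docker pause case.

```javascript
'use strict';

const net = require('net');

// Hypothetical factory (not part of pg): an unconnected TCP socket that
// destroys itself with an error after `idleMs` milliseconds without activity.
// Node emits 'timeout' but leaves the socket open, so we close it ourselves.
function createTimedSocket(idleMs) {
  const socket = new net.Socket();
  socket.setTimeout(idleMs);
  socket.on('timeout', () => {
    // destroy(err) emits 'error' and then 'close', so whoever owns the
    // socket learns about the dead connection instead of hanging.
    socket.destroy(new Error(`Socket idle for ${idleMs} ms`));
  });
  return socket;
}

// Possible use with pg (assumption: your pg version's Client accepts a
// `stream` option; each Client needs its own fresh socket):
//   const client = new pg.Client(
//     Object.assign({}, config, { stream: createTimedSocket(20000) }));

module.exports = { createTimedSocket };
```

Because the timer measures inactivity, it also fires on a healthy connection that is simply quiet for idleMs, such as an idle pooled client or a long-running query that sends nothing back for a while. The value therefore needs to sit comfortably above those durations.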

All in all, I think it’d be worthwhile to make this socket timeout configurable, since, as mentioned above, other drivers already offer it.