
Redshift Long Running Query hang indefinitely: Query completes on RedShift


First off, I love this module and use it often for large ETL processes for multiple clients; it is certainly the best DB connection module on npm (and I often have to connect to literally every type of DB in the same project).

After upgrading from v6.4 to v7.9, long running queries began to hang indefinitely despite completing successfully on the Redshift (RS) instance. Can't find a similar issue.

Other queries can still be run against RS (and return results)

  • during long running query
  • and after long running query’s completion on the RS server.

No errors thrown, no evidence of success or failure. Queries work fine if they are shorter.

I’ve tried listening for every error event I can find in the API and using pg.Client rather than pg.Pool, but there is no difference.

I’m out of ideas; code below. Help much appreciated.

const { Pool } = require('pg'); // "pg": "^7.8.2"
const env = require("../../envs").get;
const DEFAULT_RS_AUTH = env("REDSHIFT_DEFAULTS");
const REDSHIFT_USERS = env("REDSHIFT_USERS");
const Pools = {};

function getPgConfig(userId) {
  const user = REDSHIFT_USERS[userId];
  const config = Object.assign({}, DEFAULT_RS_AUTH);
  config.user = user.user;
  config.password = user.password;
  config.idleTimeoutMillis = 0;
  config.keepAlive = true; // this setting isn't in the API docs btw (but makes no difference here)
  return config;
}

function initPool(userId) {
  const config = getPgConfig(userId);
  Pools[userId] = new Pool(config);
  Pools[userId].on('error', (err, client) => { // never fired
    console.error('Unexpected error on idle client', err, userId);
    process.exit(-1);
  });
}

function getPool(userId) {
  if (!Pools[userId]) { // init the pool if it doesn't exist (manage multiple pools)
    initPool(userId);
  }
  return Pools[userId];
}

async function runQueriesList() {
  const queries = ['SQL1', 'SQL2', '....', 'SQL25', '....'];
  for (const sql of queries) {
    // queries 1 through 24 run fine, all complete in less than 3 mins
    // SQL25 runs and completes on Redshift in 3+ mins
    // SQL25 never fails or returns in Node:
    // no errors thrown, no timeout; other queries can be run with the same pool
    const res = await getPool('etl_user').query(sql);
    console.log(res);
  }
}
runQueriesList();

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
boromisp commented, Mar 27, 2019

I’m not sure what changes in the library would have surfaced this issue. Could there have been other changes in your environment (a different OS, for example)?

This sounds like it could be related to the fact that AWS sometimes silently drops TCP connections after 4-5 minutes of inactivity.

To keep the connection alive, TCP keepalive has to be set up with at most 200 seconds of idle time.

You should be able to do this with the currently released version by setting the keepAlive parameter to true and modifying the system defaults.
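As a minimal sketch of this suggestion (the endpoint, credentials, and sysctl value below are all placeholders, not from the issue):

```javascript
// Sketch: pool config with TCP keepalive enabled. All values are
// placeholders; the keepalive probe timing itself comes from the OS,
// e.g. on Linux: sysctl -w net.ipv4.tcp_keepalive_time=200
const config = {
  host: 'my-cluster.example.redshift.amazonaws.com', // placeholder endpoint
  port: 5439,
  user: 'etl_user',
  password: 'secret',
  database: 'analytics',
  keepAlive: true, // enables SO_KEEPALIVE on the underlying socket
};
// const { Pool } = require('pg');
// const pool = new Pool(config);
```

With only `keepAlive: true`, the probe interval is whatever the OS defaults to, which is why the system settings have to be tuned as well.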

PR #1847 adds the keepAliveIdleMillis parameter to enable setting this in the application.

1 reaction
boromisp commented, Mar 28, 2019

@OTooleMichael It seems OK to me.

(Disclaimer: I’m not an expert; these are just my thoughts on the topic.)

Related discussions: pgsql-odbc, pgsql-hackers/1, pgsql-hackers/2. TL;DR: it’s not a solved problem, but future versions of PostgreSQL / libpq might be better at detecting and handling different error conditions.

As I understand it, the current version of PostgreSQL relies on the transport layer being well behaved. If the connection is closed gracefully by either party, both the client and the server will handle it as expected.

Unfortunately, the server could simply disappear or become unresponsive. If setting more aggressive TCP keepalives solves the problem, then most likely a firewall somewhere along the way drops the connection. If you read through the linked discussions, you will find other edge cases, when only some of the components become unresponsive, while others are still working.

After reading this blog post (and later skimming the linked discussions), I don’t think there is a way to reliably detect a “broken” connection without application-level health checks. The (Linux-specific) TCP_USER_TIMEOUT socket option mentioned in the linked blog post cannot be used in Node.js, and it wouldn’t necessarily handle every edge case. Setting aggressive keepalive parameters could help keep the connection alive, but it would not detect broken connections.

Quick fixes, some of which you could implement, depending on your use-case:

Simple queries could be run with an appropriate query_timeout. The query_timeout is an undocumented parameter that can be set in the connection parameters or for individual queries. It applies to the time between sending the query to the server and giving the full response back to the user.
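For example (the statement and timeout value here are illustrative), the per-query form is passed alongside the query text:

```javascript
// Sketch: per-query timeout via pg's undocumented query_timeout option.
const longQuery = {
  text: 'SELECT count(*) FROM big_table', // placeholder statement
  query_timeout: 10 * 60 * 1000,          // reject after 10 minutes
};
// await pool.query(longQuery); // rejects if the full response
//                              // hasn't arrived within the timeout
```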

If you use a persistent connection to wait for notifications, then regularly running SELECT 1 with a short query_timeout on this connection could work.
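A sketch of that heartbeat idea (the helper name, interval, probe timeout, and `onBroken` callback are all assumptions, not from the thread):

```javascript
// Sketch: periodic SELECT 1 probe on a persistent connection.
// query_timeout makes the probe itself fail fast if the connection
// is dead; onBroken decides what to do (destroy the socket, reconnect).
function startHeartbeat(client, onBroken, intervalMs = 60000) {
  const timer = setInterval(() => {
    client
      .query({ text: 'SELECT 1', query_timeout: 5000 })
      .catch((err) => {
        clearInterval(timer); // stop probing a dead connection
        onBroken(err);
      });
  }, intervalMs);
  return () => clearInterval(timer); // call to stop the heartbeat
}
```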

Where it gets more complicated is with legitimately long-running queries. In those cases the connection will have no activity for extended periods, and/or it takes a long time to receive all the rows.

If you expect to receive a lot of rows, you could listen on the Query object’s row event and set up a timeout for it.
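One way this could look (the function name and maxIdleMs are assumptions; `submittedQuery` is assumed to be a pg Query instance, i.e. an EventEmitter emitting 'row' and 'end', and `socket` the client's underlying stream):

```javascript
// Sketch: abort a row stream that has stalled for too long.
function watchRows(submittedQuery, socket, maxIdleMs) {
  const onStall = () =>
    socket.destroy(new Error(`no rows received for ${maxIdleMs} ms`));
  let timer = setTimeout(onStall, maxIdleMs);
  submittedQuery.on('row', () => { // a row arrived: push the deadline out
    clearTimeout(timer);
    timer = setTimeout(onStall, maxIdleMs);
  });
  submittedQuery.on('end', () => clearTimeout(timer)); // finished normally
}
```

Destroying the socket forces the pending promise (or stream) to settle with an error instead of hanging forever.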

And finally, you could use the client.processID and the pg_stat_activity view to regularly check, from a separate connection, whether the query is still running. If the promise hasn’t resolved after the query has stopped running, you can destroy the socket (and by doing so force the promise to resolve).
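A sketch of such a check (the helper name and `monitorPool` are assumptions; `pid` would come from `client.processID`; note that modern PostgreSQL calls the column `pid`, while older servers, including Redshift's PostgreSQL 8.0 base, call it `procpid`):

```javascript
// Sketch: ask a second connection whether the backend is still running.
async function isQueryRunning(monitorPool, pid) {
  const res = await monitorPool.query(
    'SELECT 1 FROM pg_stat_activity WHERE procpid = $1', // `pid` on modern PG
    [pid]
  );
  return res.rowCount > 0;
}
// If the backend is gone but the original promise never settled,
// destroy that client's socket to force it to reject.
```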

As an aside, whenever you abort a query locally because the connection seems broken, it might be a good idea to use pg_terminate_backend to make sure the server also knows about it.

And just a reminder: whenever we use timeouts to abort queries or close connections, there will be a race condition on whether or not the query succeeds.
