
Redshift Long Running Query hang indefinitely: Query completes on RedShift


First off, I love this module and use it often for large ETL processes for multiple clients; it is certainly the best DB connection module on npm (and I often have to connect to literally every type of DB in the same project).

After upgrading from v6.4 to v7.9, long running queries began to hang indefinitely despite completing successfully on the Redshift (RS) instance. Can't find a similar issue.

Other queries can still be run against RS (and return results)

  • during long running query
  • and after long running query’s completion on the RS server.

No errors thrown, no evidence of success or failure. Queries work fine if they are shorter.

I’ve tried listening for every error event I can find in the API and using pg.Client rather than pg.Pool, but there is no difference.

I’m out of ideas; code below. Help much appreciated.

const { Pool } = require('pg'); // "pg": "^7.8.2"
const env = require("../../envs").get;
const DEFAULT_RS_AUTH = env("REDSHIFT_DEFAULTS");
const REDSHIFT_USERS = env("REDSHIFT_USERS");
const Pools = {};

function getPgConfig(userId) {
  const user = REDSHIFT_USERS[userId];
  const config = Object.assign({}, DEFAULT_RS_AUTH);
  config.user = user.user;
  config.password = user.password;
  config.idleTimeoutMillis = 0;
  config.keepAlive = true; // this setting isn't in the API docs btw (but makes no difference here)
  return config;
}

function initPool(userId) {
  const config = getPgConfig(userId);
  Pools[userId] = new Pool(config);
  Pools[userId].on('error', (err, client) => { // never fired
    console.error('Unexpected error on idle client', err, userId);
    process.exit(-1);
  });
}

function getPool(userId) {
  if (!Pools[userId]) { // init the pool if it doesn't exist (manage multiple pools)
    initPool(userId);
  }
  return Pools[userId];
}

async function runQueriesList() {
  const queries = ['SQL1', 'SQL2', '....', 'SQL25', '....'];
  for (const sql of queries) {
    // queries 1 through 24 run fine, all complete in less than 3 mins
    // SQL25 runs and completes on Redshift in 3+ mins
    // SQL25 never fails or returns in Node:
    // no errors thrown, no timeout; other queries can be run with the same pool
    const res = await getPool('etl_user').query(sql);
    console.log(res);
  }
}
runQueriesList();

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
boromisp commented, Mar 27, 2019

I’m not sure what changes in the library would have surfaced this issue. Could there have been other changes in your environment (a different OS, for example)?

This sounds like it could be related to the fact that AWS sometimes silently drops TCP connections after 4-5 minutes of inactivity.

To keep the connection alive, TCP keepalive has to be set up with at most 200 seconds of idle time.

You should be able to do this with the currently released version by setting the keepAlive parameter to true and modifying the system defaults.
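As a minimal sketch of this suggestion (the endpoint, credentials, and sysctl value below are all placeholders, not from the issue):

```javascript
// Sketch: pool config with TCP keepalive enabled. All values are
// placeholders; the keepalive probe timing itself comes from the OS,
// e.g. on Linux: sysctl -w net.ipv4.tcp_keepalive_time=200
const config = {
  host: 'my-cluster.example.redshift.amazonaws.com', // placeholder endpoint
  port: 5439,
  user: 'etl_user',
  password: 'secret',
  database: 'analytics',
  keepAlive: true, // enables SO_KEEPALIVE on the underlying socket
};
// const { Pool } = require('pg');
// const pool = new Pool(config);
```

With only `keepAlive: true`, the probe interval is whatever the OS defaults to, which is why the system settings have to be tuned as well.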

PR #1847 adds the keepAliveIdleMillis parameter to enable setting this in the application.

1 reaction
boromisp commented, Mar 28, 2019

@OTooleMichael It seems OK to me.

(Disclaimer: I’m not an expert; these are just my thoughts on the topic.)

Related discussions: pgsql-odbc, pgsql-hackers/1, pgsql-hackers/2. TL;DR: it’s not a solved problem, but future versions of PostgreSQL / libpq might be better at detecting and handling different error conditions.

As I understand it, the current version of PostgreSQL relies on the transport layer being well behaved. If the connection is closed gracefully by either party, both the client and the server will handle it as expected.

Unfortunately, the server could simply disappear or become unresponsive. If setting more aggressive TCP keepalives solves the problem, then most likely a firewall somewhere along the way drops the connection. If you read through the linked discussions, you will find other edge cases, when only some of the components become unresponsive, while others are still working.

After reading this blog post (and later skimming the linked discussions), I don’t think there is a way to reliably detect a “broken” connection without application-level health checks. The (Linux-specific) TCP_USER_TIMEOUT socket option mentioned in the linked blog post cannot be used in Node.js, and it wouldn’t necessarily handle every edge case. Setting aggressive keepalive parameters could help keep the connection alive, but it would not detect broken connections.

Quick fixes, some of which you could implement, depending on your use-case:

Simple queries could be run with an appropriate query_timeout. The query_timeout is an undocumented parameter that can be set in the connection parameters or for individual queries. It applies to the time between sending the query to the server and giving the full response back to the user.
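For example (the statement and timeout value here are illustrative), the per-query form is passed alongside the query text:

```javascript
// Sketch: per-query timeout via pg's undocumented query_timeout option.
const longQuery = {
  text: 'SELECT count(*) FROM big_table', // placeholder statement
  query_timeout: 10 * 60 * 1000,          // reject after 10 minutes
};
// await pool.query(longQuery); // rejects if the full response
//                              // hasn't arrived within the timeout
```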

If you use a persistent connection to wait for notifications, then regularly running SELECT 1 with a short query_timeout on this connection could work.
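A sketch of that heartbeat idea (the helper name, interval, probe timeout, and `onBroken` callback are all assumptions, not from the thread):

```javascript
// Sketch: periodic SELECT 1 probe on a persistent connection.
// query_timeout makes the probe itself fail fast if the connection
// is dead; onBroken decides what to do (destroy the socket, reconnect).
function startHeartbeat(client, onBroken, intervalMs = 60000) {
  const timer = setInterval(() => {
    client
      .query({ text: 'SELECT 1', query_timeout: 5000 })
      .catch((err) => {
        clearInterval(timer); // stop probing a dead connection
        onBroken(err);
      });
  }, intervalMs);
  return () => clearInterval(timer); // call to stop the heartbeat
}
```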

Where it gets more complicated is with legitimately long-running queries. In those cases the connection will have no activity for extended periods, and/or it takes a long time to receive all the rows.

If you expect to receive a lot of rows, you could listen on the Query object’s row event and set up a timeout for it.
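One way this could look (the function name and maxIdleMs are assumptions; `submittedQuery` is assumed to be a pg Query instance, i.e. an EventEmitter emitting 'row' and 'end', and `socket` the client's underlying stream):

```javascript
// Sketch: abort a row stream that has stalled for too long.
function watchRows(submittedQuery, socket, maxIdleMs) {
  const onStall = () =>
    socket.destroy(new Error(`no rows received for ${maxIdleMs} ms`));
  let timer = setTimeout(onStall, maxIdleMs);
  submittedQuery.on('row', () => { // a row arrived: push the deadline out
    clearTimeout(timer);
    timer = setTimeout(onStall, maxIdleMs);
  });
  submittedQuery.on('end', () => clearTimeout(timer)); // finished normally
}
```

Destroying the socket forces the pending promise (or stream) to settle with an error instead of hanging forever.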

And finally, you could use the client.processID and the pg_stat_activity view to regularly check, from a separate connection, whether the query is still running. If the promise hasn’t resolved after the query has stopped running, you can destroy the socket (and by doing so force the promise to resolve).
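A sketch of such a check (the helper name and `monitorPool` are assumptions; `pid` would come from `client.processID`; note that modern PostgreSQL calls the column `pid`, while older servers, including Redshift's PostgreSQL 8.0 base, call it `procpid`):

```javascript
// Sketch: ask a second connection whether the backend is still running.
async function isQueryRunning(monitorPool, pid) {
  const res = await monitorPool.query(
    'SELECT 1 FROM pg_stat_activity WHERE procpid = $1', // `pid` on modern PG
    [pid]
  );
  return res.rowCount > 0;
}
// If the backend is gone but the original promise never settled,
// destroy that client's socket to force it to reject.
```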

As an aside, whenever you abort a query locally because the connection seems broken, it might be a good idea to use pg_terminate_backend to make sure the server also knows about it.

And just a reminder: whenever we use timeouts to abort queries or close connections, there will be a race condition on whether or not the query succeeds.
