Redshift long-running query hangs indefinitely: query completes on Redshift
See original GitHub issue

First off, I love the module and use it often for large ETL processes for multiple clients; it's certainly the best DB connection module on npm (and I often have to connect to literally every type of DB in the same project).
After upgrading from v6.4 to v7.9, long-running queries began to hang indefinitely despite completing successfully on the Redshift (RS) instance. I can't find a similar issue.
Other queries can still be run against RS (and return results):
- during the long-running query
- and after the long-running query's completion on the RS server.
No errors are thrown and there is no evidence of success or failure. Queries work fine if they are shorter.
I've tried listening to every error event I can find in the API and using pg.Client rather than pg.Pool, but there is no difference.
I'm out of ideas; code is below. Help much appreciated.
```js
const { Pool } = require('pg'); // "pg": "^7.8.2"
const env = require("../../envs").get;

const DEFAULT_RS_AUTH = env("REDSHIFT_DEFAULTS");
const REDSHIFT_USERS = env("REDSHIFT_USERS");
const Pools = {};

function getPgConfig(userId) {
  const user = REDSHIFT_USERS[userId];
  const config = Object.assign({}, DEFAULT_RS_AUTH);
  config.user = user.user;
  config.password = user.password;
  config.idleTimeoutMillis = 0;
  config.keepAlive = true; // this setting isn't in the API docs btw (but makes no difference here)
  return config;
}

function initPool(userId) {
  const config = getPgConfig(userId);
  Pools[userId] = new Pool(config);
  Pools[userId].on('error', (err, client) => { // never thrown
    console.error('Unexpected error on idle client', err, userId);
    process.exit(-1);
  });
}

function getPool(userId) {
  if (!Pools[userId]) { // init the pool if it doesn't exist (manage multiple pools)
    initPool(userId);
  }
  return Pools[userId];
}

async function runQueriesList() {
  const queries = ['SQL1', 'SQL2', '....', 'SQL25', '....'];
  for (let sql of queries) {
    // queries 1 through 24 run fine, all complete in less than 3 mins
    // SQL25 runs and completes on Redshift in 3+ mins
    // SQL25 never fails or returns in Node:
    // no errors thrown, no timeout; other queries can be run with the same pool
    let res = await getPool('etl_user').query(sql);
    console.log(res);
  }
}

runQueriesList();
```
I'm not sure what changes in the library would have surfaced this issue. Could it be that there were other changes (a different OS) in your environment?
This sounds to me like it could be related to the fact that something in AWS silently drops TCP connections after 4-5 minutes of inactivity. To keep the connection alive, TCP keepalive has to be set up with at most 200 seconds of idle timeout.

You should be able to do this with the currently released version by setting the `keepAlive` parameter to `true` and modifying the system defaults. The #1847 PR adds the `keepAliveIdleMillis` parameter to enable setting this in the application.
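For illustration, enabling this on a pool might look like the sketch below; `keepAliveIdleMillis` is the option described in #1847 and may not exist under that name in the version you have installed, and the connection details are placeholders.

```js
const { Pool } = require('pg');

// Sketch only: keepAlive turns on TCP keepalive for the underlying socket;
// the idle delay assumes the option from PR #1847 is available in your pg version.
const pool = new Pool({
  host: 'my-cluster.example.redshift.amazonaws.com', // placeholder
  port: 5439,
  user: 'etl_user',
  password: process.env.RS_PASSWORD,
  database: 'analytics',
  keepAlive: true,
  keepAliveIdleMillis: 180 * 1000, // under the ~200s window mentioned above
});
```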
@OTooleMichael It seems OK to me.

(Disclaimer: I'm not an expert; these are just my thoughts on the topic.)
Related discussions: pgsql-odbc, pgsql-hackers/1, pgsql-hackers/2. TL;DR: it's not a solved problem, but future versions of PostgreSQL / libpq might be better at detecting and handling different error conditions.
As I understand it, the current version of PostgreSQL relies on the transport layer being well behaved. If the connection is closed gracefully by either party, both the client and the server will handle it as expected.
Unfortunately, the server could simply disappear or become unresponsive. If setting more aggressive TCP keepalives solves the problem, then most likely a firewall somewhere along the way drops the connection. If you read through the linked discussions, you will find other edge cases where only some of the components become unresponsive while others are still working.
After reading this blog post (and later skimming the linked discussions), I don't think there is a way to detect a "broken" connection reliably without application-level health checks. The (Linux-specific) `TCP_USER_TIMEOUT` socket option mentioned in the linked blog post cannot be used in Node.js, and it wouldn't necessarily handle every edge case. Setting aggressive keepalive parameters could help keep the connection alive, but it would not detect broken connections.

Quick fixes, some of which you could implement, depending on your use case:
Simple queries could be run with an appropriate `query_timeout`. The `query_timeout` is an undocumented parameter that can be set in the connection parameters or for individual queries. It applies to the time between sending the query to the server and giving the full response back to the user.
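A sketch of both forms, assuming nothing beyond what is described above (timeout values are arbitrary):

```js
const { Pool } = require('pg');

// Connection-level timeout: every query on this pool is abandoned client-side
// if a complete response hasn't arrived within 10 minutes.
const pool = new Pool({
  // ...your usual Redshift connection settings...
  query_timeout: 10 * 60 * 1000, // milliseconds
});

async function example() {
  // Per-query override, assuming the config-object form of query() accepts the same key.
  const res = await pool.query({
    text: 'SELECT count(*) FROM some_table', // placeholder SQL
    query_timeout: 30 * 1000,
  });
  console.log(res.rows[0]);
}
```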
If you use a persistent connection to wait for notifications, then regularly running `SELECT 1` with a short `query_timeout` on this connection could work.
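A minimal heartbeat sketch along those lines (interval and timeout values are arbitrary, and reconnection is left out):

```js
// Assumes `client` is a long-lived pg.Client used for LISTEN/NOTIFY.
// If the connection has silently died, the probe rejects after 5 seconds
// instead of the application waiting forever.
const heartbeat = setInterval(async () => {
  try {
    await client.query({ text: 'SELECT 1', query_timeout: 5000 });
  } catch (err) {
    console.error('Connection appears broken, recycling client', err);
    clearInterval(heartbeat);
    // ...tear down the client and reconnect here...
  }
}, 60 * 1000);
```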
Where it gets more complicated is legitimately long-running queries. In those cases the connection will have no activity for extended periods, and/or it takes a long time to receive all the rows.
If you expect to receive a lot of rows, you could listen on, and set up a timeout for, the `Query` object's `row` event.
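A rough inactivity watchdog around the lower-level `Query` interface might look like this; the 60-second window is arbitrary, and destroying the underlying stream relies on node-postgres internals:

```js
const { Query } = require('pg');

// Resets a timer on every row; if no row arrives for `idleMs`, assume the
// connection is broken and destroy it so the pending query fails locally.
function queryWithRowTimeout(client, text, idleMs = 60 * 1000) {
  return new Promise((resolve, reject) => {
    const rows = [];
    let timer;

    const bail = () => {
      client.connection.stream.destroy(); // internal API: forces the pending query to error
      reject(new Error(`No rows received for ${idleMs}ms`));
    };
    const resetTimer = () => {
      clearTimeout(timer);
      timer = setTimeout(bail, idleMs);
    };

    const query = client.query(new Query(text));
    resetTimer();
    query.on('row', (row) => {
      rows.push(row);
      resetTimer();
    });
    query.on('end', () => {
      clearTimeout(timer);
      resolve(rows);
    });
    query.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```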
event.And finally, you could use the
client.processID
and thepg_stat_activity
view to regularly check in a separate connection, if the query is still running. If the promise hasn’t resolved after the query stopped running, you can destroy the socket (and by doing so, force to resolve the promise).As an aside, whenever you abort query locally because the connection seems broken, it might be a good idea, to use
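A sketch of that approach, assuming a second connection with the same credentials; Redshift's system tables differ from stock PostgreSQL, so the `pg_stat_activity` query is a placeholder to adapt:

```js
// `workClient` runs the long query; `monitorClient` is a separate pg.Client.
async function isStillRunning(monitorClient, pid) {
  // Placeholder: view and column names vary between Redshift and PostgreSQL versions.
  const { rows } = await monitorClient.query(
    "SELECT 1 FROM pg_stat_activity WHERE pid = $1 AND state = 'active'",
    [pid]
  );
  return rows.length > 0;
}

async function watchQuery(workClient, monitorClient, queryPromise, intervalMs = 30 * 1000) {
  const pid = workClient.processID; // backend PID of the connection running the query
  const timer = setInterval(async () => {
    if (!(await isStillRunning(monitorClient, pid))) {
      // The server is done with the query but the promise never settled:
      // destroy the socket so the pending query rejects locally (internal API).
      workClient.connection.stream.destroy();
      clearInterval(timer);
    }
  }, intervalMs);

  try {
    return await queryPromise;
  } finally {
    clearInterval(timer);
  }
}
```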
As an aside, whenever you abort a query locally because the connection seems broken, it might be a good idea to use `pg_terminate_backend` to make sure the server also knows about it.
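From the monitoring connection that could be as simple as the following sketch (verify that your Redshift version supports `pg_terminate_backend`):

```js
// Ask the server to kill the backend we just gave up on locally.
async function terminateBackend(monitorClient, pid) {
  await monitorClient.query('SELECT pg_terminate_backend($1)', [pid]);
}
```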
And just a reminder: whenever we use timeouts to abort queries or close connections, there will be a race condition on whether or not the query succeeds.