
Replica stuck in a loop after PostgreSQL upgrade


Hello,

We upgraded our Patroni-managed PostgreSQL cluster from v9.6 to v11.7. The upgrade itself went fine and the leader node came up, but the replica node is in trouble: it starts streaming from the primary, then shuts down on request (from Patroni?), and the cycle repeats.
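
One quick way to see the cycle from outside the logs is to poll Patroni's REST API on the replica between restarts. A minimal sketch, assuming Patroni's default REST API port 8008 and the Python requests library; the host name is a placeholder, not a value from this report:

# Poll the replica's Patroni REST API and print its reported state.
# Port 8008 is Patroni's default REST API port; the host is hypothetical.
import time
import requests

REPLICA_API = "http://abc-02.com:8008/patroni"  # placeholder host

for _ in range(10):
    try:
        info = requests.get(REPLICA_API, timeout=2).json()
        print(info.get("state"), info.get("role"), info.get("timeline"))
    except requests.RequestException as exc:
        print("replica API unreachable:", exc)
    time.sleep(5)

If the node is flapping the way the logs below suggest, the reported state should change across successive polls instead of staying at "running".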

Postgres logs from replica:

< 2020-03-24 08:07:59.421 GMT > LOG:  database system was shut down in recovery at 2020-03-24 08:07:59 GMT
< 2020-03-24 08:07:59.421 GMT > LOG:  entering standby mode
< 2020-03-24 08:07:59.421 GMT > FATAL:  the database system is starting up
< 2020-03-24 08:07:59.423 GMT > LOG:  redo starts at 3B/CF19A0C0
< 2020-03-24 08:07:59.424 GMT > LOG:  consistent recovery state reached at 3B/CF1CC180
< 2020-03-24 08:07:59.424 GMT > LOG:  invalid record length at 3B/CF1CC180: wanted 24, got 0
< 2020-03-24 08:07:59.424 GMT > LOG:  database system is ready to accept read only connections
< 2020-03-24 08:07:59.445 GMT > LOG:  started streaming WAL from primary at 3B/CF000000 on timeline 6
< 2020-03-24 08:08:09.556 GMT > LOG:  received fast shutdown request
< 2020-03-24 08:08:09.557 GMT > LOG:  aborting any active transactions
< 2020-03-24 08:08:09.557 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:09.557 GMT > FATAL:  terminating walreceiver process due to administrator command
< 2020-03-24 08:08:09.558 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:09.559 GMT > LOG:  shutting down
< 2020-03-24 08:08:09.565 GMT > LOG:  database system is shut down
< 2020-03-24 08:08:09.879 GMT > LOG:  database system was shut down in recovery at 2020-03-24 08:08:09 GMT
< 2020-03-24 08:08:09.879 GMT > LOG:  entering standby mode
< 2020-03-24 08:08:09.879 GMT > FATAL:  the database system is starting up
< 2020-03-24 08:08:09.881 GMT > LOG:  redo starts at 3B/CF19A0C0
< 2020-03-24 08:08:09.882 GMT > LOG:  consistent recovery state reached at 3B/CF1CC2C8
< 2020-03-24 08:08:09.882 GMT > LOG:  invalid record length at 3B/CF1CC2C8: wanted 24, got 0
< 2020-03-24 08:08:09.882 GMT > LOG:  database system is ready to accept read only connections
< 2020-03-24 08:08:09.902 GMT > LOG:  started streaming WAL from primary at 3B/CF000000 on timeline 6
< 2020-03-24 08:08:20.007 GMT > LOG:  received fast shutdown request
< 2020-03-24 08:08:20.008 GMT > LOG:  aborting any active transactions
< 2020-03-24 08:08:20.008 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:20.009 GMT > FATAL:  terminating walreceiver process due to administrator command
< 2020-03-24 08:08:20.009 GMT > LOG:  shutting down
< 2020-03-24 08:08:20.015 GMT > LOG:  database system is shut down

Corresponding Postgres logs from the primary (leader):

< 2020-03-24 08:07:59.465 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:09.921 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:20.362 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:30.820 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:41.277 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1

This is what I see in the Patroni logs on the replica node (timestamps in EST):

Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,109 INFO: closed patroni connection to the postgresql cluster
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.401 GMT > LOG:  listening on IPv4 address "10.10.10.2", port 59673
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,404 INFO: postmaster pid=90955
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.406 GMT > LOG:  listening on Unix socket "/var/tmp/mdev/.s.PGSQL.59673"
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.418 GMT > LOG:  redirecting log output to logging collector process
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.418 GMT > HINT:  Future log output will appear in directory "pg_log".
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - rejecting connections
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - accepting connections
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,453 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,454 INFO: does not have lock
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,454 INFO: establishing a new patroni connection to the postgres cluster
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,471 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:00 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:00,213: INFO  no handler for on_restart, replica, mdev
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,440 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,440 INFO: does not have lock
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,448 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,570 INFO: closed patroni connection to the postgresql cluster
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.858 GMT > LOG:  listening on IPv4 address "10.10.10.2", port 59673
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,861 INFO: postmaster pid=91053
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.863 GMT > LOG:  listening on Unix socket "/var/tmp/mdev/.s.PGSQL.59673"
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.876 GMT > LOG:  redirecting log output to logging collector process
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.876 GMT > HINT:  Future log output will appear in directory "pg_log".
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - rejecting connections
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - accepting connections
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,906 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,907 INFO: does not have lock
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,907 INFO: establishing a new patroni connection to the postgres cluster
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,927 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:10 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:10,664: INFO  no handler for on_restart, replica, mdev
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,896 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,896 INFO: does not have lock
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,900 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:20 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:20,020 INFO: closed patroni connection to the postgresql cluster

It's not clear why Patroni keeps closing the connection and restarting Postgres in a loop.

FWIW: I have tried removing the data_dir so that the replica re-bootstraps. The bootstrap succeeds, but the same restart cycle continues.
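
For what it's worth, the primary's logs above suggest the walreceiver reconnects cleanly on every cycle. One way to confirm that from the primary's side is to poll pg_stat_replication; a minimal sketch with psycopg2, where the host, port, and user are placeholders, not values from this report:

# Check the primary's view of its standbys. Connection parameters are
# placeholders; adjust to your cluster.
import psycopg2

conn = psycopg2.connect(host="abc-01.com", port=5432,
                        dbname="postgres", user="postgres")
with conn.cursor() as cur:
    # replay_lsn and sync_state are pg_stat_replication columns on PG 11
    cur.execute("""
        SELECT application_name, state, sync_state,
               pg_current_wal_lsn() - replay_lsn AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for row in cur.fetchall():
        print(row)
conn.close()

A standby that shows up here as streaming and in sync, then disappears and reappears, matches what the logs above show: the connection is established fine and is then torn down from the standby's own side.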

Patroni version: 1.6.3

We never saw this issue while we were on Postgres 9.6.

Any help/leads are much appreciated. Thank you!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
RajKiranS commented on Mar 27, 2020

But what is interesting is that we never faced this issue before the upgrade (Postgres version 9.6).

Perhaps this commit answers the above question: https://github.com/zalando/patroni/commit/85341ff78b9c02f59b4d490e3bf300c352c68e2a

0 reactions
RajKiranS commented on Mar 26, 2020

Thanks for responding, @CyberDem0n

please make sure that nothing overwrites/deletes pgpass

Yes, this turned out to be the issue, because we had the same OS user running two different Patroni clusters. But what is interesting is that we never faced this issue before the upgrade (Postgres version 9.6).
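
In other words, Patroni writes a pgpass file for the connections it spawns on your behalf (pg_basebackup, pg_rewind, replication), and two clusters run by the same OS user can end up overwriting each other's file. The fix is to give each cluster its own path via the postgresql.pgpass setting, or to run the clusters as different OS users. A minimal sanity check for the collision, assuming PyYAML and hypothetical config paths:

# Warn when two Patroni configs would write to the same pgpass file.
# Config paths are hypothetical; postgresql.pgpass is the Patroni setting
# that controls where the file is written.
import os
import yaml

CONFIGS = ["/etc/patroni/mdev.yml", "/etc/patroni/other.yml"]  # hypothetical

def pgpass_path(config_file):
    with open(config_file) as f:
        cfg = yaml.safe_load(f)
    # Assume a home-directory default when postgresql.pgpass is unset;
    # check your Patroni version's docs for the actual default.
    path = cfg.get("postgresql", {}).get("pgpass", "~/.pgpass")
    return os.path.realpath(os.path.expanduser(path))

paths = {c: pgpass_path(c) for c in CONFIGS}
if len(set(paths.values())) < len(paths):
    print("collision: these clusters share a pgpass file:", paths)
else:
    print("ok: each cluster writes its own pgpass file:", paths)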


Top Results From Across the Web

Re: Logical replication hangs up. - PostgreSQL
Hello, wal_sender_timeout did not help (from default 60s to 600s), but I haven't tested receiver timeout since it was sender complaining ...
a Patroni replica can become stuck unable to start ... - GitLab
This should generate the situation where PostgreSQL is in a crash loop. Run gitlab-ctl reconfigure; Try to gitlab-ctl patroni reinitialize- ...
Replication conflicts in PostgreSQL and how to deal with them
Replication conflicts can cause problems with streaming replication. This article tells you what they are and how to deal with them.
How to handle logical replication conflicts in PostgreSQL
In this post, the resolution is achieved by skipping the transaction that conflicts with existing data.
Avoiding infinite loop in a master-master replication
It then gathers a list of which rows have changed since the last ... mechanism developed by 2ndQuadrant for version 9.4 of PostgreSQL....
