
Replica stuck in a loop after PostgreSQL upgrade


Hello,

We upgraded our Patroni-managed PostgreSQL cluster from v9.6 to v11.7. The upgrade itself went fine and the leader node came up, but the replica node is in trouble: it starts streaming from the primary, then shuts down on request (from Patroni?), and the cycle repeats.
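
One quick way to see the cycle from outside the logs is to poll Patroni's REST API on the replica between restarts. A minimal sketch, assuming Patroni's default REST API port 8008 and the Python requests library; the host name is a placeholder, not a value from this report:

# Poll the replica's Patroni REST API and print its reported state.
# Port 8008 is Patroni's default REST API port; the host is hypothetical.
import time
import requests

REPLICA_API = "http://abc-02.com:8008/patroni"  # placeholder host

for _ in range(10):
    try:
        info = requests.get(REPLICA_API, timeout=2).json()
        print(info.get("state"), info.get("role"), info.get("timeline"))
    except requests.RequestException as exc:
        print("replica API unreachable:", exc)
    time.sleep(5)

If the node is flapping the way the logs below suggest, the reported state should change across successive polls instead of staying at "running".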

Postgres logs from replica:

< 2020-03-24 08:07:59.421 GMT > LOG:  database system was shut down in recovery at 2020-03-24 08:07:59 GMT
< 2020-03-24 08:07:59.421 GMT > LOG:  entering standby mode
< 2020-03-24 08:07:59.421 GMT > FATAL:  the database system is starting up
< 2020-03-24 08:07:59.423 GMT > LOG:  redo starts at 3B/CF19A0C0
< 2020-03-24 08:07:59.424 GMT > LOG:  consistent recovery state reached at 3B/CF1CC180
< 2020-03-24 08:07:59.424 GMT > LOG:  invalid record length at 3B/CF1CC180: wanted 24, got 0
< 2020-03-24 08:07:59.424 GMT > LOG:  database system is ready to accept read only connections
< 2020-03-24 08:07:59.445 GMT > LOG:  started streaming WAL from primary at 3B/CF000000 on timeline 6
< 2020-03-24 08:08:09.556 GMT > LOG:  received fast shutdown request
< 2020-03-24 08:08:09.557 GMT > LOG:  aborting any active transactions
< 2020-03-24 08:08:09.557 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:09.557 GMT > FATAL:  terminating walreceiver process due to administrator command
< 2020-03-24 08:08:09.558 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:09.559 GMT > LOG:  shutting down
< 2020-03-24 08:08:09.565 GMT > LOG:  database system is shut down
< 2020-03-24 08:08:09.879 GMT > LOG:  database system was shut down in recovery at 2020-03-24 08:08:09 GMT
< 2020-03-24 08:08:09.879 GMT > LOG:  entering standby mode
< 2020-03-24 08:08:09.879 GMT > FATAL:  the database system is starting up
< 2020-03-24 08:08:09.881 GMT > LOG:  redo starts at 3B/CF19A0C0
< 2020-03-24 08:08:09.882 GMT > LOG:  consistent recovery state reached at 3B/CF1CC2C8
< 2020-03-24 08:08:09.882 GMT > LOG:  invalid record length at 3B/CF1CC2C8: wanted 24, got 0
< 2020-03-24 08:08:09.882 GMT > LOG:  database system is ready to accept read only connections
< 2020-03-24 08:08:09.902 GMT > LOG:  started streaming WAL from primary at 3B/CF000000 on timeline 6
< 2020-03-24 08:08:20.007 GMT > LOG:  received fast shutdown request
< 2020-03-24 08:08:20.008 GMT > LOG:  aborting any active transactions
< 2020-03-24 08:08:20.008 GMT > FATAL:  terminating connection due to administrator command
< 2020-03-24 08:08:20.009 GMT > FATAL:  terminating walreceiver process due to administrator command
< 2020-03-24 08:08:20.009 GMT > LOG:  shutting down
< 2020-03-24 08:08:20.015 GMT > LOG:  database system is shut down

Corresponding Postgres logs from the primary (leader):

< 2020-03-24 08:07:59.465 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:09.921 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:20.362 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:30.820 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1
< 2020-03-24 08:08:41.277 GMT > LOG:  standby "abc-02.com" is now a synchronous standby with priority 1

This is what I see in the Patroni logs on the replica node (timestamps in EST):

Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,109 INFO: closed patroni connection to the postgresql cluster
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.401 GMT > LOG:  listening on IPv4 address "10.10.10.2", port 59673
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,404 INFO: postmaster pid=90955
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.406 GMT > LOG:  listening on Unix socket "/var/tmp/mdev/.s.PGSQL.59673"
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.418 GMT > LOG:  redirecting log output to logging collector process
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:07:59.418 GMT > HINT:  Future log output will appear in directory "pg_log".
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - rejecting connections
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - accepting connections
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,453 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,454 INFO: does not have lock
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,454 INFO: establishing a new patroni connection to the postgres cluster
Mar 24 04:07:59 abc-02.com patroni-mdev[83301]: 2020-03-24 04:07:59,471 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:00 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:00,213: INFO  no handler for on_restart, replica, mdev
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,440 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,440 INFO: does not have lock
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,448 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,570 INFO: closed patroni connection to the postgresql cluster
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.858 GMT > LOG:  listening on IPv4 address "10.10.10.2", port 59673
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,861 INFO: postmaster pid=91053
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.863 GMT > LOG:  listening on Unix socket "/var/tmp/mdev/.s.PGSQL.59673"
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.876 GMT > LOG:  redirecting log output to logging collector process
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: < 2020-03-24 08:08:09.876 GMT > HINT:  Future log output will appear in directory "pg_log".
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - rejecting connections
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: /var/tmp/mdev:59673 - accepting connections
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,906 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,907 INFO: does not have lock
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,907 INFO: establishing a new patroni connection to the postgres cluster
Mar 24 04:08:09 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:09,927 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:10 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:10,664: INFO  no handler for on_restart, replica, mdev
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,896 INFO: Lock owner: abc-01.com; I am abc-02.com
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,896 INFO: does not have lock
Mar 24 04:08:19 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:19,900 INFO: no action.  i am a secondary and i am following a leader
Mar 24 04:08:20 abc-02.com patroni-mdev[83301]: 2020-03-24 04:08:20,020 INFO: closed patroni connection to the postgresql cluster

It's not clear why Patroni keeps closing the connection and restarting Postgres in a loop.

FWIW: I have tried removing the data_dir so that the replica re-bootstraps. The bootstrap succeeds, but the same restart cycle continues.
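
For what it's worth, the primary's logs above suggest the walreceiver reconnects cleanly on every cycle. One way to confirm that from the primary's side is to poll pg_stat_replication; a minimal sketch with psycopg2, where the host, port, and user are placeholders, not values from this report:

# Check the primary's view of its standbys. Connection parameters are
# placeholders; adjust to your cluster.
import psycopg2

conn = psycopg2.connect(host="abc-01.com", port=5432,
                        dbname="postgres", user="postgres")
with conn.cursor() as cur:
    # replay_lsn and sync_state are pg_stat_replication columns on PG 11
    cur.execute("""
        SELECT application_name, state, sync_state,
               pg_current_wal_lsn() - replay_lsn AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for row in cur.fetchall():
        print(row)
conn.close()

A standby that shows up here as streaming and in sync, then disappears and reappears, matches what the logs above show: the connection is established fine and is then torn down from the standby's own side.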

Patroni version: 1.6.3

We never saw this issue while we were on Postgres 9.6.

Any help/leads are much appreciated. Thank you!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
RajKiranS commented on Mar 27, 2020

But what is interesting is that we never faced this issue before the upgrade (Postgres version 9.6).

Perhaps this commit answers the above question: https://github.com/zalando/patroni/commit/85341ff78b9c02f59b4d490e3bf300c352c68e2a

0 reactions
RajKiranS commented on Mar 26, 2020

Thanks for responding, @CyberDem0n

please make sure that nothing overwrites/deletes pgpass

Yes, this turned out to be the issue, because we had the same OS user running two different Patroni clusters. But what is interesting is that we never faced this issue before the upgrade (Postgres version 9.6).
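
In other words, Patroni writes a pgpass file for the connections it spawns on your behalf (pg_basebackup, pg_rewind, replication), and two clusters run by the same OS user can end up overwriting each other's file. The fix is to give each cluster its own path via the postgresql.pgpass setting, or to run the clusters as different OS users. A minimal sanity check for the collision, assuming PyYAML and hypothetical config paths:

# Warn when two Patroni configs would write to the same pgpass file.
# Config paths are hypothetical; postgresql.pgpass is the Patroni setting
# that controls where the file is written.
import os
import yaml

CONFIGS = ["/etc/patroni/mdev.yml", "/etc/patroni/other.yml"]  # hypothetical

def pgpass_path(config_file):
    with open(config_file) as f:
        cfg = yaml.safe_load(f)
    # Assume a home-directory default when postgresql.pgpass is unset;
    # check your Patroni version's docs for the actual default.
    path = cfg.get("postgresql", {}).get("pgpass", "~/.pgpass")
    return os.path.realpath(os.path.expanduser(path))

paths = {c: pgpass_path(c) for c in CONFIGS}
if len(set(paths.values())) < len(paths):
    print("collision: these clusters share a pgpass file:", paths)
else:
    print("ok: each cluster writes its own pgpass file:", paths)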


Top Results From Across the Web

Re: Logical replication hangs up. - PostgreSQL
Hello, wal_sender_timeout did not help (from default 60s to 600s), but I haven't tested receiver timeout since it was sender complaining ...
a Patroni replica can become stuck unable to start ... - GitLab
This should generate the situation where PostgreSQL is in a crash loop. Run gitlab-ctl reconfigure; Try to gitlab-ctl patroni reinitialize- ...
Replication conflicts in PostgreSQL and how to deal with them
Replication conflicts can cause problems with streaming replication. This article tells you what they are and how to deal with them.
How to handle logical replication conflicts in PostgreSQL
In this post, the resolution is achieved by skipping the transaction that conflicts with existing data.
Avoiding infinite loop in a master-master replication
It then gathers a list of which rows have changed since the last ... mechanism developed by 2ndQuadrant for version 9.4 of PostgreSQL....
