Secondary gets into persistently failed state
Using docker-desktop, which is a great testing environment: the cluster goes up and down and all sorts of networking problems happen.
After running with Spilo / Postgres Operator for a while, patroni/spilo gets itself into a terminally bad state:
app-n8n-db-0 postgres 2021-02-16 09:20:41,960 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-0
app-n8n-db-0 postgres 2021-02-16 09:20:42,010 INFO: no action. i am the leader with the lock
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 WARNING: Postgresql is not running.
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,030 INFO: pg_controldata:
app-n8n-db-1 postgres pg_control version number: 1201
app-n8n-db-1 postgres Catalog version number: 201909212
app-n8n-db-1 postgres Database system identifier: 6927231249082552392
app-n8n-db-1 postgres Database cluster state: shut down in recovery
app-n8n-db-1 postgres pg_control last modified: Fri Feb 12 07:06:53 2021
app-n8n-db-1 postgres Latest checkpoint location: 0/52000090
app-n8n-db-1 postgres Latest checkpoint's REDO location: 0/52000058
app-n8n-db-1 postgres Latest checkpoint's REDO WAL file: 0000002B0000000000000052
app-n8n-db-1 postgres Latest checkpoint's TimeLineID: 43
app-n8n-db-1 postgres Latest checkpoint's PrevTimeLineID: 43
app-n8n-db-1 postgres Latest checkpoint's full_page_writes: on
app-n8n-db-1 postgres Latest checkpoint's NextXID: 0:10795
app-n8n-db-1 postgres Latest checkpoint's NextOID: 42421
app-n8n-db-1 postgres Latest checkpoint's NextMultiXactId: 1
app-n8n-db-1 postgres Latest checkpoint's NextMultiOffset: 0
app-n8n-db-1 postgres Latest checkpoint's oldestXID: 480
app-n8n-db-1 postgres Latest checkpoint's oldestXID's DB: 1
app-n8n-db-1 postgres Latest checkpoint's oldestActiveXID: 10795
app-n8n-db-1 postgres Latest checkpoint's oldestMultiXid: 1
app-n8n-db-1 postgres Latest checkpoint's oldestMulti's DB: 1
app-n8n-db-1 postgres Latest checkpoint's oldestCommitTsXid: 0
app-n8n-db-1 postgres Latest checkpoint's newestCommitTsXid: 0
app-n8n-db-1 postgres Time of latest checkpoint: Fri Feb 12 02:53:22 2021
app-n8n-db-1 postgres Fake LSN counter for unlogged rels: 0/3E8
app-n8n-db-1 postgres Minimum recovery ending location: 0/5206A950
app-n8n-db-1 postgres Min recovery ending loc's timeline: 43
app-n8n-db-1 postgres Backup start location: 0/0
app-n8n-db-1 postgres Backup end location: 0/0
app-n8n-db-1 postgres End-of-backup record required: no
app-n8n-db-1 postgres wal_level setting: replica
app-n8n-db-1 postgres wal_log_hints setting: on
app-n8n-db-1 postgres max_connections setting: 100
app-n8n-db-1 postgres max_worker_processes setting: 8
app-n8n-db-1 postgres max_wal_senders setting: 10
app-n8n-db-1 postgres max_prepared_xacts setting: 0
app-n8n-db-1 postgres max_locks_per_xact setting: 64
app-n8n-db-1 postgres track_commit_timestamp setting: off
app-n8n-db-1 postgres Maximum data alignment: 8
app-n8n-db-1 postgres Database block size: 8192
app-n8n-db-1 postgres Blocks per segment of large relation: 131072
app-n8n-db-1 postgres WAL block size: 8192
app-n8n-db-1 postgres Bytes per WAL segment: 16777216
app-n8n-db-1 postgres Maximum length of identifiers: 64
app-n8n-db-1 postgres Maximum columns in an index: 32
app-n8n-db-1 postgres Maximum size of a TOAST chunk: 1996
app-n8n-db-1 postgres Size of a large-object chunk: 2048
app-n8n-db-1 postgres Date/time type storage: 64-bit integers
app-n8n-db-1 postgres Float4 argument passing: by value
app-n8n-db-1 postgres Float8 argument passing: by value
app-n8n-db-1 postgres Data page checksum version: 0
app-n8n-db-1 postgres Mock authentication nonce: 3a5e0f02d33fb40b54d120e7f46f733da05ccb8d61eea848b555a3d7c7109fe3
app-n8n-db-1 postgres
app-n8n-db-1 postgres 2021-02-16 09:20:42,031 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,062 INFO: Local timeline=43 lsn=0/5206A950
app-n8n-db-1 postgres 2021-02-16 09:20:42,076 INFO: master_timeline=105
app-n8n-db-1 postgres 2021-02-16 09:20:42,079 INFO: master: history=40 0/4F0000A0 no recovery target specified
app-n8n-db-1 postgres 41 0/510000A0 no recovery target specified
app-n8n-db-1 postgres 42 0/52000000 no recovery target specified
app-n8n-db-1 postgres 43 0/530000A0 no recovery target specified
app-n8n-db-1 postgres 44 0/530FB538 no recovery target specified
app-n8n-db-1 postgres ...
app-n8n-db-1 postgres 104 0/700000A0 no recovery target specified
app-n8n-db-1 postgres 2021-02-16 09:20:42,080 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,081 INFO: starting as a secondary
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [1-1] 602b8e6a.8f 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [2-1] 602b8e6a.8f 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [3-1] 602b8e6a.8f 0 LOG: starting PostgreSQL 12.5 (Ubuntu 12.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [4-1] 602b8e6a.8f 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [5-1] 602b8e6a.8f 0 LOG: listening on IPv6 address "::", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [6-1] 602b8e6a.8f 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
app-n8n-db-1 postgres 2021-02-16 09:20:42,821 INFO: postmaster pid=143
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [7-1] 602b8e6a.8f 0 LOG: redirecting log output to logging collector process
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [8-1] 602b8e6a.8f 0 HINT: Future log output will appear in directory "../pg_log".
app-n8n-db-1 postgres /var/run/postgresql:5432 - rejecting connections
app-n8n-db-1 postgres /var/run/postgresql:5432 - no response
app-n8n-db-1 postgres 2021-02-16 09:20:42,937 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,938 INFO: failed to start postgres
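Aside: Patroni itself only reports "failed to start postgres" here; the actual startup error goes to the redirected Postgres log files in the "../pg_log" directory mentioned in the HINT above. A quick way to look at them, assuming the default Spilo layout (PGDATA at /home/postgres/pgdata/pgroot/data, so the logs land in /home/postgres/pgdata/pgroot/pg_log) and the container name shown in the log prefix:

```
# Tail the CSV logs written by the logging collector to see why the
# postmaster actually exits (paths are Spilo defaults; adjust if customised).
kubectl exec -n app app-n8n-db-1 -c postgres -- \
  bash -c 'tail -n 100 /home/postgres/pgdata/pgroot/pg_log/*.csv'
```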
And controller logs:
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing databases" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared databases with schemas" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared database \"analytics\"" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing connection pooler from (nil, nil) to (nil, nil)" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="could not get connection pooler secret pooler.app-analytics-db.credentials: secrets \"pooler.app-analytics-db.credentials\" not found" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=info msg="cluster has been synced" cluster-name=app/app-analytics-db pkg=controller worker=0
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=debug msg="unsubscribing from pod \"app/app-n8n-db-1\" events" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=error msg="could not sync cluster: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=controller worker=1
As you can see from the logs above, the DB with index 0 successfully becomes the leader, and when index 1 comes up it goes into a crash loop. Because the full stack on index 1 never starts, the operator can't do much. Restarting one or both of the pods brings them back up in the same faulty state, with only index 0 working.

For me, "no recovery target specified" seems to be the key: index 1 has gone down in some fashion and now can't resynchronise, but there's no reason it shouldn't be able to.
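For what it's worth, when a replica has diverged like this and won't rejoin on its own, the usual escape hatch is to reinitialise it so that it takes a fresh copy from the leader. A minimal sketch, using the pod and cluster names from the logs above and assuming patronictl inside the Spilo container picks up its configuration as it normally does:

```
# Inspect the cluster as Patroni sees it (leader, replicas, timelines).
kubectl exec -n app app-n8n-db-1 -c postgres -- patronictl list

# Wipe the stuck member's data directory and re-clone it from the leader;
# --force skips the interactive confirmation.
kubectl exec -n app app-n8n-db-1 -c postgres -- \
  patronictl reinit app-n8n-db app-n8n-db-1 --force
```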
Issue Analytics
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Patroni is a general tool for Postgres HA, which can run on many different operating systems and environments. It just so happens that in Spilo we configured Postgres to write logs to this specific directory in CSV format. Other people might have a different config, and logs could be written not only to stderr/files, but also to syslog and eventlog on Windows. Figuring out where exactly they are written requires parsing Postgres config file(s); there might be many of them, and not all of them are controlled by Patroni. In fact, an error in the config file could be one of the reasons why Postgres fails to start, so there is no general solution to the issue.
We have a different opinion on that. Postgres logs are quite heavy. I know that some people complain that Patroni logs are too verbose. Well, the volume of postgres logs is about an order of magnitude bigger.
I prefer not to do it. Patroni/Spilo is not a magic tool that can automate 100% of failure scenarios. The primary goal is high availability and automatic failover, and it shines in this area. Joining failed nodes back to the cluster is best effort, and it actually succeeds in more than 99% of cases. The remaining cases are hard to cover: often it's strange behaviour or bugs in Postgres, configuration issues, or human error (yes, quite a few times I've seen people exec into the container and do some crazy stuff).
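For the cases that don't recover automatically, the practical last resort is to throw the broken replica's data away and let it re-clone from the leader: either patronictl reinit as sketched above, or deleting the replica's volume and pod so the StatefulSet recreates them empty. A sketch, assuming the operator's default pgdata-&lt;pod&gt; PVC naming (verify with kubectl get pvc first):

```
# Confirm the PVC name before deleting anything (pgdata-<pod> is an assumption
# based on the default volume claim template).
kubectl get pvc -n app

# Drop the replica's volume and pod; on restart Patroni takes a fresh
# basebackup from the leader instead of trying to reuse the diverged data.
kubectl delete pvc -n app pgdata-app-n8n-db-1
kubectl delete pod -n app app-n8n-db-1
```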