Secondary gets into persistently failed state
Using docker-desktop, which is a great testing environment: the cluster goes up and down and all sorts of networking problems happen.
After running with Spilo / Postgres Operator for a while, patroni/spilo gets itself into a terminally bad state:
app-n8n-db-0 postgres 2021-02-16 09:20:41,960 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-0
app-n8n-db-0 postgres 2021-02-16 09:20:42,010 INFO: no action. i am the leader with the lock
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 WARNING: Postgresql is not running.
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,030 INFO: pg_controldata:
app-n8n-db-1 postgres pg_control version number: 1201
app-n8n-db-1 postgres Catalog version number: 201909212
app-n8n-db-1 postgres Database system identifier: 6927231249082552392
app-n8n-db-1 postgres Database cluster state: shut down in recovery
app-n8n-db-1 postgres pg_control last modified: Fri Feb 12 07:06:53 2021
app-n8n-db-1 postgres Latest checkpoint location: 0/52000090
app-n8n-db-1 postgres Latest checkpoint's REDO location: 0/52000058
app-n8n-db-1 postgres Latest checkpoint's REDO WAL file: 0000002B0000000000000052
app-n8n-db-1 postgres Latest checkpoint's TimeLineID: 43
app-n8n-db-1 postgres Latest checkpoint's PrevTimeLineID: 43
app-n8n-db-1 postgres Latest checkpoint's full_page_writes: on
app-n8n-db-1 postgres Latest checkpoint's NextXID: 0:10795
app-n8n-db-1 postgres Latest checkpoint's NextOID: 42421
app-n8n-db-1 postgres Latest checkpoint's NextMultiXactId: 1
app-n8n-db-1 postgres Latest checkpoint's NextMultiOffset: 0
app-n8n-db-1 postgres Latest checkpoint's oldestXID: 480
app-n8n-db-1 postgres Latest checkpoint's oldestXID's DB: 1
app-n8n-db-1 postgres Latest checkpoint's oldestActiveXID: 10795
app-n8n-db-1 postgres Latest checkpoint's oldestMultiXid: 1
app-n8n-db-1 postgres Latest checkpoint's oldestMulti's DB: 1
app-n8n-db-1 postgres Latest checkpoint's oldestCommitTsXid: 0
app-n8n-db-1 postgres Latest checkpoint's newestCommitTsXid: 0
app-n8n-db-1 postgres Time of latest checkpoint: Fri Feb 12 02:53:22 2021
app-n8n-db-1 postgres Fake LSN counter for unlogged rels: 0/3E8
app-n8n-db-1 postgres Minimum recovery ending location: 0/5206A950
app-n8n-db-1 postgres Min recovery ending loc's timeline: 43
app-n8n-db-1 postgres Backup start location: 0/0
app-n8n-db-1 postgres Backup end location: 0/0
app-n8n-db-1 postgres End-of-backup record required: no
app-n8n-db-1 postgres wal_level setting: replica
app-n8n-db-1 postgres wal_log_hints setting: on
app-n8n-db-1 postgres max_connections setting: 100
app-n8n-db-1 postgres max_worker_processes setting: 8
app-n8n-db-1 postgres max_wal_senders setting: 10
app-n8n-db-1 postgres max_prepared_xacts setting: 0
app-n8n-db-1 postgres max_locks_per_xact setting: 64
app-n8n-db-1 postgres track_commit_timestamp setting: off
app-n8n-db-1 postgres Maximum data alignment: 8
app-n8n-db-1 postgres Database block size: 8192
app-n8n-db-1 postgres Blocks per segment of large relation: 131072
app-n8n-db-1 postgres WAL block size: 8192
app-n8n-db-1 postgres Bytes per WAL segment: 16777216
app-n8n-db-1 postgres Maximum length of identifiers: 64
app-n8n-db-1 postgres Maximum columns in an index: 32
app-n8n-db-1 postgres Maximum size of a TOAST chunk: 1996
app-n8n-db-1 postgres Size of a large-object chunk: 2048
app-n8n-db-1 postgres Date/time type storage: 64-bit integers
app-n8n-db-1 postgres Float4 argument passing: by value
app-n8n-db-1 postgres Float8 argument passing: by value
app-n8n-db-1 postgres Data page checksum version: 0
app-n8n-db-1 postgres Mock authentication nonce: 3a5e0f02d33fb40b54d120e7f46f733da05ccb8d61eea848b555a3d7c7109fe3
app-n8n-db-1 postgres
app-n8n-db-1 postgres 2021-02-16 09:20:42,031 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,062 INFO: Local timeline=43 lsn=0/5206A950
app-n8n-db-1 postgres 2021-02-16 09:20:42,076 INFO: master_timeline=105
app-n8n-db-1 postgres 2021-02-16 09:20:42,079 INFO: master: history=40 0/4F0000A0 no recovery target specified
app-n8n-db-1 postgres 41 0/510000A0 no recovery target specified
app-n8n-db-1 postgres 42 0/52000000 no recovery target specified
app-n8n-db-1 postgres 43 0/530000A0 no recovery target specified
app-n8n-db-1 postgres 44 0/530FB538 no recovery target specified
app-n8n-db-1 postgres ...
app-n8n-db-1 postgres 104 0/700000A0 no recovery target specified
app-n8n-db-1 postgres 2021-02-16 09:20:42,080 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,081 INFO: starting as a secondary
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [1-1] 602b8e6a.8f 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [2-1] 602b8e6a.8f 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [3-1] 602b8e6a.8f 0 LOG: starting PostgreSQL 12.5 (Ubuntu 12.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [4-1] 602b8e6a.8f 0 LOG: listening on IPv4 address "0.0.0.0", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [5-1] 602b8e6a.8f 0 LOG: listening on IPv6 address "::", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [6-1] 602b8e6a.8f 0 LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
app-n8n-db-1 postgres 2021-02-16 09:20:42,821 INFO: postmaster pid=143
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [7-1] 602b8e6a.8f 0 LOG: redirecting log output to logging collector process
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [8-1] 602b8e6a.8f 0 HINT: Future log output will appear in directory "../pg_log".
app-n8n-db-1 postgres /var/run/postgresql:5432 - rejecting connections
app-n8n-db-1 postgres /var/run/postgresql:5432 - no response
app-n8n-db-1 postgres 2021-02-16 09:20:42,937 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,938 INFO: failed to start postgres
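Aside: Patroni itself only reports "failed to start postgres" here; the actual startup error goes to the redirected Postgres log files in the "../pg_log" directory mentioned in the HINT above. A quick way to look at them, assuming the default Spilo layout (PGDATA at /home/postgres/pgdata/pgroot/data, so the logs land in /home/postgres/pgdata/pgroot/pg_log) and the container name shown in the log prefix:

```
# Tail the CSV logs written by the logging collector to see why the
# postmaster actually exits (paths are Spilo defaults; adjust if customised).
kubectl exec -n app app-n8n-db-1 -c postgres -- \
  bash -c 'tail -n 100 /home/postgres/pgdata/pgroot/pg_log/*.csv'
```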
And controller logs:
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing databases" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared databases with schemas" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared database \"analytics\"" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing connection pooler from (nil, nil) to (nil, nil)" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="could not get connection pooler secret pooler.app-analytics-db.credentials: secrets \"pooler.app-analytics-db.credentials\" not found" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=info msg="cluster has been synced" cluster-name=app/app-analytics-db pkg=controller worker=0
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=debug msg="unsubscribing from pod \"app/app-n8n-db-1\" events" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=error msg="could not sync cluster: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=controller worker=1
As you can see from the logs above, the DB with index 0 successfully becomes the leader, and when index 1 comes up it goes into a crash loop. Because the full stack on index 1 never starts, the operator can't do much. Restarting one or both of the pods brings them back up in the same faulty state, with only index 0 working.

For me, "no recovery target specified" seems to be the key: index 1 has gone down in some fashion and now can't resynchronise, but there's no reason it shouldn't be able to.
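For what it's worth, when a replica has diverged like this and won't rejoin on its own, the usual escape hatch is to reinitialise it so that it takes a fresh copy from the leader. A minimal sketch, using the pod and cluster names from the logs above and assuming patronictl inside the Spilo container picks up its configuration as it normally does:

```
# Inspect the cluster as Patroni sees it (leader, replicas, timelines).
kubectl exec -n app app-n8n-db-1 -c postgres -- patronictl list

# Wipe the stuck member's data directory and re-clone it from the leader;
# --force skips the interactive confirmation.
kubectl exec -n app app-n8n-db-1 -c postgres -- \
  patronictl reinit app-n8n-db app-n8n-db-1 --force
```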
Issue Analytics
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Patroni is a general tool for Postgres HA, which can run on many different operating systems and environments. It just so happens that in Spilo we configured Postgres to write logs to this specific directory in CSV format. Other people might have a different config, and logs could be written not only to stderr/files, but also to syslog and eventlog on Windows. Figuring out where exactly they are written requires parsing Postgres config file(s); there might be many of them, and not all of them are controlled by Patroni. In fact, an error in the config file could be one of the reasons why Postgres fails to start, so there is no general solution to the issue.
We have a different opinion on that. Postgres logs are quite heavy. I know that some people complain that Patroni logs are too verbose. Well, the volume of postgres logs is about an order of magnitude bigger.
I prefer not to do it. Patroni/Spilo is not a magic tool that can automate 100% of failure scenarios. The primary goal is high availability and automatic failover, and it shines in this area. Joining failed nodes back to the cluster is best effort, and it actually succeeds in more than 99% of cases. The remaining cases are hard to cover: often it's strange behaviour or bugs in Postgres, configuration issues, or human error (yes, quite a few times I've seen people exec into the container and do some crazy stuff).
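For the cases that don't recover automatically, the practical last resort is to throw the broken replica's data away and let it re-clone from the leader: either patronictl reinit as sketched above, or deleting the replica's volume and pod so the StatefulSet recreates them empty. A sketch, assuming the operator's default pgdata-&lt;pod&gt; PVC naming (verify with kubectl get pvc first):

```
# Confirm the PVC name before deleting anything (pgdata-<pod> is an assumption
# based on the default volume claim template).
kubectl get pvc -n app

# Drop the replica's volume and pod; on restart Patroni takes a fresh
# basebackup from the leader instead of trying to reuse the diverged data.
kubectl delete pvc -n app pgdata-app-n8n-db-1
kubectl delete pod -n app app-n8n-db-1
```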