Secondary gets into persistently failed state

I'm using docker-desktop, which is a great testing environment: the cluster goes up and down and all sorts of networking problems happen.

After running with Spilo / Postgres Operator for a while, Patroni/Spilo gets itself into a terminally bad state:

app-n8n-db-0 postgres 2021-02-16 09:20:41,960 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-0
app-n8n-db-0 postgres 2021-02-16 09:20:42,010 INFO: no action.  i am the leader with the lock
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 WARNING: Postgresql is not running.
app-n8n-db-1 postgres 2021-02-16 09:20:42,018 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,030 INFO: pg_controldata:
app-n8n-db-1 postgres   pg_control version number: 1201
app-n8n-db-1 postgres   Catalog version number: 201909212
app-n8n-db-1 postgres   Database system identifier: 6927231249082552392
app-n8n-db-1 postgres   Database cluster state: shut down in recovery
app-n8n-db-1 postgres   pg_control last modified: Fri Feb 12 07:06:53 2021
app-n8n-db-1 postgres   Latest checkpoint location: 0/52000090
app-n8n-db-1 postgres   Latest checkpoint's REDO location: 0/52000058
app-n8n-db-1 postgres   Latest checkpoint's REDO WAL file: 0000002B0000000000000052
app-n8n-db-1 postgres   Latest checkpoint's TimeLineID: 43
app-n8n-db-1 postgres   Latest checkpoint's PrevTimeLineID: 43
app-n8n-db-1 postgres   Latest checkpoint's full_page_writes: on
app-n8n-db-1 postgres   Latest checkpoint's NextXID: 0:10795
app-n8n-db-1 postgres   Latest checkpoint's NextOID: 42421
app-n8n-db-1 postgres   Latest checkpoint's NextMultiXactId: 1
app-n8n-db-1 postgres   Latest checkpoint's NextMultiOffset: 0
app-n8n-db-1 postgres   Latest checkpoint's oldestXID: 480
app-n8n-db-1 postgres   Latest checkpoint's oldestXID's DB: 1
app-n8n-db-1 postgres   Latest checkpoint's oldestActiveXID: 10795
app-n8n-db-1 postgres   Latest checkpoint's oldestMultiXid: 1
app-n8n-db-1 postgres   Latest checkpoint's oldestMulti's DB: 1
app-n8n-db-1 postgres   Latest checkpoint's oldestCommitTsXid: 0
app-n8n-db-1 postgres   Latest checkpoint's newestCommitTsXid: 0
app-n8n-db-1 postgres   Time of latest checkpoint: Fri Feb 12 02:53:22 2021
app-n8n-db-1 postgres   Fake LSN counter for unlogged rels: 0/3E8
app-n8n-db-1 postgres   Minimum recovery ending location: 0/5206A950
app-n8n-db-1 postgres   Min recovery ending loc's timeline: 43
app-n8n-db-1 postgres   Backup start location: 0/0
app-n8n-db-1 postgres   Backup end location: 0/0
app-n8n-db-1 postgres   End-of-backup record required: no
app-n8n-db-1 postgres   wal_level setting: replica
app-n8n-db-1 postgres   wal_log_hints setting: on
app-n8n-db-1 postgres   max_connections setting: 100
app-n8n-db-1 postgres   max_worker_processes setting: 8
app-n8n-db-1 postgres   max_wal_senders setting: 10
app-n8n-db-1 postgres   max_prepared_xacts setting: 0
app-n8n-db-1 postgres   max_locks_per_xact setting: 64
app-n8n-db-1 postgres   track_commit_timestamp setting: off
app-n8n-db-1 postgres   Maximum data alignment: 8
app-n8n-db-1 postgres   Database block size: 8192
app-n8n-db-1 postgres   Blocks per segment of large relation: 131072
app-n8n-db-1 postgres   WAL block size: 8192
app-n8n-db-1 postgres   Bytes per WAL segment: 16777216
app-n8n-db-1 postgres   Maximum length of identifiers: 64
app-n8n-db-1 postgres   Maximum columns in an index: 32
app-n8n-db-1 postgres   Maximum size of a TOAST chunk: 1996
app-n8n-db-1 postgres   Size of a large-object chunk: 2048
app-n8n-db-1 postgres   Date/time type storage: 64-bit integers
app-n8n-db-1 postgres   Float4 argument passing: by value
app-n8n-db-1 postgres   Float8 argument passing: by value
app-n8n-db-1 postgres   Data page checksum version: 0
app-n8n-db-1 postgres   Mock authentication nonce: 3a5e0f02d33fb40b54d120e7f46f733da05ccb8d61eea848b555a3d7c7109fe3
app-n8n-db-1 postgres
app-n8n-db-1 postgres 2021-02-16 09:20:42,031 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,062 INFO: Local timeline=43 lsn=0/5206A950
app-n8n-db-1 postgres 2021-02-16 09:20:42,076 INFO: master_timeline=105
app-n8n-db-1 postgres 2021-02-16 09:20:42,079 INFO: master: history=40	0/4F0000A0	no recovery target specified
app-n8n-db-1 postgres 41	0/510000A0	no recovery target specified
app-n8n-db-1 postgres 42	0/52000000	no recovery target specified
app-n8n-db-1 postgres 43	0/530000A0	no recovery target specified
app-n8n-db-1 postgres 44	0/530FB538	no recovery target specified
app-n8n-db-1 postgres ...
app-n8n-db-1 postgres 104	0/700000A0	no recovery target specified
app-n8n-db-1 postgres 2021-02-16 09:20:42,080 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,081 INFO: starting as a secondary
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [1-1] 602b8e6a.8f 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [2-1] 602b8e6a.8f 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [3-1] 602b8e6a.8f 0     LOG:  starting PostgreSQL 12.5 (Ubuntu 12.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [4-1] 602b8e6a.8f 0     LOG:  listening on IPv4 address "0.0.0.0", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [5-1] 602b8e6a.8f 0     LOG:  listening on IPv6 address "::", port 5432
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [6-1] 602b8e6a.8f 0     LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
app-n8n-db-1 postgres 2021-02-16 09:20:42,821 INFO: postmaster pid=143
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [7-1] 602b8e6a.8f 0     LOG:  redirecting log output to logging collector process
app-n8n-db-1 postgres 2021-02-16 09:20:42 UTC [143]: [8-1] 602b8e6a.8f 0     HINT:  Future log output will appear in directory "../pg_log".
app-n8n-db-1 postgres /var/run/postgresql:5432 - rejecting connections
app-n8n-db-1 postgres /var/run/postgresql:5432 - no response
app-n8n-db-1 postgres 2021-02-16 09:20:42,937 INFO: Lock owner: app-n8n-db-0; I am app-n8n-db-1
app-n8n-db-1 postgres 2021-02-16 09:20:42,938 INFO: failed to start postgres

And controller logs:

postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing databases" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared databases with schemas" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing prepared database \"analytics\"" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="closing database connection" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="syncing connection pooler from (nil, nil) to (nil, nil)" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=debug msg="could not get connection pooler secret pooler.app-analytics-db.credentials: secrets \"pooler.app-analytics-db.credentials\" not found" cluster-name=app/app-analytics-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:07:46Z" level=info msg="cluster has been synced" cluster-name=app/app-analytics-db pkg=controller worker=0
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=debug msg="unsubscribing from pod \"app/app-n8n-db-1\" events" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=warning msg="error while syncing cluster state: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=cluster
postgres-operator-85db7bc87b-td8l7 postgres-operator time="2021-02-16T09:18:03Z" level=error msg="could not sync cluster: could not sync statefulsets: could not recreate pods: could not recreate replica pod \"app/app-n8n-db-1\": pod label wait timeout" cluster-name=app/app-n8n-db pkg=controller worker=1

As you can see from the logs above, the DB with index 0 is successfully the leader, and when index 1 comes up, it goes into a crash loop. Because the full stack on index 1 never starts, the operator can’t do much.

Restarting one or both of the pods makes them come back up in the same faulty state with only 0 working.

To me, “no recovery target specified” seems to be the key: index 1 has gone down in some fashion and now can’t resynchronise. But there’s no reason it shouldn’t be able to.
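For context, the “no recovery target specified” text is the reason column of a PostgreSQL timeline history file: each line records a timeline ID, the LSN at which that timeline ended, and why the switch happened. The Python sketch below is only an illustration built from the values logged above, not Patroni's actual logic, and the function names are made up for this example. It checks whether the replica's last known position (timeline 43, 0/5206A950) lies beyond the point where the leader's history left timeline 43.

```python
from typing import List, Tuple


def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/5206A950' into an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def parse_history(history: str) -> List[Tuple[int, int, str]]:
    """Parse history lines of the form '<timeline> <switch LSN> <reason>'."""
    entries = []
    for line in history.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or elided lines
        reason = parts[2] if len(parts) > 2 else ""
        entries.append((int(parts[0]), parse_lsn(parts[1]), reason))
    return entries


def wrote_past_switch(local_timeline: int, local_lsn: str, history: str) -> bool:
    """Return True if the local position is beyond the LSN at which the
    leader's history switched away from the local timeline."""
    for timeline, switch_lsn, _reason in parse_history(history):
        if timeline == local_timeline:
            return parse_lsn(local_lsn) > switch_lsn
    raise LookupError(f"timeline {local_timeline} not found in history")


# History entries copied from the log excerpt above (elided lines left out).
HISTORY = (
    "40\t0/4F0000A0\tno recovery target specified\n"
    "41\t0/510000A0\tno recovery target specified\n"
    "42\t0/52000000\tno recovery target specified\n"
    "43\t0/530000A0\tno recovery target specified\n"
    "44\t0/530FB538\tno recovery target specified\n"
)

print(wrote_past_switch(43, "0/5206A950", HISTORY))  # False
```

With the logged values the check returns False: the replica's position is not past the switch point, which is at least consistent with the expectation that it should be able to rejoin. It does not explain why Postgres failed to start; that information would be in the Postgres log itself.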

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

CyberDem0n commented, Feb 17, 2021 (1 reaction)

Patroni is a general tool for Postgres HA that can run on many different operating systems and environments. It just so happens that in Spilo we configured Postgres to write logs to this specific directory in CSV format. Other people might have a different config, and logs could be written not only to stderr or files, but also to syslog, or to eventlog on Windows. Figuring out where exactly they are written requires parsing the Postgres config file(s). Yes, there might be many of them, and not all of them are controlled by Patroni. In fact, an error in a config file could be one of the reasons Postgres fails to start, so there is no general solution to the issue.
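For the Spilo case specifically, where the log above reports that output goes to "../pg_log" and the logs are in CSV format, a small script can pull the serious entries out of one log file. This is a hedged sketch, not part of Patroni or Spilo: the function name is made up, and the column positions assume the standard PostgreSQL csvlog layout, in which the first column is the timestamp, the twelfth is the error severity, and the fourteenth is the message.

```python
import csv
from typing import Iterable, List, Tuple

# Zero-based column positions, assuming the standard csvlog layout.
TIME_COLUMN = 0
SEVERITY_COLUMN = 11
MESSAGE_COLUMN = 13

SERIOUS = {"ERROR", "FATAL", "PANIC"}


def serious_entries(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (time, severity, message) for ERROR, FATAL and PANIC rows."""
    found = []
    for row in csv.reader(lines):
        if len(row) <= MESSAGE_COLUMN:
            continue  # not a complete csvlog row
        if row[SEVERITY_COLUMN] in SERIOUS:
            found.append(
                (row[TIME_COLUMN], row[SEVERITY_COLUMN], row[MESSAGE_COLUMN])
            )
    return found
```

To use it, open one of the CSV files in the log directory with `open(path, newline="")` and pass the file object to `serious_entries`; the exact file name depends on the configured `log_filename`, so it is left out here. Passing `newline=""` matters because csvlog messages can span multiple lines.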

CyberDem0n commented, Feb 16, 2021 (1 reaction)

"Let’s push that log into the logs"

We have a different opinion on that. Postgres logs are quite heavy. I know that some people complain that Patroni logs are too verbose. Well, the volume of Postgres logs is about an order of magnitude bigger.

"but I’ll keep this issue open until the next time it happens."

I prefer not to do it. Patroni/Spilo is not a magic tool that can automate 100% of failure scenarios. The primary goal is high availability and automatic failover, and it shines in this area. Joining failed nodes back to the cluster is best effort, and it actually succeeds in more than 99% of cases. The remaining cases are hard to cover. Often the cause is strange behavior or bugs in Postgres, configuration issues, or human error (yes, I’ve seen people exec into the container and do some crazy stuff quite a few times).
