Kubernetes API server becomes unresponsive when using pg_restore
Hi there,
I’m trying to restore a backup from a plain PostgreSQL 9.6 instance into a Patroni-managed PostgreSQL 10 cluster. I’m running pg_restore -Upostgres -d beta -Cc -Ft dump.tar
The database starts restoring data, but after a while I get this error:
pg_restore: [archiver (db)] error returned by PQputCopyData: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
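For reference, here is the restore command above with its short options spelled out (just a restatement of the invocation, not a suggested change):

```bash
# Same invocation as above, with long option names:
#   --create (-C)       : emit CREATE DATABASE before restoring into it
#   --clean (-c)        : drop database objects before recreating them
#   --format=tar (-F t) : the dump is a tar-format archive
pg_restore --username=postgres --dbname=beta --create --clean --format=tar dump.tar
```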
After checking the Patroni logs I figured out that while the backup is being restored the Kubernetes API server becomes unresponsive and Patroni gets timeout errors; when that happens, Patroni apparently restarts the database and the restore process fails. I removed the resource limits from the API server, but it did not solve the issue. Postgres logs at the moment of the restart:
2018-01-26 11:59:46.190 UTC,,,1102,,5a6b17a4.44e,3,,2018-01-26 11:57:24 UTC,10/31,2669,LOG,00000,"automatic analyze of table ""beta.public.xxx"" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.35 s",,,,,,,,,""
2018-01-26 11:59:47.122 UTC,,,1102,,5a6b17a4.44e,4,,2018-01-26 11:57:24 UTC,10/33,2670,LOG,00000,"automatic analyze of table ""beta.public.yyy"" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s",,,,,,,,,""
2018-01-26 11:59:47.867 UTC,,,1102,,5a6b17a4.44e,5,,2018-01-26 11:57:24 UTC,10/35,2671,LOG,00000,"automatic analyze of table ""beta.public.zzz"" system usage: CPU: user: 0.03 s, system: 0.02 s, elapsed: 0.58 s",,,,,,,,,""
2018-01-26 11:59:54.100 UTC,,,1382,"[local]",5a6b183a.566,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.101 UTC,"postgres","beta",1382,"[local]",5a6b183a.566,2,"authentication",2018-01-26 11:59:54 UTC,10/37,0,LOG,00000,"connection authorized: user=postgres database=beta",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,,,1383,"[local]",5a6b183a.567,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,,,1384,"[local]",5a6b183a.568,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,"postgres","postgres",1383,"[local]",5a6b183a.567,2,"authentication",2018-01-26 11:59:54 UTC,13/7,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2018-01-26 11:59:54.106 UTC,"postgres","postgres",1384,"[local]",5a6b183a.568,2,"authentication",2018-01-26 11:59:54 UTC,14/3,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2018-01-26 12:00:07.476 UTC,"postgres","beta",1382,"[local]",5a6b183a.566,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.376 user=postgres database=beta host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:07.493 UTC,"postgres","postgres",1384,"[local]",5a6b183a.568,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.388 user=postgres database=postgres host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:07.494 UTC,"postgres","postgres",1383,"[local]",5a6b183a.567,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.390 user=postgres database=postgres host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:14.741 UTC,,,61,,5a6b14b6.3d,3,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"received fast shutdown request",,,,,,,,,""
2018-01-26 12:00:14.842 UTC,,,61,,5a6b14b6.3d,4,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"aborting any active transactions",,,,,,,,,""
2018-01-26 12:00:14.843 UTC,"postgres","beta",493,"[local]",5a6b166b.1ed,1529,"COPY",2018-01-26 11:52:11 UTC,8/3363,2665,FATAL,57P01,"terminating connection due to administrator command",,,,,"COPY XXX, line 7342251: ""2017-04-05 13:25:00.824206 9008741 ...........;
",,,"pg_restore"
2018-01-26 12:00:14.843 UTC,"postgres","postgres",80,"[local]",5a6b14b8.50,3,"idle",2018-01-26 11:44:56 UTC,5/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"Patroni"
2018-01-26 12:00:14.843 UTC,,,71,,5a6b14b6.47,2,,2018-01-26 11:44:54 UTC,2/0,0,LOG,00000,"pg_cron scheduler shutting down",,,,,,,,,""
2018-01-26 12:00:14.843 UTC,"postgres","postgres",80,"[local]",5a6b14b8.50,4,"idle",2018-01-26 11:44:56 UTC,,0,LOG,00000,"disconnection: session time: 0:15:18.829 user=postgres database=postgres host=[local]",,,,,,,,,"Patroni"
2018-01-26 12:00:14.844 UTC,"postgres","beta",493,"[local]",5a6b166b.1ed,1530,"COPY",2018-01-26 11:52:11 UTC,,0,LOG,00000,"disconnection: session time: 0:08:03.634 user=postgres database=beta host=[local]",,,,,,,,,"pg_restore"
2018-01-26 12:00:14.848 UTC,,,61,,5a6b14b6.3d,5,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"worker process: logical replication launcher (PID 73) exited with exit code 1",,,,,,,,,""
2018-01-26 12:00:14.855 UTC,,,61,,5a6b14b6.3d,6,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"worker process: bg_mon (PID 72) exited with exit code 1",,,,,,,,,""
2018-01-26 12:00:14.862 UTC,,,1400,"[local]",5a6b184e.578,1,"",2018-01-26 12:00:14 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 12:00:14.862 UTC,,,1400,"[local]",5a6b184e.578,2,"",2018-01-26 12:00:14 UTC,,0,LOG,00000,"PID 493 in cancel request did not match any process",,,,,,,,,""
Here are the Patroni logs at the time of the timeout:
2018-01-26 11:53:18,189 INFO: does not have lock
2018-01-26 11:53:18,222 INFO: no action. i am a secondary and i am following a leader
2018-01-26 11:53:29,178 INFO: Lock owner: patroni-0; I am patroni-1
2018-01-26 11:53:29,178 INFO: does not have lock
2018-01-26 11:53:29,271 INFO: no action. i am a secondary and i am following a leader
2018-01-26 11:53:41,770 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:41,770 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:45,119 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:45,119 WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:48,471 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:48,471 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:51,821 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.5/ssl.py", line 791, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.5/ssl.py", line 575, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
...
After pg_restore crashes, the API server comes back and works normally. Turning off pgq and autovacuum, or using patronictl pause, didn’t seem to resolve this.
This raises several questions:
- What causes the API server to stop responding? Is Patroni making a lot of requests to it? I checked under normal conditions and didn’t see any issues, and the server always had resources like CPU and RAM to spare.
- When the API server isn’t responding, why does Patroni restart the database? Couldn’t it just wait until the server comes back without taking any action?
- Should we increase the timeout period?
- Is there any other method to restore the backup?
Top GitHub Comments
Every pod is doing 2 read requests (list pods + list endpoints) every 10 seconds, plus 1 write request (update pod). In addition to that, the master is doing an update-leader-endpoint write request. That’s not very much for the k8s API.
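(For a sense of scale, those reads correspond roughly to the following queries; the label selector is copied from the Patroni log above and the namespace is default as in this deployment:)

```bash
# Roughly what each Patroni pod reads from the API server every loop_wait (~10 s):
kubectl get pods -n default \
  -l cluster=patroni,application=patroni,app=patroni,release=patroni
kubectl get endpoints -n default \
  -l cluster=patroni,application=patroni,app=patroni,release=patroni
```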
When Patroni can’t access the API, it has no other option than to demote the master to read-only, because it doesn’t know what is going on. It could be that the API is not accessible from a specific Pod due to a network partition, while meanwhile some other Pod sees that there is no master and promotes.
It’s possible to increase retry_timeout and ttl by calling
patronictl edit-config
on one of the Pods. There is one rule you should follow when changing ttl, loop_wait and retry_timeout:
ttl >= loop_wait + retry_timeout * 2
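A minimal sketch of such a change (the values are only an example that satisfies the rule above, not a recommendation; the -s/--set flag assumes a patronictl version that supports it, otherwise run patronictl edit-config and adjust the values interactively):

```bash
# Example values only: loop_wait + retry_timeout*2 = 10 + 2*20 = 50 <= ttl = 60
patronictl edit-config -s ttl=60 -s loop_wait=10 -s retry_timeout=20
```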
It’s hard to know for sure, but I think that it’s definitely possible.
This looks like CPU starvation of the pod running Patroni. I don’t think we can do anything here, so closing the issue.
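If you want to verify the CPU-starvation theory, a rough way to check it with kubectl (a sketch; kubectl top requires metrics-server or heapster, and the pod name patroni-1 is taken from the logs above):

```bash
# Current CPU/memory usage of the Patroni pods:
kubectl top pod -n default -l application=patroni

# CPU/memory requests and limits configured on the demoted replica:
kubectl describe pod patroni-1 -n default | grep -i -E -A3 'limits|requests'
```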