Kubernetes API server becomes unresponsive when using pg_restore
Hi there,
I’m trying to restore a backup from a plain PostgreSQL 9.6 instance into a Patroni-managed PostgreSQL 10 cluster. I’m running pg_restore -Upostgres -d beta -Cc -Ft dump.tar
The database starts restoring data, but after a while I get this error:
pg_restore: [archiver (db)] error returned by PQputCopyData: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
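For reference, here is the restore command above with its short options spelled out (just a restatement of the invocation, not a suggested change):

```bash
# Same invocation as above, with long option names:
#   --create (-C)       : emit CREATE DATABASE before restoring into it
#   --clean (-c)        : drop database objects before recreating them
#   --format=tar (-F t) : the dump is a tar-format archive
pg_restore --username=postgres --dbname=beta --create --clean --format=tar dump.tar
```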
After checking the Patroni logs I figured out that while the backup is being restored the Kubernetes API server becomes unresponsive and Patroni gets timeout errors; when that happens, Patroni apparently restarts the database and the restore process fails. I removed the resource limits from the API server, but it did not solve the issue. Postgres logs at the moment of the restart:
2018-01-26 11:59:46.190 UTC,,,1102,,5a6b17a4.44e,3,,2018-01-26 11:57:24 UTC,10/31,2669,LOG,00000,"automatic analyze of table ""beta.public.xxx"" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.35 s",,,,,,,,,""
2018-01-26 11:59:47.122 UTC,,,1102,,5a6b17a4.44e,4,,2018-01-26 11:57:24 UTC,10/33,2670,LOG,00000,"automatic analyze of table ""beta.public.yyy"" system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s",,,,,,,,,""
2018-01-26 11:59:47.867 UTC,,,1102,,5a6b17a4.44e,5,,2018-01-26 11:57:24 UTC,10/35,2671,LOG,00000,"automatic analyze of table ""beta.public.zzz"" system usage: CPU: user: 0.03 s, system: 0.02 s, elapsed: 0.58 s",,,,,,,,,""
2018-01-26 11:59:54.100 UTC,,,1382,"[local]",5a6b183a.566,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.101 UTC,"postgres","beta",1382,"[local]",5a6b183a.566,2,"authentication",2018-01-26 11:59:54 UTC,10/37,0,LOG,00000,"connection authorized: user=postgres database=beta",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,,,1383,"[local]",5a6b183a.567,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,,,1384,"[local]",5a6b183a.568,1,"",2018-01-26 11:59:54 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 11:59:54.105 UTC,"postgres","postgres",1383,"[local]",5a6b183a.567,2,"authentication",2018-01-26 11:59:54 UTC,13/7,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2018-01-26 11:59:54.106 UTC,"postgres","postgres",1384,"[local]",5a6b183a.568,2,"authentication",2018-01-26 11:59:54 UTC,14/3,0,LOG,00000,"connection authorized: user=postgres database=postgres",,,,,,,,,""
2018-01-26 12:00:07.476 UTC,"postgres","beta",1382,"[local]",5a6b183a.566,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.376 user=postgres database=beta host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:07.493 UTC,"postgres","postgres",1384,"[local]",5a6b183a.568,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.388 user=postgres database=postgres host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:07.494 UTC,"postgres","postgres",1383,"[local]",5a6b183a.567,3,"idle",2018-01-26 11:59:54 UTC,,0,LOG,00000,"disconnection: session time: 0:00:13.390 user=postgres database=postgres host=[local]",,,,,,,,,"pgq ticker"
2018-01-26 12:00:14.741 UTC,,,61,,5a6b14b6.3d,3,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"received fast shutdown request",,,,,,,,,""
2018-01-26 12:00:14.842 UTC,,,61,,5a6b14b6.3d,4,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"aborting any active transactions",,,,,,,,,""
2018-01-26 12:00:14.843 UTC,"postgres","beta",493,"[local]",5a6b166b.1ed,1529,"COPY",2018-01-26 11:52:11 UTC,8/3363,2665,FATAL,57P01,"terminating connection due to administrator command",,,,,"COPY XXX, line 7342251: ""2017-04-05 13:25:00.824206 9008741 ...........;
",,,"pg_restore"
2018-01-26 12:00:14.843 UTC,"postgres","postgres",80,"[local]",5a6b14b8.50,3,"idle",2018-01-26 11:44:56 UTC,5/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"Patroni"
2018-01-26 12:00:14.843 UTC,,,71,,5a6b14b6.47,2,,2018-01-26 11:44:54 UTC,2/0,0,LOG,00000,"pg_cron scheduler shutting down",,,,,,,,,""
2018-01-26 12:00:14.843 UTC,"postgres","postgres",80,"[local]",5a6b14b8.50,4,"idle",2018-01-26 11:44:56 UTC,,0,LOG,00000,"disconnection: session time: 0:15:18.829 user=postgres database=postgres host=[local]",,,,,,,,,"Patroni"
2018-01-26 12:00:14.844 UTC,"postgres","beta",493,"[local]",5a6b166b.1ed,1530,"COPY",2018-01-26 11:52:11 UTC,,0,LOG,00000,"disconnection: session time: 0:08:03.634 user=postgres database=beta host=[local]",,,,,,,,,"pg_restore"
2018-01-26 12:00:14.848 UTC,,,61,,5a6b14b6.3d,5,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"worker process: logical replication launcher (PID 73) exited with exit code 1",,,,,,,,,""
2018-01-26 12:00:14.855 UTC,,,61,,5a6b14b6.3d,6,,2018-01-26 11:44:54 UTC,,0,LOG,00000,"worker process: bg_mon (PID 72) exited with exit code 1",,,,,,,,,""
2018-01-26 12:00:14.862 UTC,,,1400,"[local]",5a6b184e.578,1,"",2018-01-26 12:00:14 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-01-26 12:00:14.862 UTC,,,1400,"[local]",5a6b184e.578,2,"",2018-01-26 12:00:14 UTC,,0,LOG,00000,"PID 493 in cancel request did not match any process",,,,,,,,,""
Here are the Patroni logs at the time of the timeout:
2018-01-26 11:53:18,189 INFO: does not have lock
2018-01-26 11:53:18,222 INFO: no action. i am a secondary and i am following a leader
2018-01-26 11:53:29,178 INFO: Lock owner: patroni-0; I am patroni-1
2018-01-26 11:53:29,178 INFO: does not have lock
2018-01-26 11:53:29,271 INFO: no action. i am a secondary and i am following a leader
2018-01-26 11:53:41,770 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:41,770 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:45,119 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:45,119 WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:48,471 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:48,471 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=3.3333333333333335)",)': /api/v1/namespaces/default/endpoints?labelSelector=cluster%3Dpatroni%2Capplication%3Dpatroni%2Capp%3Dpatroni%2Crelease%3Dpatroni
2018-01-26 11:53:51,821 ERROR: get_cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.5/ssl.py", line 791, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.5/ssl.py", line 575, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
...
After pg_restore crashes, the API server comes back and works normally. Turning off pgq and autovacuum, or using patronictl pause, didn’t seem to resolve this.
This raises several questions:
- What causes the API server to stop responding? Is Patroni making a lot of requests to it? I checked under normal conditions and didn’t see any issues, and the server always had resources like CPU and RAM to spare.
- When the API server isn’t responding, why does Patroni restart the database? Couldn’t it just wait until the server comes back without taking any action?
- Should we increase the timeout period?
- Is there any other method to restore the backup?
Top GitHub Comments
Every pod is doing 2 read requests (list pods + list endpoints) every 10 seconds, plus 1 write request (update pod). In addition to that, the master is doing an update-leader-endpoint write request. That’s not very much for the k8s API.
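(For a sense of scale, those reads correspond roughly to the following queries; the label selector is copied from the Patroni log above and the namespace is default as in this deployment:)

```bash
# Roughly what each Patroni pod reads from the API server every loop_wait (~10 s):
kubectl get pods -n default \
  -l cluster=patroni,application=patroni,app=patroni,release=patroni
kubectl get endpoints -n default \
  -l cluster=patroni,application=patroni,app=patroni,release=patroni
```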
When Patroni can’t access the API, it has no other option than to demote the master to read-only, because it doesn’t know what is going on. It could be that the API is not accessible from a specific Pod due to a network partition, while meanwhile some other Pod sees that there is no master and promotes.
It’s possible to increase retry_timeout and ttl by calling
patronictl edit-config
on one of the Pods. There is one rule you should follow when changing ttl, loop_wait and retry_timeout:
ttl >= loop_wait + retry_timeout * 2
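A minimal sketch of such a change (the values are only an example that satisfies the rule above, not a recommendation; the -s/--set flag assumes a patronictl version that supports it, otherwise run patronictl edit-config and adjust the values interactively):

```bash
# Example values only: loop_wait + retry_timeout*2 = 10 + 2*20 = 50 <= ttl = 60
patronictl edit-config -s ttl=60 -s loop_wait=10 -s retry_timeout=20
```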
It’s hard to know for sure, but I think that it’s definitely possible.
This looks like CPU starvation of the pod running Patroni. I don’t think we can do anything here, so closing the issue.
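If you want to verify the CPU-starvation theory, a rough way to check it with kubectl (a sketch; kubectl top requires metrics-server or heapster, and the pod name patroni-1 is taken from the logs above):

```bash
# Current CPU/memory usage of the Patroni pods:
kubectl top pod -n default -l application=patroni

# CPU/memory requests and limits configured on the demoted replica:
kubectl describe pod patroni-1 -n default | grep -i -E -A3 'limits|requests'
```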