
ERROR: Error when fetching backup: pg_basebackup exited with code=1

See original GitHub issue

Hi, can someone please guide me on this? I am configuring Postgres 12.1 with Patroni in a cluster with 7 nodes. Every time I scale up, I end up in this situation: the master/leader starts and works, but the slaves/replicas either stop or get stuck in 'creating replica' mode.

root@103d123f2d5c:/bp2/src# patronictl -c pg_patroni.yml list
+---------+--------+----------------+--------+------------------+----+-----------+
| Cluster | Member |      Host      |  Role  |      State       | TL | Lag in MB |
+---------+--------+----------------+--------+------------------+----+-----------+
|  blue0  |  pg_0  | 127.0.0.1:5432 | Leader |     running      |  1 |           |
|  blue0  |  pg_1  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_2  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_3  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_4  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_5  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_6  | 127.0.0.1:5432 |        | creating replica |    |   unknown |
+---------+--------+----------------+--------+------------------+----+-----------+

The logs constantly show errors such as these:

2020-01-02 23:04:27,006 DEBUG: Sending request(xid=717): SetData(path='/bp/blue0/members/pg_1', data=b'{"conn_url":"postgres://127.0.0.1:5432/postgres","api_url":"http://127.0.0.1:8008/patroni","state":"stopped","role":"uninitialized","version":"1.6.3"}', version=-1)
2020-01-02 23:04:27,011 DEBUG: Received response(xid=717): ZnodeStat(czxid=12885154300, mzxid=12885161237, ctime=1577999766021, mtime=1578006267006, version=652, cversion=0, aversion=0, ephemeralOwner=31334655534829047, dataLength=150, numChildren=0, pzxid=12885154300)
2020-01-02 23:04:27,011 INFO: trying to bootstrap from leader 'pg_0'
2020-01-02 23:04:27,025 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:27,025 WARNING: Trying again in 5 seconds
2020-01-02 23:04:32,037 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:32,037 ERROR: failed to bootstrap from leader 'pg_0'
2020-01-02 23:04:32,037 INFO: Removing data directory: /bp2/data/psql
2020-01-02 23:04:37,004 INFO: Lock owner: pg_0; I am pg_1
2020-01-02 23:04:37,006 DEBUG: Sending request(xid=718): SetData(path='/bp/blue0/members/pg_1', data=b'{"conn_url":"postgres://127.0.0.1:5432/postgres","api_url":"http://127.0.0.1:8008/patroni","state":"stopped","role":"uninitialized","version":"1.6.3"}', version=-1)
2020-01-02 23:04:37,011 DEBUG: Received response(xid=718): ZnodeStat(czxid=12885154300, mzxid=12885161248, ctime=1577999766021, mtime=1578006277007, version=653, cversion=0, aversion=0, ephemeralOwner=31334655534829047, dataLength=150, numChildren=0, pzxid=12885154300)
2020-01-02 23:04:37,012 INFO: trying to bootstrap from leader 'pg_0'
2020-01-02 23:04:37,022 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:37,023 WARNING: Trying again in 5 seconds
2020-01-02 23:04:42,036 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:42,036 ERROR: failed to bootstrap from leader 'pg_0'
2020-01-02 23:04:42,036 INFO: Removing data directory: /bp2/data/psql
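At this log level Patroni does not surface pg_basebackup's own stderr, so one way to see the underlying failure is to run an equivalent base backup by hand from the replica host. A minimal sketch, assuming the leader address and replication credentials from the config below (the target directory is illustrative):

```shell
# Run the same kind of base backup Patroni attempts during bootstrap.
# -X stream also fetches WAL; -R writes replica recovery settings; -v is verbose.
PGPASSWORD=bpadminpw /usr/lib/postgresql/12/bin/pg_basebackup \
  -h 127.0.0.1 -p 5432 -U bpadmin \
  -D /tmp/basebackup_test -X stream -R -v
echo "pg_basebackup exit code: $?"
```

Run manually like this, the actual cause (for example a pg_hba.conf rejection or an authentication failure) is printed directly, instead of only the exit code that Patroni reports.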

and here is my YAML file:

scope: blue0
namespace: /bp/
name: pg_1

log:
  level: DEBUG
  traceback_level: debug
  dir: /bp2/log/
   
restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008

zookeeper:
  hosts: zookeeper:2181

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
  initdb: 
    - encoding: UTF8
    - data-checksums
  pg_hba: 
    - host replication replicator 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
  users:
    sbpadmin:
        password: sbpadminpw
        options:
            - createrole
            - createdb
    bpadmin:
        password: bpadminpw
        options:
            - replication
    wbpadmin:
        password: wbpadminpw
        options:
            - rewind

postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
  config_dir: /bp2/data/psql
  data_dir: /bp2/data/psql
  bin_dir: /usr/lib/postgresql/12/bin/

  pgpass: /bp2/log/pgpass
  authentication:
    replication:
      username: bpadmin
      password: bpadminpw
    superuser:
      username: sbpadmin
      password: sbpadminpw
    rewind:
      username: wbpadmin
      password: wbpadminpw
  parameters:
    unix_socket_directories: '.'
tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
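One plausible reading of the failure given this config: the pg_hba list only allows replication for a user called replicator, while the replication credentials Patroni hands to pg_basebackup are for bpadmin, so the connection is rejected. The mismatch can be illustrated with a small self-contained sketch (a hypothetical helper, not part of Patroni):

```python
# Minimal sketch (not Patroni code): check that the user Patroni will use for
# replication actually appears in the pg_hba "replication" entries.
pg_hba = [
    "host replication replicator 127.0.0.1/32 md5",
    "host all all 0.0.0.0/0 md5",
]
replication_user = "bpadmin"  # postgresql.authentication.replication.username

def users_allowed_for_replication(hba_lines):
    """Collect usernames from host-type entries for the replication database."""
    users = set()
    for line in hba_lines:
        fields = line.split()
        # host-type entry layout: type, database, user, address, method
        if len(fields) >= 5 and fields[1] == "replication":
            users.add(fields[2])
    return users

allowed = users_allowed_for_replication(pg_hba)
print(allowed)                      # {'replicator'}
print(replication_user in allowed)  # False -> pg_basebackup auth will fail
```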

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10

Top GitHub Comments

2 reactions · sandeepkalra commented, Jan 10, 2020

Thanks. I think I found the issue, though I don't know if my fix is correct. There were 2 problems:

  1. There was no user called 'replicator'.
  2. The pg_hba entry wasn't correct (permission-wise).

I changed the pg_hba block to the following:
pg_hba:
  - host replication $rep_username $localhost_ip/32 md5
  - host all $rep_username $localhost_ip/32 md5
  - host replication $rep_username 0.0.0.0/0 md5
  - host all $rep_username 0.0.0.0/0 md5
  - host all all 0.0.0.0/0 md5

where $rep_username and $localhost_ip were replaced with the correct username and host IP.

After this, the replication completes for replicas.
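Applied to the original config above, that fix would amount to something like the following (using bpadmin, the user actually configured under authentication.replication; the addresses are illustrative):

```yaml
  pg_hba:
    - host replication bpadmin 127.0.0.1/32 md5
    - host replication bpadmin 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
```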

Thanks,

1 reaction · CyberDem0n commented, Jan 4, 2020

Why do you set listen and connect_address to 127.0.0.1? How will the nodes discover each other?
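For context on this comment: each Patroni member publishes its conn_url from connect_address, so if every node advertises 127.0.0.1:5432, a replica bootstrapping "from the leader" ends up connecting to itself. A sketch of per-node settings (the hostnames are illustrative):

```yaml
# On node pg_1 — each member advertises an address the others can reach:
restapi:
  listen: 0.0.0.0:8008
  connect_address: pg1.example.internal:8008

postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg1.example.internal:5432
```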
