Description
We came across the following issue:
A failover was initiated in our Postgres cluster, but the old primary is stuck indefinitely, cycling between the report_lsn and fast_forward states.
Incident events:
- Primary unable to contact standbys.
- Monitor unable to contact all the nodes.
- Topology
  - Primary (NODE_A1) in Datacenter A (candidate priority = 51 / replication_quorum = true)
  - Standby (NODE_A2) in Datacenter A (candidate priority = 50 / replication_quorum = true)
  - Standby (NODE_B1) in Datacenter B (candidate priority = 50 / replication_quorum = false)
  - Standby (NODE_B2) in Datacenter B (candidate priority = 50 / replication_quorum = false)
  - Standby (NODE_C1) in Datacenter C (candidate priority = 0 / replication_quorum = true)
  - Monitor (NODE_CM) in Datacenter C
  - number_sync_standbys = 1
- Network partitioning
  - The primary lost its connection to all standbys and to the monitor.
  - The monitor intermittently lost its connection to all the nodes.
- Failover scenario:
  - The monitor assigned the "draining" state to the primary.
  - The primary drained and restarted.
  - The primary won the leader election, with candidate priority 51.
  - The primary is stuck with "fast_forward" as its assigned state.
  - The primary instance keeps failing with "ERROR Failed to fetch WAL bytes from standby node".
    (The primary keeps trying to fetch WAL bytes from itself, as seen from the IP shown below.)
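
For readers who want to reason about the setup above, here is a minimal Python sketch of the topology. It only encodes the settings listed in this report. The `Node` class and the helper functions are hypothetical names for illustration, not part of pg_auto_failover. The selection rule is deliberately simplified: it looks at candidate priority only and ignores the LSN positions that the real election also takes into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Illustrative model of one Postgres node from the topology above."""
    name: str
    datacenter: str
    candidate_priority: int
    replication_quorum: bool
    is_primary: bool = False


# Values copied from the topology in this report.
NODES = [
    Node("NODE_A1", "A", 51, True, is_primary=True),
    Node("NODE_A2", "A", 50, True),
    Node("NODE_B1", "B", 50, False),
    Node("NODE_B2", "B", 50, False),
    Node("NODE_C1", "C", 0, True),
]
NUMBER_SYNC_STANDBYS = 1


def quorum_standbys(nodes):
    """Names of standbys that take part in the replication quorum."""
    return [n.name for n in nodes if n.replication_quorum and not n.is_primary]


def promotion_candidates(nodes):
    """Nodes with a non-zero candidate priority (0 means never promote)."""
    return [n for n in nodes if n.candidate_priority > 0]


def preferred_candidate(nodes):
    """Simplified rule: highest candidate priority wins (LSN is ignored)."""
    return max(promotion_candidates(nodes), key=lambda n: n.candidate_priority)


if __name__ == "__main__":
    print("quorum standbys:", quorum_standbys(NODES))
    print("preferred candidate:", preferred_candidate(NODES).name)
```

Under this simplified rule the old primary NODE_A1 is the preferred candidate because it has the highest priority (51), which matches the election result we observed.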
pg_autoctl logs:
```
INFO FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
INFO Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0
INFO Stopping Postgres at "/database/pgdata"
INFO Stopping pg_autoctl postgres service
INFO /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast
WARN Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2674 ms, pg_autoctl stops retrying now
ERROR Failed to setup standby mode: can't connect to the primary. See above for details
ERROR Failed to setup Postgres as a standby, after rewind
ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
WARN Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.
ERROR Failed to transition to state "fast_forward", retrying...
INFO Monitor assigned new state "fast_forward"
INFO /var/lib/postgresql/bin/postgres -D /database/pgdata -p 5432 -h *
WARN PostgreSQL was not running, restarted with pid 2108617
WARN PostgreSQL was not running, restarted with pid 2108617
INFO FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
INFO Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0
INFO Stopping Postgres at "/database/pgdata"
INFO Stopping pg_autoctl postgres service
INFO /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast
WARN Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2680 ms, pg_autoctl stops retrying now
ERROR Failed to setup standby mode: can't connect to the primary. See above for details
ERROR Failed to setup Postgres as a standby, after rewind
ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
WARN Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.
```
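
A quick way to confirm the self-reference in logs like these is to compare the upstream node ID in the "Fetching WAL from upstream node" line with the node ID that appears in the `application_name` of the replication connection string. The Python sketch below does that. It assumes that `pgautofailover_standby_<N>` in `application_name` carries the local node's ID, which is our reading of the logs and not something confirmed here. The function name `find_self_fetch` is made up for illustration.

```python
import re

# Matches: Fetching WAL from upstream node 4 "NODE_A1" ...
UPSTREAM_RE = re.compile(r'Fetching WAL from upstream node (\d+) "([^"]+)"')
# Matches: application_name=pgautofailover_standby_4
# Assumption: the numeric suffix is the ID of the local (connecting) node.
APP_NAME_RE = re.compile(r"application_name=pgautofailover_standby_(\d+)")


def find_self_fetch(log_text):
    """Return sorted node IDs that appear both as upstream and as local node."""
    upstream_ids = {int(m.group(1)) for m in UPSTREAM_RE.finditer(log_text)}
    local_ids = {int(m.group(1)) for m in APP_NAME_RE.finditer(log_text)}
    return sorted(upstream_ids & local_ids)


# Two lines taken from the log excerpt in this report.
SAMPLE = "\n".join([
    'INFO Fetching WAL from upstream node 4 "NODE_A1" '
    "(<<primary_IP>>:5432)up to LSN 1284/700000A0",
    'WARN Failed to connect to "postgres://pgautofailover_replicator@'
    "<<primary_IP>>:5432/?application_name=pgautofailover_standby_4"
    '&sslmode=require&replication=1", retrying until the server is ready',
])

if __name__ == "__main__":
    print("node IDs fetching WAL from themselves:", find_self_fetch(SAMPLE))
```

For the excerpt above this reports node 4, i.e. NODE_A1 is both the node being set up as a standby and the upstream it tries to fetch WAL from.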
Questions:
1. Once the network issue is resolved, why does the old primary (NODE_A1), while becoming the new primary, try to fetch WAL bytes from itself and go into an infinite loop?
2. Since the old and new primary nodes are the same, can the WAL fetch be skipped safely?
3. What is causing the infinite loop?
We would appreciate any help in resolving this issue, or any input you can give.
Thanks in advance,
Nijesh