Failover stuck in "report_lsn" and "fast_forward" states indefinitely. #1060


We ran into the following issue:

A failover was initiated in our Postgres cluster, but the old primary is stuck indefinitely in the report_lsn and fast_forward states.

Incident events:
======================
1. The primary was unable to contact the standbys.

2. The monitor was unable to contact any of the nodes.

3. Topology:
	- Primary (NODE_A1) in Datacenter A (candidate priority = 51 / replication_quorum = true)
	- Standby (NODE_A2) in Datacenter A (candidate priority = 50 / replication_quorum = true)
	- Standby (NODE_B1) in Datacenter B (candidate priority = 50 / replication_quorum = false)
	- Standby (NODE_B2) in Datacenter B (candidate priority = 50 / replication_quorum = false)
	- Standby (NODE_C1) in Datacenter C (candidate priority = 0 / replication_quorum = true)
	- Monitor (NODE_CM) in Datacenter C

	- number_sync_standbys = 1

4. Network partitioning:

	- The primary lost its connection to all standbys and to the monitor.
	- The monitor was intermittently losing its connection to all the nodes.
	
5. Failover scenario:

	1. The monitor requested the primary to move to the "draining" state.
	2. The primary drained and restarted.
	3. The primary won the leader election, with candidate priority 51.
	4. The primary is stuck with "fast_forward" as its assigned state.
	5. The primary instance keeps failing with "ERROR Failed to fetch WAL bytes from standby node"
		(the primary keeps trying to fetch WAL bytes from itself, as seen from the IP in the logs below).
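To make the topology and step 3 concrete, here is a minimal Python model of the nodes listed above. It is only a sketch for discussion, not pg_auto_failover code: `Node`, `pick_candidate`, and `quorum_nodes` are names we made up, and the model looks at candidate priority only, ignoring the reported LSN positions that the monitor also takes into account.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    datacenter: str
    candidate_priority: int
    replication_quorum: bool


# Postgres nodes from the topology above; the monitor (NODE_CM) is not a candidate.
NODES = [
    Node("NODE_A1", "A", 51, True),
    Node("NODE_A2", "A", 50, True),
    Node("NODE_B1", "B", 50, False),
    Node("NODE_B2", "B", 50, False),
    Node("NODE_C1", "C", 0, True),
]


def pick_candidate(nodes: List[Node]) -> Optional[Node]:
    """Return the node with the highest non-zero candidate priority, if any.

    Simplification: priority only; LSN positions are ignored here.
    """
    eligible = [n for n in nodes if n.candidate_priority > 0]
    return max(eligible, key=lambda n: n.candidate_priority, default=None)


def quorum_nodes(nodes: List[Node]) -> List[str]:
    """Return the names of the nodes with replication_quorum = true."""
    return [n.name for n in nodes if n.replication_quorum]


if __name__ == "__main__":
    winner = pick_candidate(NODES)
    print(winner.name if winner else "no candidate")
    print(quorum_nodes(NODES))
```

With these settings the model selects NODE_A1, which matches what we observed: the old primary has the highest candidate priority, so it is elected again once it rejoins.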

pg_autoctl logs:
================

```
INFO  FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
INFO  Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0

INFO  Stopping Postgres at "/database/pgdata"
INFO  Stopping pg_autoctl postgres service
INFO  /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast
WARN  Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR       Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2674 ms, pg_autoctl stops retrying now
ERROR Failed to setup standby mode: can't connect to the primary. See above for details
ERROR Failed to setup Postgres as a standby, after rewind
ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
WARN  Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR       Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.
ERROR Failed to transition to state "fast_forward", retrying...

INFO  Monitor assigned new state "fast_forward"
INFO   /var/lib/postgresql/bin/postgres -D /database/pgdata -p 5432 -h *
WARN  PostgreSQL was not running, restarted with pid 2108617
WARN  PostgreSQL was not running, restarted with pid 2108617
INFO  FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
INFO  Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0
INFO  Stopping Postgres at "/database/pgdata"
INFO  Stopping pg_autoctl postgres service
INFO  /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast

WARN  Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR       Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2680 ms, pg_autoctl stops retrying now
ERROR Failed to setup standby mode: can't connect to the primary. See above for details

ERROR Failed to setup Postgres as a standby, after rewind
ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
WARN  Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
ERROR       Is the server running on that host and accepting TCP/IP connections?
ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now

ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.
```
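The self-reference is visible in the logs themselves: the node is told to fetch WAL from upstream node 4, and it connects with `application_name=pgautofailover_standby_4`. The small Python helper below is how we would flag this pattern in a log file. It is a sketch under one assumption, namely that the number in `pgautofailover_standby_N` is the local node's own id; `find_self_fetch` and both regular expressions are ours, not part of pg_autoctl.

```python
import re
from typing import Iterable, Optional

# Matches: Fetching WAL from upstream node 4 "NODE_A1" (...)
UPSTREAM_RE = re.compile(r'Fetching WAL from upstream node (\d+) "([^"]+)"')
# Assumption: the numeric suffix of the application name is the local node id.
APPNAME_RE = re.compile(r"application_name=pgautofailover_standby_(\d+)")


def find_self_fetch(lines: Iterable[str]) -> Optional[str]:
    """Return the upstream node name if a node tries to fetch WAL from itself.

    Remembers the most recent upstream node id, then compares it with the
    node id found in later replication connection strings.
    """
    upstream_id = None
    upstream_name = None
    for line in lines:
        match = UPSTREAM_RE.search(line)
        if match:
            upstream_id, upstream_name = match.group(1), match.group(2)
            continue
        match = APPNAME_RE.search(line)
        if match and upstream_id is not None and match.group(1) == upstream_id:
            return upstream_name
    return None


if __name__ == "__main__":
    sample = [
        'INFO  Fetching WAL from upstream node 4 "NODE_A1" '
        "(<<primary_IP>>:5432)up to LSN 1284/700000A0",
        'WARN  Failed to connect to "postgres://pgautofailover_replicator@'
        "<<primary_IP>>:5432/?application_name=pgautofailover_standby_4"
        '&sslmode=require&replication=1", retrying until the server is ready',
    ]
    print(find_self_fetch(sample))
```

Run against the two sample lines taken from the logs above, the helper reports NODE_A1 as both the fetching node and the upstream.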
		



Questions:

	1. Once the network issue is resolved, why does the old primary (NODE_A1), while becoming the new primary, try to fetch WAL bytes from itself and go into an infinite loop?

	2. Since the old primary and the new primary are the same node, can the WAL fetch be skipped safely?

	3. What is causing the infinite loop?
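To illustrate what we mean in question 2, here is a hypothetical guard written in Python. It does not claim that skipping is safe in pg_auto_failover and it is not taken from its source; `parse_lsn` and `needs_fast_forward` are names we invented, and node id 2 with LSN 1284/6FFFFF00 in the usage lines is a made-up value for comparison. Only node id 4 and LSN 1284/700000A0 come from our logs.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '1284/700000A0' to a 64-bit integer.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both in hexadecimal.
    """
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def needs_fast_forward(
    candidate_id: int,
    candidate_lsn: str,
    upstream_id: int,
    upstream_lsn: str,
) -> bool:
    """Decide whether the candidate must fetch WAL before promotion.

    Hypothetical rule: a node never fetches from itself, and otherwise
    fetches only when it is behind the chosen upstream.
    """
    if candidate_id == upstream_id:
        return False
    return parse_lsn(candidate_lsn) < parse_lsn(upstream_lsn)


if __name__ == "__main__":
    # Our incident: candidate and upstream are both node 4.
    print(needs_fast_forward(4, "1284/700000A0", 4, "1284/700000A0"))
    # Hypothetical case: node 2 is behind node 4 and would need to fetch.
    print(needs_fast_forward(2, "1284/6FFFFF00", 4, "1284/700000A0"))
```

In the first call the guard returns False because the candidate and the upstream are the same node, which is the behavior we are asking about; the second call returns True because the hypothetical node 2 is behind.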


We would appreciate any help resolving this issue, or any input you can give.


Thanks in advance, 
Nijesh

