
Failover stuck in "report_lsn" and "fast_forward" states indefinitely. #1060

@nijesh8


We ran into the following issue:

A failover was initiated in our Postgres cluster, but the old primary got stuck indefinitely in the report_lsn and fast_forward states.

Incident events:

  1. Primary unable to contact Standbys.

  2. Monitor unable to contact all the nodes.

  3. Topology

    • Primary (NODE_A1) in Datacenter A (candidate priority = 51 / replication_quorum = true)

    • Standby (NODE_A2) in Datacenter A (candidate priority = 50 / replication_quorum = true)

    • Standby (NODE_B1) in Datacenter B (candidate priority = 50 / replication_quorum = false)

    • Standby (NODE_B2) in Datacenter B (candidate priority = 50 / replication_quorum = false)

    • Standby (NODE_C1) in Datacenter C (candidate priority = 0 / replication_quorum = true)

    • Monitor (NODE_CM) in Datacenter C

    • number_sync_standbys = 1

  4. Network partitioning

    • The primary lost its connection to all standbys and to the monitor.
    • The monitor intermittently lost its connection to all the nodes.
  5. Failover scenario:

    1. The monitor requested the primary to go to the draining state.
    2. The primary drained and restarted.
    3. The primary won the leader election, with candidate priority 51.
    4. The primary got stuck with "fast_forward" as its assigned state.
    5. The primary instance keeps failing with "ERROR Failed to fetch WAL bytes from standby node"
      (the primary keeps trying to fetch WAL bytes from itself, as the IP address in the logs below shows).
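
To illustrate why NODE_A1 comes out on top in step 3, here is a simplified sketch of a priority-only election over the topology listed above. It is not pg_auto_failover's actual election logic: it assumes every node, including the old primary, takes part, and it ignores reported LSNs entirely. The `Node` class and `pick_candidate` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    datacenter: str
    candidate_priority: int
    replication_quorum: bool


# Topology as described in this issue.
NODES = [
    Node("NODE_A1", "A", 51, True),   # old primary
    Node("NODE_A2", "A", 50, True),
    Node("NODE_B1", "B", 50, False),
    Node("NODE_B2", "B", 50, False),
    Node("NODE_C1", "C", 0, True),    # priority 0: never a promotion candidate
]


def pick_candidate(nodes):
    """Hypothetical, priority-only election: highest candidate priority wins.

    Nodes with candidate priority 0 are excluded. Reported LSNs are ignored
    here, which is a simplification.
    """
    eligible = [n for n in nodes if n.candidate_priority > 0]
    if not eligible:
        return None
    return max(eligible, key=lambda n: n.candidate_priority)


winner = pick_candidate(NODES)
print(winner.name)  # NODE_A1
```

With candidate priority 51 against 50 for the other eligible nodes, the old primary is always selected under this simplified rule, which matches what we observed.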

pg_autoctl logs:

` INFO FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
INFO Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0

	INFO  Stopping Postgres at "/database/pgdata"
	INFO  Stopping pg_autoctl postgres service
	INFO  /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast
	WARN  Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
	ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
	ERROR       Is the server running on that host and accepting TCP/IP connections?
	ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2674 ms, pg_autoctl stops retrying now
	ERROR Failed to setup standby mode: can't connect to the primary. See above for details
	ERROR Failed to setup Postgres as a standby, after rewind
	ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
	WARN  Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
	ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
	ERROR       Is the server running on that host and accepting TCP/IP connections?
	ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
	ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
	ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.
	ERROR Failed to transition to state "fast_forward", retrying...

	INFO  Monitor assigned new state "fast_forward"
	INFO   /var/lib/postgresql/bin/postgres -D /database/pgdata -p 5432 -h *
	WARN  PostgreSQL was not running, restarted with pid 2108617
	WARN  PostgreSQL was not running, restarted with pid 2108617
	INFO  FSM transition from "report_lsn" to "fast_forward": Fetching missing WAL bits from another standby before promotion
	INFO  Fetching WAL from upstream node 4 "NODE_A1" (<<primary_IP>>:5432)up to LSN 1284/700000A0
	INFO  Stopping Postgres at "/database/pgdata"
	INFO  Stopping pg_autoctl postgres service
	INFO  /var/lib/postgresql/bin/pg_ctl --pgdata /database/pgdata --wait stop --mode fast

	WARN  Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1", retrying until the server is ready
	ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
	ERROR       Is the server running on that host and accepting TCP/IP connections?
	ERROR Failed to connect to "postgres://pgautofailover_replicator@<<primary_IP>>:5432/?application_name=pgautofailover_standby_4&sslmode=require&replication=1" after 7 attempts in 2680 ms, pg_autoctl stops retrying now
	ERROR Failed to setup standby mode: can't connect to the primary. See above for details

	ERROR Failed to setup Postgres as a standby, after rewind
	ERROR Failed to setup replication from upstream node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
	WARN  Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?", retrying until the server is ready
	ERROR Connection to database failed: connection to server at "<<primary_IP>>", port 5432 failed: Connection refused
	ERROR       Is the server running on that host and accepting TCP/IP connections?
	ERROR Failed to connect to "postgres://@<<primary_IP>>:5432/postgres?" after 2 attempts in 2000 ms, pg_autoctl stops retrying now
	ERROR Failed to fetch WAL bytes from standby node 4 "NODE_A1" (<<primary_IP>>:5432), see above for details
	ERROR Failed to transition from state "report_lsn" to state "fast_forward", see above.

`
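
The self-reference can be checked mechanically from the logs above. The sketch below compares the upstream node id in the "Fetching WAL from upstream node" line with the node id embedded in the `application_name` of the replication connection string. It assumes that the `pgautofailover_standby_<id>` suffix identifies the local (connecting) node; the function name `find_self_fetch` is made up for this example.

```python
import re

# Upstream node id and name, as logged by the fast_forward transition.
UPSTREAM_RE = re.compile(r'Fetching WAL from upstream node (\d+) "([^"]+)"')
# Assumption: the application_name suffix is the id of the local node.
LOCAL_RE = re.compile(r"application_name=pgautofailover_standby_(\d+)")


def find_self_fetch(log_text):
    """Return (node_id, node_name) if the log shows a node fetching WAL
    from itself, otherwise None."""
    upstream = UPSTREAM_RE.search(log_text)
    local = LOCAL_RE.search(log_text)
    if upstream is None or local is None:
        return None
    if upstream.group(1) == local.group(1):
        return int(upstream.group(1)), upstream.group(2)
    return None


# Two lines copied from the logs in this issue.
sample = (
    'INFO Fetching WAL from upstream node 4 "NODE_A1" '
    "(<<primary_IP>>:5432)up to LSN 1284/700000A0\n"
    'WARN  Failed to connect to "postgres://pgautofailover_replicator@'
    "<<primary_IP>>:5432/?application_name=pgautofailover_standby_4"
    '&sslmode=require&replication=1", retrying until the server is ready\n'
)

print(find_self_fetch(sample))  # (4, 'NODE_A1')
```

Both ids are 4 in our logs, so under the assumption above node 4 (NODE_A1) is being told to fetch WAL from node 4.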

Questions:

1. Once the network issue is resolved, why does the old primary (NODE_A1), while becoming the new primary, try to fetch WAL bytes from itself and go into an infinite loop?

2. Since the old and new primary are the same node, can the WAL fetch be skipped safely?

3. What is causing the infinite loop?

We would appreciate any help resolving this issue, or any input you can give.

Thanks in advance,
Nijesh
