
[DocDB] yb-master shown healthy when it's not #28675

@vvosadchy

Jira Link: DB-18374

Description

Steps to reproduce (steps 3-6 are also collected into a scripted sketch further below):

  1. Start a group of 3 masters:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node1/data \
    --rpc_bind_addresses=127.0.0.1:7100

sudo ifconfig lo0 alias 127.0.0.2

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node2/data \
    --rpc_bind_addresses=127.0.0.2:7100

sudo ifconfig lo0 alias 127.0.0.3

./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  2. Check that they are healthy:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters                                       
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	FOLLOWER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
240ce9373a8a42d18b9efa7e44021969 	127.0.0.3:7100       	ALIVE    	LEADER 	N/A
  3. Stop node3 and clear its data:
rm -fr $HOME/yugabyte/node3/data/yb-data/*
  4. Start it again:
./bin/yb-master \
    --master_addresses=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100
  5. Check the list of masters:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 list_all_masters
Master UUID                      	RPC Host/Port        	State    	Role 	Broadcast Host/Port 
af08844be93d4cdf9e0b94858fe33675 	127.0.0.1:7100       	ALIVE    	LEADER 	N/A                 
8bff6598e2624fbdbd20000c5dde8f0f 	127.0.0.2:7100       	ALIVE    	FOLLOWER 	N/A                 
6e9269eaa24740eaa5bc7bccda343917 	127.0.0.3:7100       	ALIVE    	FOLLOWER 	N/A 

node3 looks like a healthy FOLLOWER.

  6. But if you try to promote it to LEADER:
% ./bin/yb-admin --master_addresses 127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100 master_leader_stepdown 6e9269eaa24740eaa5bc7bccda343917
E0923 21:02:23.128075 47841792 yb-admin_client.cc:729] LeaderStepDown for af08844be93d4cdf9e0b94858fe33675received error code: LEADER_NOT_READY_TO_STEP_DOWN status { code: ILLEGAL_STATE message: "Suggested peer is not caught up yet" source_file: "../../src/yb/consensus/raft_consensus.cc" source_line: 851 errors: "\000" }
Error running master_leader_stepdown: Illegal state (yb/consensus/raft_consensus.cc:851): Suggested peer is not caught up yet

It turns out the node is not actually healthy.
It remains in this state indefinitely, i.e. it never catches up.
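
For convenience, here are steps 3-6 collected into a single shell sketch. It only repeats the commands already shown above; the pkill invocation used to stop node3, the backgrounding of yb-master, the sleep, and the awk parsing of the list_all_masters output are my own additions and may need adjusting for your environment.

MASTERS=127.0.0.1:7100,127.0.0.2:7100,127.0.0.3:7100

# Step 3: stop node3 and clear its data (pkill is just one way to stop it)
pkill -f 'yb-master.*rpc_bind_addresses=127.0.0.3:7100'
rm -fr $HOME/yugabyte/node3/data/yb-data/*

# Step 4: start it again (in the background here)
./bin/yb-master \
    --master_addresses=$MASTERS \
    --fs_data_dirs=$HOME/yugabyte/node3/data \
    --rpc_bind_addresses=127.0.0.3:7100 &
sleep 5   # rough wait for the restarted master to register

# Step 5: node3 is listed as ALIVE / FOLLOWER
./bin/yb-admin --master_addresses $MASTERS list_all_masters

# Step 6: but a step-down to node3 fails with "Suggested peer is not caught up yet"
NODE3_UUID=$(./bin/yb-admin --master_addresses $MASTERS list_all_masters \
             | awk '$2 == "127.0.0.3:7100" {print $1}')
./bin/yb-admin --master_addresses $MASTERS master_leader_stepdown "$NODE3_UUID"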

This is very misleading and can cause serious trouble if you keep operating the cluster in this state.
For example, if you then replace the disk of another yb-master, the cluster metadata becomes unavailable, presumably because the yb-master Raft group loses its quorum (with 3 masters a majority of 2 is required, and with node3 silently broken plus one more master down, only 1 of 3 remains).

Expected behavior:
Such a yb-master node should be shown as unhealthy in the masters list.

Issue Type

kind/bug

