Skip to content

Follower Stuck in handle_fetch_data_response Due to Prolonged EAGAIN Errors #815

@Besroy

Description

@Besroy

During the sddz storage hammer test with a 4M blob size, a corner case was identified on the follower side. It is unclear if this issue needs to be handled in the future. The details are as follows:

The follower became stuck in the handle_fetch_data_response phase due to repeated EAGAIN errors, which prevented it from completing end_of_append_batch and committing subsequent logs. The root cause of the repeated EAGAIN errors remains unknown, and the issue persisted for over two days before resolving itself.

Below are the relevant logs:

[09/23/25 01:40:32.992] [storage_mgr] [trace] [66] [raft_state_machine.cpp:59:localize_journal_entry_prepare] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Localizing Raft log_entry: server_id=1424293988, term=1, journal_entry=[version=1.1, code=HS_DATA_LINKED, server_id=1424293988, dsn=77473, header_size=48, key_size=8, value_size=8]
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [hs_blob_manager.cpp:431:blob_put_get_blk_alloc_hints] [traceID=14355508641916858133,shardID=0x1f4,pg=0,shard=0x1f4,blob=76470] Picked p_chunk_id=309, reserved_blks=16384
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [raft_repl_dev.cpp:1052:applier_create_req] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a]
[09/23/25 01:40:32.995] [storage_mgr] [trace] [66] [raft_repl_dev.cpp:1222:fetch_data_from_remote] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Fetching data from originator=1424293988, remote: rreq=[repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]], remote_blkid=[{blk#=239853 count=1025 chunk=309},], my server_id=556276154
[09/23/25 01:40:33.305] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1405:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Handling fetched data for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.315] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1423:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: the sha256 value of the data received in handle_fetch_data_response for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]] is 9974290253907391527
[09/23/25 01:40:33.315] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1453:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data fetched from remote: rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]

[09/23/25 01:40:33.445] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0

[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [common.cpp:176:set_lsn] [traceID=14355508641916858133] Setting lsn=77477 for request=repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=77477 state=[BLK_ALLOCATED | DATA_RECEIVED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [repl_log_store.cpp:22:append] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Received append log entry rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [raft_state_machine.cpp:192:pre_commit_ext] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Precommit rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]

[09/23/25 01:40:34.205] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
...

[09/25/25 20:51:42.562] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0

[09/25/25 20:51:42.736] [storage_mgr] [trace] [74] [raft_repl_dev.cpp:2121:flush_durable_commit_lsn] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Flushing durable commit lsn to 77420
[09/25/25 20:51:42.736] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1113:operator()] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] 64 pending reqs's data are written
[09/25/25 20:51:42.736] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1447:operator()] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data Write completed rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | ]], data_write_latency_us=241869429741, total_write_latency_us=241869171509, write_num_pieces=1
[09/25/25 20:51:42.737] [storage_mgr] [trace] [66] [home_raft_log_store.cpp:212] [replstore=0] end_of_append_batch flushed upto start=77421 cnt=64 lsn=77484
[09/25/25 20:51:42.737] [storage_mgr] [warning] [66] [handle_append_entries.cxx:1101:handle_append_entries] appending entries from peer 1424293988 took long time (241869745 ms)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions