-
Notifications
You must be signed in to change notification settings - Fork 24
Open
Description
During the sddz storage hammer test with a 4M blob size, a corner case was identified on the follower side. It is unclear if this issue needs to be handled in the future. The details are as follows:
The follower became stuck in the handle_fetch_data_response phase due to repeated EAGAIN errors, which prevented it from completing end_of_append_batch and committing subsequent logs. The root cause of the repeated EAGAIN errors remains unknown, and the issue persisted for over two days before resolving itself.
Below are the relevant logs:
[09/23/25 01:40:32.992] [storage_mgr] [trace] [66] [raft_state_machine.cpp:59:localize_journal_entry_prepare] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Localizing Raft log_entry: server_id=1424293988, term=1, journal_entry=[version=1.1, code=HS_DATA_LINKED, server_id=1424293988, dsn=77473, header_size=48, key_size=8, value_size=8]
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [hs_blob_manager.cpp:431:blob_put_get_blk_alloc_hints] [traceID=14355508641916858133,shardID=0x1f4,pg=0,shard=0x1f4,blob=76470] Picked p_chunk_id=309, reserved_blks=16384
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [raft_repl_dev.cpp:1052:applier_create_req] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a]
[09/23/25 01:40:32.995] [storage_mgr] [trace] [66] [raft_repl_dev.cpp:1222:fetch_data_from_remote] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Fetching data from originator=1424293988, remote: rreq=[repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]], remote_blkid=[{blk#=239853 count=1025 chunk=309},], my server_id=556276154
[09/23/25 01:40:33.305] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1405:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Handling fetched data for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.315] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1423:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: the sha256 value of the data received in handle_fetch_data_response for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]] is 9974290253907391527
[09/23/25 01:40:33.315] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1453:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data fetched from remote: rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.445] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [common.cpp:176:set_lsn] [traceID=14355508641916858133] Setting lsn=77477 for request=repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=77477 state=[BLK_ALLOCATED | DATA_RECEIVED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [repl_log_store.cpp:22:append] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Received append log entry rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [raft_state_machine.cpp:192:pre_commit_ext] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Precommit rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:34.205] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
...
[09/25/25 20:51:42.562] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/25/25 20:51:42.736] [storage_mgr] [trace] [74] [raft_repl_dev.cpp:2121:flush_durable_commit_lsn] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Flushing durable commit lsn to 77420
[09/25/25 20:51:42.736] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1113:operator()] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] 64 pending reqs's data are written
[09/25/25 20:51:42.736] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1447:operator()] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data Write completed rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | ]], data_write_latency_us=241869429741, total_write_latency_us=241869171509, write_num_pieces=1
[09/25/25 20:51:42.737] [storage_mgr] [trace] [66] [home_raft_log_store.cpp:212] [replstore=0] end_of_append_batch flushed upto start=77421 cnt=64 lsn=77484
[09/25/25 20:51:42.737] [storage_mgr] [warning] [66] [handle_append_entries.cxx:1101:handle_append_entries] appending entries from peer 1424293988 took long time (241869745 ms)
Metadata
Metadata
Assignees
Labels
No labels