Merged
Commits
24 commits
8fe9ebc
feat: timeout on onNext and onComplete
AlexKehayov Oct 16, 2025
17a495a
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 20, 2025
5c631e2
resolved post-merge conflicts
AlexKehayov Oct 20, 2025
297f835
minor docs fix
AlexKehayov Oct 20, 2025
582fc3b
changed timeout approach to blocking
AlexKehayov Oct 20, 2025
eac9149
introduced executor service for pipeline operations using virtual thr…
AlexKehayov Oct 21, 2025
00b00ca
added timeout for PbjGrpcClient creation
AlexKehayov Oct 22, 2025
54a364e
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 22, 2025
08b4fe8
spotless
AlexKehayov Oct 22, 2025
4c1ed7f
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 24, 2025
5839b60
resolve post-merge conflicts
AlexKehayov Oct 24, 2025
1529213
resolve post-merge conflicts
AlexKehayov Oct 24, 2025
ee293a1
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 27, 2025
42bc7c8
switched pipelineExecutor to constructor injection, improved tests, c…
AlexKehayov Oct 27, 2025
ea5dcb0
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 29, 2025
2b57923
resolved post-merge conflicts
AlexKehayov Oct 29, 2025
784617d
renamed blocknode-comms to block-node-comms in logs
AlexKehayov Oct 29, 2025
b3d6a3a
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Oct 29, 2025
7d4a9d0
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Nov 3, 2025
be16401
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Nov 3, 2025
76bb129
resolved post-merge conflicts
AlexKehayov Nov 3, 2025
db20bc6
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Nov 3, 2025
92a3622
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Nov 4, 2025
6a197f7
Merge branch 'main' into 21605-timeout-onnext-oncomplete
AlexKehayov Nov 4, 2025
4 changes: 2 additions & 2 deletions .github/workflows/support/citr/log4j2.xml
@@ -76,8 +76,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/compose/log4j2.xml
@@ -74,8 +74,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/dev/log4j2.xml
@@ -82,8 +82,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/mainnet/log4j2.xml
@@ -82,8 +82,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/preprod/log4j2.xml
@@ -82,8 +82,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/previewnet/log4j2.xml
@@ -82,8 +82,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
4 changes: 2 additions & 2 deletions hedera-node/configuration/testnet/log4j2.xml
@@ -82,8 +82,8 @@
<DefaultRolloverStrategy max="10"/>
</RollingFile>

-<RollingFile name="BlockNodeCommsFile" fileName="output/blocknode-comms.log"
-filePattern="output/blocknode-comms-%i.log">
+<RollingFile name="BlockNodeCommsFile" fileName="output/block-node-comms.log"
+filePattern="output/block-node-comms-%i.log">
<PatternLayout>
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p %-4L %c{1} - [%t] %m{nolookups}%n</pattern>
</PatternLayout>
49 changes: 48 additions & 1 deletion hedera-node/docs/design/app/blocks/BlockNodeConnection.md
@@ -6,7 +6,7 @@
2. [Definitions](#definitions)
3. [Component Responsibilities](#component-responsibilities)
4. [Component Interaction](#component-interaction)
5. [State Management](#state-management)
5. [Lifecycle](#lifecycle)
6. [State Machine Diagrams](#state-machine-diagrams)
7. [Error Handling](#error-handling)

@@ -29,6 +29,7 @@ It manages connection state, handles communication, and reports errors to the `B

- Establish and maintain the connection transport.
- Handle incoming and outgoing message flow.
- Detect unresponsive block nodes via configurable timeouts on pipeline operations.
- Report connection errors promptly.
- Coordinate with `BlockNodeConnectionManager` on lifecycle events.
- Notify the block buffer when a block has been acknowledged and is therefore eligible to be pruned.
@@ -129,6 +130,7 @@ stateDiagram-v2
ACTIVE --> CLOSING : ResendBlock unavailable
ACTIVE --> CLOSING : gRPC onError
ACTIVE --> CLOSING : Stream failure
ACTIVE --> CLOSING : Pipeline operation timeout
ACTIVE --> CLOSING : Manual close
ACTIVE --> ACTIVE : BlockAcknowledgement
ACTIVE --> ACTIVE : SkipBlock
@@ -227,4 +229,49 @@ The connection implements a configurable rate limiting mechanism for EndOfStream

<dt>blockNode.maxRequestDelay</dt>
<dd>The maximum amount of time between attempting to send block items to a block node, regardless of the number of items ready to send.</dd>

<dt>pipelineOperationTimeout</dt>
<dd>The maximum duration allowed for pipeline onNext() and onComplete() operations before the block node is considered unresponsive. Default: 30 seconds.</dd>
</dl>

### Pipeline Operation Timeout

To detect unresponsive block nodes during message transmission and connection establishment, the connection implements configurable timeouts for pipeline operations.

#### Timeout Behavior

Pipeline operations (`onNext()`, `onComplete()`, and pipeline creation) can block on I/O, so they run on a dedicated virtual thread executor and the caller enforces the timeout with `Future.get(timeout, unit)`. The executor is injected through the constructor, which allows flexible configuration and makes testing easier.

- **Pipeline creation timeout**: When establishing the gRPC connection via `createRequestPipeline()`, both the gRPC client creation and bidirectional stream setup are executed with timeout protection. If the operation does not complete within the configured timeout period:
  - The Future is cancelled to interrupt the blocked operation
  - The timeout metric is incremented
  - A `RuntimeException` is thrown with the underlying `TimeoutException`
  - The connection remains in UNINITIALIZED state
  - The connection manager's error handling will schedule a retry with exponential backoff
- **onNext() timeout**: When sending block items via `sendRequest()`, the operation is submitted to the connection's dedicated executor and the calling thread blocks waiting for completion with a timeout. If the operation does not complete within the configured timeout period:
  - The Future is cancelled to interrupt the blocked operation
  - The timeout metric is incremented
  - `handleStreamFailure()` is triggered (only if connection is still ACTIVE)
  - The connection follows standard failure handling with exponential backoff retry
  - The connection manager will select a different block node for the next attempt if one is available
  - `TimeoutException` is caught and handled internally
- **onComplete() timeout**: When closing the stream via `closePipeline()`, the operation is submitted to the same dedicated executor with the same timeout mechanism. If the operation does not complete within the configured timeout period:
  - The Future is cancelled to interrupt the blocked operation
  - The timeout metric is incremented
  - Since the connection is already in CLOSING state, only the timeout is logged
  - The connection completes the close operation normally
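
The submit, wait, and cancel sequence that the three cases above share can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `BlockNodeConnection` code: the class `PipelineTimeoutSketch`, the method `runWithTimeout`, and the `AtomicLong` counter standing in for the timeout metric are all hypothetical names. The demo injects a cached thread pool so that it runs on older JDKs; the design above describes a virtual thread executor in production.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper illustrating timeout enforcement; not the real connection class. */
final class PipelineTimeoutSketch {
    private final ExecutorService pipelineExecutor; // injected, as in the design above
    private final Duration timeout;
    private final AtomicLong timeoutCount = new AtomicLong(); // stand-in for the timeout metric

    PipelineTimeoutSketch(ExecutorService pipelineExecutor, Duration timeout) {
        this.pipelineExecutor = pipelineExecutor;
        this.timeout = timeout;
    }

    /** Runs the operation on the executor; returns true if it finished in time, false on timeout. */
    boolean runWithTimeout(Runnable operation) throws InterruptedException, ExecutionException {
        Future<?> future = pipelineExecutor.submit(operation);
        try {
            // The calling thread blocks here for at most the configured timeout.
            future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true);            // interrupt the blocked operation
            timeoutCount.incrementAndGet(); // record the timeout event
            return false;
        }
    }

    long timeoutCount() {
        return timeoutCount.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            PipelineTimeoutSketch sketch = new PipelineTimeoutSketch(executor, Duration.ofMillis(200));
            boolean fast = sketch.runWithTimeout(() -> { });
            boolean slow = sketch.runWithTimeout(() -> {
                try {
                    Thread.sleep(5_000); // simulates an unresponsive block node
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            });
            System.out.println("fast=" + fast + ", slow=" + slow + ", timeouts=" + sketch.timeoutCount());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

What happens after a timeout (throwing, calling `handleStreamFailure()`, or only logging) differs per operation as listed above, which is why the sketch reports the outcome to the caller instead of deciding itself.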

**Note**: The dedicated executor (typically a virtual thread executor in production) is provided during construction and is shut down when the connection closes, with a 5-second grace period for termination so that no resources leak. If tasks do not complete within the grace period, `shutdownNow()` is called to forcefully terminate them.
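
A shutdown with a grace period, as described in the note, might look roughly like the sketch below. The class `ExecutorShutdownSketch` and the method `shutdownWithGrace` are hypothetical names chosen for illustration; only `shutdown()`, `awaitTermination()`, and `shutdownNow()` are standard `ExecutorService` calls. The demo uses a short grace period instead of 5 seconds so that it finishes quickly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper illustrating executor shutdown with a grace period. */
final class ExecutorShutdownSketch {

    /** Returns true if all tasks finished within the grace period, false if a forced shutdown was needed. */
    static boolean shutdownWithGrace(ExecutorService executor, long grace, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks; running tasks continue
        try {
            if (executor.awaitTermination(grace, unit)) {
                return true;
            }
            executor.shutdownNow(); // grace period elapsed: interrupt remaining tasks
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // restore interrupt status for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> {
            try {
                Thread.sleep(60_000); // simulates a task stuck on a pipeline call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        boolean graceful = shutdownWithGrace(executor, 200, TimeUnit.MILLISECONDS);
        System.out.println("graceful=" + graceful + ", isShutdown=" + executor.isShutdown());
    }
}
```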

#### Exception Handling

The implementation handles multiple exception scenarios across all timeout-protected operations:
- **TimeoutException**: Pipeline operation exceeded the timeout - triggers failure handling for `onNext()` and pipeline creation, logged for `onComplete()`
- **InterruptedException**: Thread was interrupted while waiting - interrupt status is restored via `Thread.currentThread().interrupt()` before propagating the exception (for `onNext()` and pipeline creation) or logging it (for `onComplete()` and executor shutdown)
- **ExecutionException**: Error occurred during pipeline operation execution - the underlying cause is unwrapped and re-thrown (for `onNext()` and pipeline creation) or logged (for `onComplete()`)

All exception scenarios include appropriate DEBUG-level logging with context information to aid in troubleshooting.
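
The three branches for the `onNext()` path can be sketched as below. This is an illustration only: `PipelineExceptionSketch`, `awaitOnNext`, and the `onTimeout` callback (a stand-in for `handleStreamFailure()`) are hypothetical, and wrapping the propagated exceptions in `RuntimeException` is an assumption made to keep the sketch small. The demo uses `CompletableFuture` to simulate a hung and a failed operation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper showing how the three exception types might be handled for onNext(). */
final class PipelineExceptionSketch {

    static void awaitOnNext(Future<?> future, long timeoutMillis, Runnable onTimeout) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the blocked operation
            onTimeout.run();     // stand-in for handleStreamFailure()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore interrupt status before propagating
            throw new RuntimeException("Interrupted while waiting for pipeline operation", e);
        } catch (ExecutionException e) {
            Throwable cause = e.getCause(); // unwrap the failure raised inside the operation
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new RuntimeException(cause);
        }
    }

    public static void main(String[] args) {
        // A future that never completes simulates a hung onNext() call.
        awaitOnNext(new CompletableFuture<Void>(), 50, () -> System.out.println("timeout handled"));

        // A failed future simulates an error thrown inside the pipeline operation.
        try {
            awaitOnNext(CompletableFuture.failedFuture(new IllegalStateException("boom")), 50, () -> { });
        } catch (IllegalStateException e) {
            System.out.println("rethrown cause: " + e.getMessage());
        }
    }
}
```

For `onComplete()`, the text above says the same exceptions are logged rather than propagated, so an equivalent helper for that path would replace the `throw` statements with log calls.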

#### Metrics

A new metric `conn_pipelineOperationTimeout` tracks the total number of timeout events for pipeline creation, `onNext()`, and `onComplete()` operations, enabling operators to monitor block node responsiveness and connection establishment issues.