
Conversation

Roasbeef
Member

@Roasbeef Roasbeef commented Oct 3, 2025

Problem

Users were experiencing "connection reset by peer" errors when using the RFQ system with price oracle servers (issue #1814). The core problem manifested when:

  1. An edge node establishes a connection to a price oracle server
  2. The connection sits idle between RFQ price queries (potentially for minutes or hours)
  3. The network layer silently closes the idle connection without sending a proper GOAWAY frame
  4. The client remains unaware of the broken connection
  5. The next RFQ request fails immediately with "connection reset by peer"
  6. A retry succeeds because it establishes a fresh connection

This created a poor user experience where the first payment attempt after an idle period would always fail, requiring manual retries.

Root Cause

The issue had two contributing factors:

Missing Client-Side Keepalive: The RFQ price oracle client (rfq/oracle.go) had no keepalive configuration at all. Without active health probing, the client couldn't detect when connections were broken and wouldn't prevent intermediaries (NATs, load balancers, firewalls) from timing out idle connections.

Aggressive Server-Side Timeout: The tapd gRPC server only configured MaxConnectionIdle: 2 minutes, which would aggressively close any connection idle for more than 2 minutes. This is far too short for RFQ operations where price queries may be infrequent.

Solution

We implemented comprehensive bidirectional keepalive configuration following the same pattern used by lnd:

Client-Side Configuration (rfq/oracle.go)

Added to both serverDialOpts() and insecureServerDialOpts():

grpc.WithKeepaliveParams(keepalive.ClientParameters{
    Time:                30 * time.Second,  // Ping every 30s when idle
    Timeout:             20 * time.Second,  // Wait 20s for ping response
    PermitWithoutStream: true,              // Ping even without active RPCs
})

Key Settings:

  • Time: 30s - Client sends a ping after 30 seconds of inactivity. This is frequent enough to keep connections alive through most intermediaries while being conservative with network resources.
  • Timeout: 20s - If the server doesn't respond to a ping within 20 seconds, the connection is considered dead. This allows prompt detection of network issues.
  • PermitWithoutStream: true - Critical setting. Allows pings even when no RPC calls are active. Without this, the client wouldn't send pings during idle periods, defeating the entire purpose.
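Taken together, the client-side wiring might look like the following sketch. Only the keepalive option is taken from the PR description; the `serverDialOpts` name comes from the PR text, and the surrounding function body and TLS handling are illustrative, not the actual `rfq/oracle.go` code.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

// serverDialOpts returns the dial options used when connecting to the
// price oracle over TLS. The credentials argument is a placeholder for
// whatever transport security the caller has configured.
func serverDialOpts(creds credentials.TransportCredentials) []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithTransportCredentials(creds),

		// Actively probe the connection so idle links are kept warm
		// through NATs/load balancers and broken ones fail fast.
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s idle
			Timeout:             20 * time.Second, // wait 20s for the ack
			PermitWithoutStream: true,             // ping with no active RPCs
		}),
	}
}
```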

Server-Side Configuration (server.go)

Enhanced the existing minimal configuration with comprehensive settings:

serverKeepalive := keepalive.ServerParameters{
    Time:              time.Minute,      // Ping clients every 1min when idle
    Timeout:           20 * time.Second, // Wait 20s for ping response
    MaxConnectionIdle: time.Hour * 24,   // Allow 24h of idle time
}

clientKeepalive := keepalive.EnforcementPolicy{
    MinTime:             5 * time.Second, // Min time between client pings
    PermitWithoutStream: true,            // Accept client pings without active RPCs
}

Key Settings:

  • Time: 1min - Server actively pings clients every minute when idle. This provides server-initiated health checking.
  • Timeout: 20s - Server waits 20 seconds for ping responses before considering the connection dead.
  • MaxConnectionIdle: 24h - Extended from 2 minutes to 24 hours. Allows long-lived connections that are crucial for RFQ operations. The active pinging mechanism (via Time parameter) handles health checking, so we don't need aggressive idle timeouts.
  • EnforcementPolicy.PermitWithoutStream: true - Critical setting. Tells the server to accept keepalive pings from clients even when there are no active RPC calls. Without this, the server would reject client pings on idle connections with a GOAWAY, breaking the client-side keepalive mechanism.
  • MinTime: 5s - Prevents abusive clients from sending pings too frequently (DoS protection).
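On the server side, these two structs are handed to gRPC as server options. A minimal sketch of the wiring (the option plumbing here is assumed for illustration, not copied from `server.go`):

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newRPCServer builds a gRPC server with the keepalive settings from
// this PR applied.
func newRPCServer() *grpc.Server {
	serverKeepalive := keepalive.ServerParameters{
		Time:              time.Minute,      // server-initiated pings when idle
		Timeout:           20 * time.Second, // deadline for the ping ack
		MaxConnectionIdle: time.Hour * 24,   // tolerate long-lived idle conns
	}

	clientKeepalive := keepalive.EnforcementPolicy{
		MinTime:             5 * time.Second, // throttle abusive ping rates
		PermitWithoutStream: true,            // don't GOAWAY idle-conn pings
	}

	return grpc.NewServer(
		grpc.KeepaliveParams(serverKeepalive),
		grpc.KeepaliveEnforcementPolicy(clientKeepalive),
	)
}
```

Note that both options matter together: without the enforcement policy, the server would answer its own pings but reject the client's idle-connection pings with a GOAWAY.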

How It Works

With these settings in place:

  1. During Idle Periods: Both client and server send periodic pings (client every 30s, server every 1min) to verify the connection is still alive
  2. Network Issue Detection: If a ping doesn't get a response within 20s, the connection is immediately closed and the application is notified
  3. Intermediary Keepalive: The regular pings prevent NATs, load balancers, and firewalls from timing out the connection
  4. No Silent Failures: Connections are never silently broken - either they work or they fail fast with proper notification

Fixes #1814

@Roasbeef Roasbeef added this to the v0.7 milestone Oct 3, 2025

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves persistent "connection reset by peer" errors experienced by users of the RFQ system, particularly with price oracle servers. The core issue stemmed from a lack of client-side keepalive mechanisms and an overly aggressive server-side idle connection timeout. By implementing comprehensive bidirectional gRPC keepalive settings, the system will now actively maintain connections, detect network issues promptly, and prevent intermediaries from silently dropping idle connections, leading to a more robust and reliable user experience for RFQ operations.

Highlights

  • Client-Side Keepalive Configuration: Implemented grpc.WithKeepaliveParams in rfq/oracle.go for both secure and insecure dial options, setting a 30-second ping interval, 20-second timeout, and crucially, PermitWithoutStream: true to maintain connections during idle periods.
  • Server-Side Keepalive Configuration: Updated server.go to use comprehensive keepalive.ServerParameters with a 1-minute ping interval, 20-second timeout, and an extended MaxConnectionIdle of 24 hours, replacing the previous aggressive 2-minute timeout.
  • Server-Side Client Keepalive Enforcement: Added keepalive.EnforcementPolicy to the server, allowing it to accept client pings even without active RPCs (PermitWithoutStream: true) and setting a MinTime of 5 seconds between client pings for abuse prevention.
  • Resolution of "Connection Reset by Peer" Errors: These changes collectively address the "connection reset by peer" errors by ensuring active health probing and preventing network intermediaries from silently closing idle gRPC connections, significantly improving the reliability of RFQ price queries.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a "connection reset by peer" error in the RFQ system by properly configuring gRPC keep-alive settings on both the client and server. The changes introduce bidirectional keep-alive pings to maintain long-lived idle connections and allow for faster detection of broken connections. The implementation is sound and follows best practices for gRPC keep-alive. I have one suggestion to improve code maintainability by reducing duplication in the client-side configuration.

@coveralls

coveralls commented Oct 3, 2025

Pull Request Test Coverage Report for Build 18384444381

Details

  • 33 of 33 (100.0%) changed or added relevant lines in 2 files are covered.
  • 18 unchanged lines in 6 files lost coverage.
  • Overall coverage increased (+6.9%) to 56.555%

Files with Coverage Reduction         New Missed Lines   %
mssmt/compacted_tree.go               2                  77.65%
tapdb/mssmt.go                        2                  90.45%
tapdb/multiverse.go                   2                  81.78%
universe/supplyverifier/manager.go    2                  64.53%
tapdb/assets_store.go                 4                  79.82%
universe/archive.go                   6                  78.95%
Totals Coverage Status
Change from base Build 18359865357: 6.9%
Covered Lines: 63967
Relevant Lines: 113105

💛 - Coveralls

@Roasbeef Roasbeef requested review from a team, GeorgeTsagk and darioAnongba and removed request for a team October 3, 2025 21:15
Contributor

@darioAnongba darioAnongba left a comment


LGTM. As far as I understand, price oracles are public entities of the network that can potentially be queried by a significant number of clients. I understand the keepalive on the client side, but I'm wondering what the impact on resources is for the server side.

@darioAnongba
Contributor

Failing tests are unrelated to the PR.

In this commit, we add comprehensive client-side gRPC keepalive
parameters to the price oracle client connections. This addresses the
root cause of issue lightninglabs#1814 where connections to price oracle servers
were being silently closed after idle periods, resulting in "connection
reset by peer" errors during RFQ operations.

The key change is adding PermitWithoutStream set to true, which allows
the client to send keepalive pings even when there are no active RPC
calls. This is essential for long-lived connections that may experience
extended idle periods between price queries. Without this setting, idle
connections would be closed by intermediaries or the server itself,
leaving the client unaware of the broken connection until the next RPC
attempt.

We configure the client to ping the server every 30 seconds of
inactivity and wait 20 seconds for a response. These values are
conservative enough to detect connection issues quickly while avoiding
excessive network traffic. The same keepalive parameters are applied to
both TLS and insecure (testing-only) connection modes to ensure
consistent behavior.

Fixes lightninglabs#1814
In this commit, we enhance the server-side gRPC keepalive configuration
to work in coordination with the client-side keepalive settings added
in the previous commit. This completes the fix for issue lightninglabs#1814 by
ensuring both sides of the connection actively maintain connection
health.

Previously, the server only configured MaxConnectionIdle set to 2
minutes, which would aggressively close idle connections. This caused
problems for price oracle connections that could be idle for extended
periods between RFQ operations. We now extend MaxConnectionIdle to 24
hours and add active health checking through Time and Timeout
parameters.

The critical addition is the EnforcementPolicy with PermitWithoutStream
set to true. This allows clients to send keepalive pings even when no
RPC calls are active, which is essential for long-lived connections.
Without this policy, the server would reject client keepalive pings on
idle connections, defeating the purpose of the client-side keepalive
configuration.

These settings follow the same pattern used by lnd and are based on
gRPC's official keepalive recommendations. The combination of active
pinging from both client and server, along with permissive policies,
ensures connections remain healthy and any network issues are detected
promptly rather than discovered only when the next RPC fails.

Fixes lightninglabs#1814
@Roasbeef
Member Author

Roasbeef commented Oct 9, 2025

I understand the keepalive client side, but I'm wondering what is the impact on resources for the server side.

IIUC it's relatively minor, as the keepalive mechanism uses the PING frame at the HTTP/2 layer, whose payload is 8 bytes. So if we ping every 60 seconds, that's 8 bytes a minute, or ~11 KB a day.
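The back-of-the-envelope arithmetic above checks out, counting the 8-byte PING payload only (each frame also carries a 9-byte HTTP/2 frame header on the wire, so the true cost is slightly higher but still tiny):

```go
package main

import "fmt"

func main() {
	const pingPayloadBytes = 8  // HTTP/2 PING frame payload size
	const pingsPerDay = 24 * 60 // one server-side ping per minute

	total := pingPayloadBytes * pingsPerDay
	fmt.Printf("%d bytes/day (~%.2f KiB)\n", total, float64(total)/1024)
	// One ping per minute works out to 11520 bytes/day, ~11.25 KiB.
}
```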

@Roasbeef
Member Author

Roasbeef commented Oct 9, 2025

@GeorgeTsagk PTAL

Member

@GeorgeTsagk GeorgeTsagk left a comment


LGTM!

@GeorgeTsagk GeorgeTsagk added this pull request to the merge queue Oct 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 10, 2025
@ffranr ffranr added this pull request to the merge queue Oct 10, 2025
Merged via the queue into lightninglabs:main with commit 6928aa6 Oct 10, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Taproot-Assets Project Board Oct 10, 2025

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[bug]: Issue with Paying Sats Invoices Using Taproot Assets via sendPayment with rfqId

5 participants