rfq+server: properly configure gRPC keep alive settings for client+server #1834
Summary of Changes (Gemini Code Assist): This pull request resolves persistent "connection reset by peer" errors experienced by users of the RFQ system, particularly with price oracle servers. The core issue stemmed from a lack of client-side keepalive mechanisms and an overly aggressive server-side idle connection timeout. By implementing comprehensive bidirectional gRPC keepalive settings, the system will now actively maintain connections, detect network issues promptly, and prevent intermediaries from silently dropping idle connections, leading to a more robust and reliable user experience for RFQ operations.
Code Review
This pull request addresses a "connection reset by peer" error in the RFQ system by properly configuring gRPC keep-alive settings on both the client and server. The changes introduce bidirectional keep-alive pings to maintain long-lived idle connections and allow for faster detection of broken connections. The implementation is sound and follows best practices for gRPC keep-alive. I have one suggestion to improve code maintainability by reducing duplication in the client-side configuration.
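The suggested diff itself isn't preserved above, but a refactor in that spirit would extract the shared keepalive parameters into one helper used by both `serverDialOpts()` and `insecureServerDialOpts()`. A minimal sketch, assuming grpc-go's `keepalive` package (the helper name `keepaliveDialOpt` is illustrative, not from the PR):

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveDialOpt returns the keepalive dial option shared by the
// TLS and insecure (testing-only) connection paths, so the parameter
// values live in exactly one place.
func keepaliveDialOpt() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                30 * time.Second,
		Timeout:             20 * time.Second,
		PermitWithoutStream: true,
	})
}
```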
Coveralls: Pull Request Test Coverage Report for Build 18384444381.
LGTM. As far as I understand, price oracles are public entities on the network that can potentially be queried by a significant number of clients. I understand the client-side keepalive, but I'm wondering what the impact on server-side resources is.

Failing tests are unrelated to the PR.
In this commit, we add comprehensive client-side gRPC keepalive parameters to the price oracle client connections. This addresses the root cause of issue lightninglabs#1814 where connections to price oracle servers were being silently closed after idle periods, resulting in "connection reset by peer" errors during RFQ operations.

The key change is adding PermitWithoutStream set to true, which allows the client to send keepalive pings even when there are no active RPC calls. This is essential for long-lived connections that may experience extended idle periods between price queries. Without this setting, idle connections would be closed by intermediaries or the server itself, leaving the client unaware of the broken connection until the next RPC attempt.

We configure the client to ping the server every 30 seconds of inactivity and wait 20 seconds for a response. These values are conservative enough to detect connection issues quickly while avoiding excessive network traffic. The same keepalive parameters are applied to both TLS and insecure (testing-only) connection modes to ensure consistent behavior.

Fixes lightninglabs#1814
In this commit, we enhance the server-side gRPC keepalive configuration to work in coordination with the client-side keepalive settings added in the previous commit. This completes the fix for issue lightninglabs#1814 by ensuring both sides of the connection actively maintain connection health.

Previously, the server only configured MaxConnectionIdle set to 2 minutes, which would aggressively close idle connections. This caused problems for price oracle connections that could be idle for extended periods between RFQ operations. We now extend MaxConnectionIdle to 24 hours and add active health checking through Time and Timeout parameters.

The critical addition is the EnforcementPolicy with PermitWithoutStream set to true. This allows clients to send keepalive pings even when no RPC calls are active, which is essential for long-lived connections. Without this policy, the server would reject client keepalive pings on idle connections, defeating the purpose of the client-side keepalive configuration.

These settings follow the same pattern used by lnd and are based on gRPC's official keepalive recommendations. The combination of active pinging from both client and server, along with permissive policies, ensures connections remain healthy and any network issues are detected promptly rather than discovered only when the next RPC fails.

Fixes lightninglabs#1814
Force-pushed from 9976595 to ba07042.
IIUC it's relatively minor, as the keep alive settings using the …
@GeorgeTsagk PTAL
LGTM!
Problem
Users were experiencing "connection reset by peer" errors when using the RFQ system with price oracle servers (issue #1814). The core problem manifested when a connection to a price oracle sat idle between price queries: the connection would be silently closed by the server or an intermediary, and the client only discovered the broken connection when the next RPC attempt failed.

This created a poor user experience where the first payment attempt after an idle period would always fail, requiring manual retries.
Root Cause
The issue had two contributing factors:

1. Missing Client-Side Keepalive: The RFQ price oracle client (`rfq/oracle.go`) had no keepalive configuration at all. Without active health probing, the client couldn't detect when connections were broken and wouldn't prevent intermediaries (NATs, load balancers, firewalls) from timing out idle connections.

2. Aggressive Server-Side Timeout: The tapd gRPC server only configured `MaxConnectionIdle: 2 minutes`, which would aggressively close any connection idle for more than 2 minutes. This is far too short for RFQ operations where price queries may be infrequent (see the sketch after this list).
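For contrast, the entire pre-fix server keepalive configuration amounted to a single option. A minimal sketch, assuming grpc-go's `grpc` and `keepalive` packages:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Before this PR: the only keepalive-related server option, which
// closed any connection that sat idle for more than two minutes.
var oldIdleOpt = grpc.KeepaliveParams(keepalive.ServerParameters{
	MaxConnectionIdle: 2 * time.Minute,
})
```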
Solution

We implemented comprehensive bidirectional keepalive configuration following the same pattern used by lnd:
Client-Side Configuration (`rfq/oracle.go`)

Added to both `serverDialOpts()` and `insecureServerDialOpts()`. Key settings (a sketch of the resulting dial options follows this list):

- `Time: 30s` - Client sends a ping after 30 seconds of inactivity. This is frequent enough to keep connections alive through most intermediaries while being conservative with network resources.
- `Timeout: 20s` - If the server doesn't respond to a ping within 20 seconds, the connection is considered dead. This allows prompt detection of network issues.
- `PermitWithoutStream: true` - Critical setting. Allows pings even when no RPC calls are active. Without this, the client wouldn't send pings during idle periods, defeating the entire purpose.
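A sketch of what the resulting dial options could look like, assuming grpc-go (the function name `oracleDialOpts` and its signature are illustrative; the real `serverDialOpts()` in `rfq/oracle.go` may differ, and the insecure variant would swap the TLS credentials for insecure ones):

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

// oracleDialOpts returns the dial options for a price oracle
// connection: TLS transport credentials plus the client keepalive
// parameters described above.
func oracleDialOpts(tlsCreds credentials.TransportCredentials) []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithTransportCredentials(tlsCreds),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			// Ping after 30s of inactivity.
			Time: 30 * time.Second,
			// Consider the connection dead if the ping isn't
			// acknowledged within 20s.
			Timeout: 20 * time.Second,
			// Keep pinging even with no active RPCs, so idle
			// connections stay alive.
			PermitWithoutStream: true,
		}),
	}
}
```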
Server-Side Configuration (`server.go`)

Enhanced the existing minimal configuration with comprehensive settings:
Key Settings (a sketch of the combined server options follows this list):

- `Time: 1min` - Server actively pings clients every minute when idle. This provides server-initiated health checking.
- `Timeout: 20s` - Server waits 20 seconds for ping responses before considering the connection dead.
- `MaxConnectionIdle: 24h` - Extended from 2 minutes to 24 hours. Allows long-lived connections that are crucial for RFQ operations. The active pinging mechanism (via the `Time` parameter) handles health checking, so we don't need aggressive idle timeouts.
- `MinTime: 5s` - Prevents abusive clients from sending pings too frequently (DoS protection).
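Taken together, the server-side options could be wired up roughly as follows, assuming grpc-go (the constructor name `newKeepaliveServer` is illustrative; the actual setup in `server.go` passes more options):

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newKeepaliveServer builds a gRPC server with the keepalive
// parameters and enforcement policy described above.
func newKeepaliveServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Allow day-long idle connections instead of the
			// previous 2-minute cutoff.
			MaxConnectionIdle: 24 * time.Hour,
			// Server-initiated ping after a minute of idleness.
			Time: time.Minute,
			// Wait 20s for a ping ack before closing.
			Timeout: 20 * time.Second,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Reject clients that ping more often than every 5s
			// (DoS protection).
			MinTime: 5 * time.Second,
			// Accept pings on connections with no active RPCs,
			// matching the client's PermitWithoutStream.
			PermitWithoutStream: true,
		}),
	)
}
```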
How It Works

With these settings in place, both sides actively maintain the connection: the client pings the oracle after 30 seconds of inactivity, the server pings idle clients every minute, and the server's enforcement policy accepts those pings even when no RPCs are in flight. Connections can now stay idle for up to 24 hours without being torn down, and a broken connection is detected within tens of seconds instead of surfacing as a "connection reset by peer" error on the next RPC.
Fixes #1814