
Conversation

@sleipnir
Collaborator

@sleipnir sleipnir commented Dec 8, 2025

Hi everyone, I'd like to talk a little about what I'm currently working on here.

I'm almost finished implementing a new server-side adapter written entirely in Elixir on top of the thousand_island library, and the results so far are promising, as shown below.

Performance Metrics

| Metric | Cowboy | ThousandIsland | Improvement |
| --- | --- | --- | --- |
| Requests Processed | 1,475 | 3,115 | +111% (2.11x) |
| Total Time | 1.46s | 0.86s | -41% (1.71x faster) |
| Throughput | ~1,008 req/s | ~3,637 req/s | +261% |
| Minimum Latency | 618µs | 177µs | -71% (3.5x faster) |
| Maximum Latency | 89.1ms | 39.6ms | -56% (2.25x faster) |
| Average Latency | ~991µs | ~274µs | -72% (3.6x faster) |
| CPU Time (User) | 3.12s | 1.72s | -45% |
| CPU Time (System) | 0.50s | 0.48s | -4% |

Analysis

Advantages of ThousandIsland:

  • Dramatically higher throughput:
    Processed more than double the number of requests in the same test period.
  • Significantly lower latencies:
    • Minimum latency 3.5x lower
    • Maximum latency 2.25x lower
    • Average latency approximately 3.6x lower
  • CPU efficiency:
    Used 45% less CPU time (user time).
  • Reduced total execution time:
    Completed the test run in almost half the time.

The ThousandIsland adapter demonstrates substantially superior performance compared to Cowboy across all measured aspects. The implementation delivers:

  • Higher throughput (~3.6x)
  • Lower latencies (roughly 2.25-3.6x lower, depending on the metric)
  • Better CPU efficiency (~45% less user CPU time)

This is still a draft and needs a lot of refinement. I opened the PR just to document the work and share it with everyone.

Adriano Santos added 18 commits December 8, 2025 15:54
Move grpc_core/lib/grpc/http2 to grpc_core/lib/grpc/transport/http2
to better reflect that HTTP/2 is the transport layer for gRPC.

Changes:
- Rename GRPC.HTTP2.* to GRPC.Transport.HTTP2.*
- Update all imports and aliases in grpc_server and grpc_client
- Update all test files
Add detailed test coverage for HTTP/2 frame implementations in the frame/
directory, focusing on gRPC-specific use cases and edge cases.

These tests cover gRPC-specific HTTP/2 scenarios including trailers-only
responses, large message handling, connection keepalive, and flow control
patterns commonly used in gRPC implementations.
HTTP/2 (RFC 9113) requires that a HEADERS frame carrying the :status
pseudo-header be sent before any TRAILERS frame. The previous
implementation conditionally skipped HEADERS when a stream had already
received END_STREAM from the client, causing protocol errors.

This fix ensures that send_grpc_error ALWAYS sends HTTP/2 HEADERS
(with required :status and :content-type headers) before sending
TRAILERS (with grpc-status and grpc-message), regardless of the
stream state (half-closed remote or not).

This resolves the 'timeout_on_sleeping_server' interop test failure,
where the Gun client was rejecting error responses with the message:
'A required pseudo-header was not found'.
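
For illustration, a minimal sketch (not the actual send_grpc_error code) of the frame ordering this fix enforces for error responses; header values follow the gRPC-over-HTTP/2 spec:

# Illustration only: the response HEADERS block must be emitted before the
# trailers, which go out as a final HEADERS frame with END_STREAM set.
response_headers = [
  {":status", "200"},
  {"content-type", "application/grpc+proto"}
]

trailers = [
  # status 4 = DEADLINE_EXCEEDED, as in the timeout_on_sleeping_server test
  {"grpc-status", "4"},
  {"grpc-message", "Deadline expired"}
]

frames_to_send = [
  {:headers, response_headers, end_stream: false},
  {:headers, trailers, end_stream: true}
]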
@sleipnir
Collaborator Author

ThousandIsland Adapter Implementation Summary

This PR introduces a pure Elixir server adapter for gRPC using ThousandIsland, providing an alternative to the Cowboy adapter.

Implementation Checklist

  • Core HTTP/2 protocol implementation with state machine for stream lifecycle management
  • Full gRPC protocol support (unary, client streaming, server streaming, bidirectional streaming)
  • Async message-based architecture for non-blocking response sending
  • HTTP/2 frame compliance (HEADERS, DATA, TRAILERS with proper pseudo-header handling)
  • Deadline/timeout support with grpc-timeout header parsing (see the parsing sketch after this checklist)
  • Comprehensive test coverage across all packages
  • Interop test suite validation (18/18 tests passing)
  • Multiple client adapter support (Gun and Mint)
  • Error handling and graceful degradation
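
As a concrete illustration of the grpc-timeout handling referenced in the checklist, here is a minimal parsing sketch based on the gRPC spec (1-8 ASCII digits followed by a unit character); it is not the PR's actual parser:

defmodule GrpcTimeoutSketch do
  # Sketch only, not the PR's implementation. Unit characters per the gRPC spec:
  # H = hours, M = minutes, S = seconds, m = millis, u = micros, n = nanos.
  @unit_to_ms %{
    "H" => 3_600_000,
    "M" => 60_000,
    "S" => 1_000,
    "m" => 1,
    "u" => 1.0e-3,
    "n" => 1.0e-6
  }

  # No grpc-timeout header means no deadline.
  def to_ms(nil), do: :infinity

  def to_ms(header) when is_binary(header) do
    {digits, unit} = String.split_at(header, -1)
    round(String.to_integer(digits) * Map.fetch!(@unit_to_ms, unit))
  end
end

# GrpcTimeoutSketch.to_ms("1S")    #=> 1000
# GrpcTimeoutSketch.to_ms("200m")  #=> 200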

Test Coverage

grpc_core Package

  • 289 total tests (148 new HTTP/2 frame tests added)
  • 6 doctests
  • 100% pass rate (0 failures)

grpc_client Package

  • 191 total tests
  • 2 doctests
  • 100% pass rate (0 failures)
  • Coverage: Gun and Mint adapter integration, streaming scenarios, error handling

grpc_server Package

  • 260 total tests
  • 2 doctests
  • 100% pass rate (0 failures)
  • Coverage: ThousandIsland and Cowboy adapters, HTTP/2 connection management, stream lifecycle

Interop Test Suite

  • 18/18 tests passing with ThousandIsland adapter
  • Tested with the Gun client adapter
  • Tested with the Mint client adapter
  • All gRPC patterns validated:
    • empty_unary - Empty request/response
    • large_unary - Large payloads (10MB)
    • client_streaming - Client streaming with aggregation
    • server_streaming - Server streaming with multiple responses
    • ping_pong - Bidirectional streaming (alternating)
    • empty_stream - Bidirectional streaming with no messages
    • custom_metadata - Request/response metadata handling
    • status_code_and_message - Error status propagation
    • special_status_message - Unicode in error messages
    • unimplemented_service - Service not found (status 12, UNIMPLEMENTED)
    • unimplemented_method - Method not found (status 12, UNIMPLEMENTED)
    • cancel_after_begin - Early cancellation
    • cancel_after_first_response - Mid-stream cancellation
    • timeout_on_sleeping_server - Deadline exceeded (status 4, DEADLINE_EXCEEDED)

Total test count across all packages: 740+ tests

Architecture Overview

Handler Structure

GRPC.Server.Adapters.ThousandIsland.Handler
├── handle_connection/2 - Initial HTTP/2 setup
├── handle_data/3 - Process incoming HTTP/2 frames
└── handle_info/2 - Async message handling for responses
    ├── {:grpc_send_data, stream_id, data}
    └── {:grpc_send_trailers, stream_id, trailers}
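
A minimal sketch of how these callbacks fit together on top of ThousandIsland.Handler (which runs inside a GenServer, so the async response messages arrive in handle_info/2). This is illustrative only, elides all HTTP/2 frame encoding/decoding, and uses a placeholder module name rather than the PR's code:

defmodule ExampleGrpcHandler do
  use ThousandIsland.Handler

  @impl ThousandIsland.Handler
  def handle_connection(_socket, state) do
    # Initial HTTP/2 setup: read the connection preface, exchange SETTINGS, etc.
    {:continue, state}
  end

  @impl ThousandIsland.Handler
  def handle_data(_data, _socket, state) do
    # Parse incoming HTTP/2 frames, update per-stream state and dispatch
    # complete gRPC requests to the service implementation.
    {:continue, state}
  end

  # ThousandIsland.Handler keeps {socket, state} as the GenServer state, so the
  # async response messages from the service land here and are written in order.
  def handle_info({:grpc_send_data, _stream_id, encoded_frame}, {socket, state}) do
    ThousandIsland.Socket.send(socket, encoded_frame)
    {:noreply, {socket, state}}
  end

  def handle_info({:grpc_send_trailers, _stream_id, encoded_trailers}, {socket, state}) do
    ThousandIsland.Socket.send(socket, encoded_trailers)
    {:noreply, {socket, state}}
  end
end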

Connection State Machine

GRPC.Server.HTTP2.Connection
├── Stream lifecycle: idle → open → half_closed → closed
├── Frame processing: HEADERS, DATA, CONTINUATION, TRAILERS
├── Flow control: WINDOW_UPDATE, SETTINGS
├── Error handling: GOAWAY, RST_STREAM
└── State management:
    ├── headers_sent: boolean
    ├── end_stream_received: boolean
    └── stream removal after END_STREAM sent
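
A simplified sketch of the stream lifecycle transitions listed above (illustration only; the real connection module also tracks flow-control windows, HPACK state, headers_sent, and so on):

defmodule StreamLifecycleSketch do
  # A HEADERS frame opens an idle stream.
  def transition(:idle, {:recv, :headers}), do: :open

  # END_STREAM from the peer, or sent by us, half-closes the stream.
  def transition(:open, {:recv, :end_stream}), do: :half_closed_remote
  def transition(:open, {:send, :end_stream}), do: :half_closed_local

  # The second END_STREAM, or an RST_STREAM in either direction, fully closes it.
  def transition(:half_closed_remote, {:send, :end_stream}), do: :closed
  def transition(:half_closed_local, {:recv, :end_stream}), do: :closed
  def transition(_state, {_direction, :rst_stream}), do: :closed

  # Anything else leaves the stream state unchanged.
  def transition(state, _event), do: state
end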

Async Response Model

The ThousandIsland adapter relies on asynchronous message passing so that sending responses never blocks frame processing, which matters most for bidirectional streaming (see the sketch after the list below):

  1. Request Phase:

    • Client sends HEADERS + DATA frames
    • Server parses HTTP/2 frames into gRPC request
    • Stream marked as open in connection state
  2. Dispatch Phase:

    • Request dispatched to service implementation
    • Stream kept alive for async response messages
    • Service receives context with response PID
  3. Response Phase:

    • Service sends responses via async messages (send(pid, {:grpc_send_data, ...}))
    • Handler receives messages in handle_info/2
    • Connection state updated after each send operation
    • Proper ordering maintained through message queue
  4. Completion Phase:

    • Service sends TRAILERS with end_stream: true
    • Stream removed from connection state
    • Resources cleaned up gracefully
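
A hedged sketch of phases 2-4 from the dispatching side (the sketch referenced above); DispatchSketch and the encode_* helpers are hypothetical placeholders for the real dispatch and frame-encoding code:

defmodule DispatchSketch do
  def dispatch(stream_id, replies, handler_pid) do
    Task.start(fn ->
      # Dispatch phase: the service runs in its own process, so the handler
      # keeps parsing frames while responses are being produced.
      for reply <- replies do
        # Response phase: each reply is sent back asynchronously; the handler's
        # handle_info/2 writes it to the socket in arrival order.
        send(handler_pid, {:grpc_send_data, stream_id, encode_data(reply)})
      end

      # Completion phase: trailers carry grpc-status/grpc-message plus END_STREAM,
      # after which the handler drops the stream from its connection state.
      send(handler_pid, {:grpc_send_trailers, stream_id, encode_trailers(:ok)})
    end)
  end

  # Placeholder encoders; the real adapter emits proper HTTP/2 DATA and HEADERS frames.
  defp encode_data(reply), do: {:data_frame, reply}
  defp encode_trailers(status), do: {:trailers_frame, status}
end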

Compatibility

  • Elixir: 1.14.0+ (tested with 1.18.0)
  • HTTP/2 Clients: Gun (Erlang), Mint (Elixir) - both fully tested
  • gRPC Spec: Full compliance with gRPC-over-HTTP/2 specification
  • Existing APIs: Drop-in replacement for the Cowboy adapter (same configuration interface; see the example after this list)
  • Dependencies: ThousandIsland for HTTP/2 server capabilities
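
A hedged configuration example of what the drop-in replacement could look like. It assumes the adapter keeps being selected with the adapter: option (as the Cowboy adapter is today) and that the module is named GRPC.Server.Adapters.ThousandIsland, inferred from the handler name above; MyApp.Endpoint is a placeholder, and the exact option names may change before this PR is finalized:

children = [
  {
    GRPC.Server.Supervisor,
    endpoint: MyApp.Endpoint,
    port: 50_051,
    start_server: true,
    adapter: GRPC.Server.Adapters.ThousandIsland
  }
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)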

Performance Characteristics

  • Pure Elixir: No NIFs or external dependencies beyond ThousandIsland
  • Message-based concurrency: Leverages BEAM's strengths for async I/O
  • Memory efficient: Stream state management with proper cleanup on completion/cancellation
  • Graceful degradation: Handles client disconnections and timeouts without crashes
  • Concurrent requests: Tested with 8 concurrent workers across 5 rounds

Benchmark Results

Performance Comparison (default configuration: 1,000 requests)

Cowboy Adapter:

  • Elapsed time: 0.923 seconds
  • Requests processed: 2,466
  • Throughput: ~2,672 req/s
  • Average latency: 0.37 ms
  • Min latency: 0.24 ms
  • Max latency: 54.67 ms
  • CPU time (user): 2.11s
  • CPU time (system): 0.32s

ThousandIsland Adapter:

  • Elapsed time: 0.842 seconds (8.8% faster)
  • Requests processed: 3,226 (30.8% more)
  • Throughput: ~3,831 req/s (43.4% higher)
  • Average latency: 0.26 ms (29.7% lower)
  • Min latency: 0.16 ms
  • Max latency: 51.88 ms
  • CPU time (user): 1.74s (17.5% less CPU)
  • CPU time (system): 0.46s

Summary:

  • ThousandIsland demonstrates significantly superior performance across all metrics
  • Processed 760 more requests in less time (3,226 vs 2,466)
  • Lower average latency (0.26ms vs 0.37ms) provides better user experience
  • Higher throughput (~3,800 req/s vs ~2,700 req/s) - 43% improvement
  • Lower CPU usage (1.74s vs 2.11s user time) indicates better efficiency
  • Both adapters maintain sub-millisecond average latency under load

Interop Test Stability

  • 18 tests × 5 rounds × 2 client adapters = 180 test executions
  • Average time per round: ~2-3 seconds
  • Total execution time: ~25 seconds for full suite
  • Stability: 100% success rate across all runs
  • Concurrency: 8 workers processing tests in parallel

Memory Characteristics

  • Streams properly cleaned up after completion
  • No memory leaks detected during extended test runs
  • Graceful handling of cancelled/timed-out streams

This implementation provides a solid foundation for pure Elixir gRPC servers, with excellent test coverage (740+ tests) and full protocol compliance. The ThousandIsland adapter is still experimental, but it is already capable of serving as a drop-in replacement for the Cowboy adapter.

@sleipnir sleipnir marked this pull request as ready for review December 11, 2025 04:30
Contributor

@polvalente polvalente left a comment


Blocking on the GRPC.Server.Stream opts bug

@@ -0,0 +1,259 @@
defmodule GRPC.Server.HTTP2.FrameTest do

stray file

@aseigo
Contributor

aseigo commented Dec 16, 2025

I have an umbrella app with different apps starting their own GRPC.Server.Supervisors in their own supervision trees (with different Endpoint modules), and it fails to start on this branch with the following error:

** (Mix) Could not start application discovery: Discovery.Application.start(:normal, []) returned an error: shutdown: failed to start child: GRPC.Server.Supervisor
    ** (EXIT) shutdown: failed to start child: GRPC.Server.StreamTaskSupervisor
        ** (EXIT) already started: #PID<0.871.0>

The Application module with the supervision tree:

defmodule Discovery.Application do
  @moduledoc false

  use Application

  @impl true
  def start(_type, _args) do
    children = [
      Discovery.Repo,
      Discovery.RateLimit,
      {
        GRPC.Server.Supervisor,
        endpoint: Discovery.Endpoint,
        port: Application.get_env(:discovery, :grpc_port),
        start_server: true,
        adapter_opts: [
          cred: GRPC.Credential.new(ssl: Application.get_env(:discovery, :ssl))
        ]
      }
    ]

    opts = [strategy: :one_for_one, name: Discovery.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

There is a nearly identical one in another app in the same umbrella but with a different endpoint, port, etc. So there are multiple GRPC servers and multiple Server.Supervisors running in the same BEAM instance.

This works with current releases of the grpc library, as well as with the master branch, but fails to start with this branch. Is this a known issue, or an expected change?

@polvalente
Contributor

I have an umbrella app with different apps starting their own GRPC.Server.Supervisors in their own supervision trees (with different Endpoint modules), and it fails to start on this branch [...] Is this a known issue, or an expected change?

This is something we already mapped out during code review yesterday. This branch is very much a work in progress.

@aseigo
Contributor

aseigo commented Dec 16, 2025

This branch is very much a work in progress.

Understood, and that's fine. If early testing is not wanted / needed, just say so and I'll happily come back later in the process.

@polvalente
Contributor

This branch is very much a work in progress.

Understood, and that's fine. If early testing is not wanted / needed, just say so and I'll happily come back later in the process.

@aseigo thank you very much! I think testing, especially external testing, is gonna be important when we move from the current state of flux we're in.

For instance, we're trying to refactor the process structure to ensure correctness, but given how much that might impact performance, it may take a bit of trial and error.

We do appreciate your contributions a lot!

@aseigo
Contributor

aseigo commented Dec 16, 2025

is gonna be important when we move from the current state of flux we're in.

Feel free to ping me when you are at that point; I'll kick the tires and take a closer look at the implementation then. Cheers!
