Add networking guide docs #138119


Open

mhl-b wants to merge 2 commits into elastic:main from mhl-b:networking-guide

Contributor

mhl-b commented Nov 14, 2025

A high level overview of networking for the distributed architecture guide. ES-7886


          add distrib networking guide docs

a83e71f

elasticsearchmachine added needs:triage v9.3.0 labels

mhl-b added Team:Distributed Coordination >non-issue :Distributed Coordination/Distributed and removed needs:triage labels

github-actions bot deployed to docs-preview

November 14, 2025 18:23

Collaborator

elasticsearchmachine commented Nov 14, 2025

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

mhl-b requested review from DaveCTurner and ywangd

November 14, 2025 18:25


          corrections

78e7752

github-actions bot deployed to docs-preview

November 14, 2025 18:55

DaveCTurner reviewed

Contributor

DaveCTurner left a comment

I left a few factual nits, nothing major. It needs some editorial polishing too - can we use any of our fancy new AI tools to help with that?

docs/internal/DistributedArchitectureGuide.md

Comment on lines +26 to +27

    
              be HTTP spec compliant, Elastic is not a webserver. We support GET requests with

              payload (some old proxies might drop content), requests cannot be cached by

Contributor

DaveCTurner Nov 15, 2025

Maybe mention that in these cases we also support the same API with a different verb, normally POST, for clients who cannot send a GET-with-body.
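
A minimal sketch of the verb fallback, using the JDK's `java.net.http` request builder. The host, the index name `my-index`, and the query body are placeholders and not taken from the PR. The sketch only builds the requests and does not send them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

public final class VerbFallbackSketch {
    // Hypothetical endpoint and body, for illustration only.
    static final URI SEARCH = URI.create("http://localhost:9200/my-index/_search");
    static final String QUERY = "{\"query\":{\"match_all\":{}}}";

    /** Builds the same search request with GET, or with POST when a GET body cannot be sent. */
    static HttpRequest searchRequest(boolean getWithBodySupported) {
        String verb = getWithBodySupported ? "GET" : "POST";
        return HttpRequest.newBuilder(SEARCH)
            .header("Content-Type", "application/json")
            .method(verb, BodyPublishers.ofString(QUERY)) // same payload either way
            .build();
    }

    public static void main(String[] args) {
        System.out.println(searchRequest(true).method());
        System.out.println(searchRequest(false).method());
    }
}
```

A client behind a proxy that drops GET bodies would pass `false` and keep the URL and payload unchanged.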

docs/internal/DistributedArchitectureGuide.md

    
              HTTP transport provides two options for content processing: aggregate fully and

              incremental. Aggregated content is a preferable choice for a small messages that

              cannot be parsed incrementally (like JSON). But aggregation has drawbacks, it

Contributor

DaveCTurner Nov 15, 2025

> cannot be parsed incrementally (like JSON)

Mmm sorta depends what you mean by "incrementally" but I'd say that the SAX-style parsing we do is in fact working incrementally. It doesn't really make sense to start parsing before we've received the whole body, but that's different from saying "cannot".
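
To make "incrementally" concrete, here is a toy sketch of a scanner that carries its state across chunks, so it can consume bytes as they arrive. It is not the parser Elasticsearch uses and it does not validate JSON. It only tracks nesting and string state, and the class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public final class IncrementalJsonScanSketch {
    private int depth = 0;            // current nesting of {} and []
    private boolean inString = false; // inside a JSON string literal
    private boolean escaped = false;  // previous char was a backslash inside a string
    private boolean started = false;  // saw the first opening bracket
    private boolean complete = false; // top-level object or array has closed

    /** Consumes one chunk; returns true once a top-level object or array has closed. */
    boolean feed(byte[] chunk) {
        for (byte b : chunk) {
            if (complete) {
                break;
            }
            char c = (char) b; // non-ASCII bytes never match the structural characters below
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
                started = true;
            } else if (c == '}' || c == ']') {
                depth--;
                if (started && depth == 0) {
                    complete = true;
                }
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        IncrementalJsonScanSketch scan = new IncrementalJsonScanSketch();
        // The body arrives in two chunks, split in the middle of a string value.
        System.out.println(scan.feed("{\"a\":\"}".getBytes(StandardCharsets.UTF_8)));
        System.out.println(scan.feed("\"}".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The point is only that the scanner's state survives chunk boundaries. Whether it is worth starting before the whole body has arrived is a separate question.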

docs/internal/DistributedArchitectureGuide.md

Comment on lines +78 to +79

    
              The job of the REST handler is to parse and validate HTTP request and construct

              typed version of request, often Transport request (see Transport section below).

Contributor

DaveCTurner Nov 15, 2025

Might be worth pointing out here that (if security is enabled then) authentication happens before the REST handler but authorization happens after (when entering the transport layer).
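
The ordering could be sketched as below. The stage names and the class are hypothetical and do not correspond to real Elasticsearch classes. The sketch only records the sequence described in this comment.

```java
import java.util.ArrayList;
import java.util.List;

public final class RequestPipelineSketch {
    /** Returns the order in which a request passes through each (hypothetically named) stage. */
    static List<String> stagesFor(boolean securityEnabled) {
        List<String> stages = new ArrayList<>();
        if (securityEnabled) {
            stages.add("authenticate");   // who is calling? runs before the REST handler
        }
        stages.add("rest-handler");       // parse and validate HTTP, build the typed request
        if (securityEnabled) {
            stages.add("authorize");      // may the caller run this action? checked on entering the transport layer
        }
        stages.add("transport-action");   // execute the typed request
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(stagesFor(true));
        System.out.println(stagesFor(false));
    }
}
```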

docs/internal/DistributedArchitectureGuide.md

    
              `Transport` is an umbrella term for a node-to-node communication. It's a

              TCP-based custom binary protocol. Every node in a cluster is a client and server

              at the same time. Node-to-node communication never uses HTTP transport.

Contributor

DaveCTurner Nov 15, 2025

> never

except for reindex-from-remote :)

docs/internal/DistributedArchitectureGuide.md

    
              connections for different purposes: ping, node-state, bulks, etc. Pool structure

              is defined in `ConnectionProfile` class.

              ES has resilience to disconnects, frequent reconnects, but in general we assume

Contributor

DaveCTurner Nov 15, 2025

> resilience

I don't think we are particularly resilient to frequent reconnects. At least, not without being more specific. ES never behaves incorrectly (e.g. loses data) in the face of network outages but it may become unavailable unless the network is stable.

docs/internal/DistributedArchitectureGuide.md

    
              Another area of different networking clients is snapshotting. ES supports

              snapshotting to remote repositories. These repositories usually come with their

              own SDK and networking stack. For example AWS SDK comes with tomcat or netty,

Contributor

DaveCTurner Nov 15, 2025

Although the Apache client we use may have originally come from Tomcat, it's not called that any more and this'll confuse folks that don't know the history. Let's just call it the Apache client.

docs/internal/DistributedArchitectureGuide.md

    
              be forked to another thread pool. But forking comes with overhead, doing forking

              on every tiny request is a wasted CPU work. As a rule of thumb: don't fork

              simple requests that can be served from memory and do not require heavy

              computations (seconds), otherwise fork.

Contributor

DaveCTurner Nov 15, 2025

A pause of "seconds" is still pretty disastrous

Suggested change

    - computations (seconds), otherwise fork.
    + computations (milliseconds), otherwise fork.
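
A minimal sketch of the rule of thumb, using plain `java.util.concurrent` rather than Elasticsearch's own thread pool classes. The `dispatch` helper and its `cheap` flag are assumptions for illustration. How a real handler decides that a request is cheap is not covered here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public final class ForkDecisionSketch {
    /** Runs cheap work inline on the calling thread and forks everything else to the given pool. */
    static <T> CompletableFuture<T> dispatch(boolean cheap, Supplier<T> work, ExecutorService pool) {
        if (cheap) {
            // Served from memory, no heavy computation: skip the hand-off overhead.
            return CompletableFuture.completedFuture(work.get());
        }
        // Anything heavier is handed off so the calling (e.g. network) thread stays free.
        return CompletableFuture.supplyAsync(work, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            String caller = Thread.currentThread().getName();
            String inline = dispatch(true, () -> Thread.currentThread().getName(), pool).join();
            String forked = dispatch(false, () -> Thread.currentThread().getName(), pool).join();
            System.out.println("inline ran on caller: " + inline.equals(caller));
            System.out.println("forked ran on caller: " + forked.equals(caller));
        } finally {
            pool.shutdown();
        }
    }
}
```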

docs/internal/DistributedArchitectureGuide.md

    
              One of the performance edges of netty is controlled memory allocation. Netty

              manages byte buffer pools and reuse them heavily. This performance gain comes

              with a cost. And cost is reference counting on developer's shoulders. Netty

              reads socket bytes into pooled byte-buffers and passes them to application. Then

Contributor

DaveCTurner Nov 15, 2025

Might be worth saying a bit more about how this looks in a heap dump - there's a pool of 1MiB byte[] objects which Netty slices into 16KiB pages, not all of which may be in use, so you have to account for this when investigating memory usage.
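
The accounting can be worked through with a small sketch. The 1MiB and 16KiB figures come from this comment. The chunk and page counts in `main` are made-up inputs, and the class is not part of Netty or Elasticsearch.

```java
public final class PooledChunkAccountingSketch {
    static final int CHUNK_BYTES = 1024 * 1024; // one 1MiB backing byte[] as seen in the heap dump
    static final int PAGE_BYTES = 16 * 1024;    // one 16KiB page sliced out of a chunk

    /** Number of pages that fit into a single chunk. */
    static int pagesPerChunk() {
        return CHUNK_BYTES / PAGE_BYTES;
    }

    /** Bytes the heap dump attributes to the pool that are not backing any in-use page. */
    static long idleBytes(int chunks, int pagesInUse) {
        return (long) chunks * CHUNK_BYTES - (long) pagesInUse * PAGE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(pagesPerChunk());   // 64
        System.out.println(idleBytes(8, 100)); // 6750208
    }
}
```

With 8 chunks and 100 pages in use, the dump shows 8MiB of `byte[]` while only about 1.6MiB backs live buffers, so the raw array sizes overstate what requests are actually holding.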

docs/internal/DistributedArchitectureGuide.md

    
              One of the performance edges of netty is controlled memory allocation. Netty

              manages byte buffer pools and reuse them heavily. This performance gain comes

              with a cost. And cost is reference counting on developer's shoulders. Netty

              reads socket bytes into pooled byte-buffers and passes them to application. Then

Contributor

DaveCTurner Nov 15, 2025

Also Netty technically reads the bytes into a direct buffer and then CopyBytesSocketChannel copies them into the pooled buffer.
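
The two-step read could look roughly like this in plain NIO. This is a sketch of the idea only and not the `CopyBytesSocketChannel` implementation. The `readInto` helper, the buffer sizes, and the in-memory channel standing in for a socket are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public final class CopyReadSketch {
    /**
     * Reads from the channel into a direct buffer, then copies the bytes into the
     * heap array at the given offset. Returns the byte count, or -1 at end of stream.
     */
    static int readInto(ReadableByteChannel channel, ByteBuffer direct, byte[] pooled, int offset) {
        try {
            direct.clear();
            // Never read more than the destination slice can hold.
            direct.limit(Math.min(direct.capacity(), pooled.length - offset));
            int n = channel.read(direct);
            if (n <= 0) {
                return n;
            }
            direct.flip();
            direct.get(pooled, offset, n); // the copy from direct memory into the heap array
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pooled = new byte[16 * 1024];                  // stands in for a pooled heap page
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);  // reusable direct buffer
        ReadableByteChannel channel = Channels.newChannel(
            new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII)));
        int n = readInto(channel, direct, pooled, 0);
        System.out.println(new String(pooled, 0, n, StandardCharsets.US_ASCII));
    }
}
```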

docs/internal/DistributedArchitectureGuide.md

    
              Another area of different networking clients is snapshotting. ES supports

              snapshotting to remote repositories. These repositories usually come with their

              own SDK and networking stack. For example AWS SDK comes with tomcat or netty,

              Azure with netty-based project-reactor, GCP uses default java HTTP

Contributor

DaveCTurner Nov 15, 2025

Netty needs a capital N (here and in several other places)

Labels

:Distributed Coordination/Distributed >non-issue Team:Distributed Coordination v9.3.0