Conversation

@nikhars
Contributor

@nikhars nikhars commented Nov 20, 2025

Implemented the basic version of the relay project config handler, which splits the project keys across multiple cells. Right now it does a round-robin split; this will change to a lookup into the locator later.

@nikhars nikhars requested a review from a team as a code owner November 20, 2025 19:52

Response::builder()
    .status(StatusCode::OK)
    .header(CONTENT_TYPE, "application/json")
Member

Do we need to copy any headers from the responses we receive from relay?

Contributor Author

Added code to handle the response headers. They are now copied from the first upstream that returns a successful response.

Comment on lines 321 to 322
#[serde(flatten)]
pub extra: HashMap<String, JsonValue>,
Member

I see no extra anywhere in this file

https://github.com/getsentry/sentry/blob/master/src/sentry/api/endpoints/relay/project_configs.py

Am I looking in the wrong place?

Contributor Author

The purpose of extra is to stay forward compatible with the protocol. Any field that the ingest-router does not care about (i.e., fields that should be treated as opaque) is stashed into extra. It's annotated with serde(flatten), so it is handled appropriately during both serialization and deserialization.
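For illustration, a minimal sketch of how the flattened map round-trips unknown fields; the ProjectConfig struct and its projectId field here are hypothetical, only the extra field mirrors the code under review:

use serde::{Deserialize, Serialize};
use serde_json::Value as JsonValue;
use std::collections::HashMap;

#[derive(Serialize, Deserialize)]
struct ProjectConfig {
    // Fields the ingest-router actually inspects would be listed here.
    #[serde(rename = "projectId")]
    project_id: u64,
    // Anything the router does not know about is captured on deserialization
    // and written back out verbatim on serialization.
    #[serde(flatten)]
    extra: HashMap<String, JsonValue>,
}

fn main() -> serde_json::Result<()> {
    // A payload containing a field the router has never heard of.
    let input = r#"{"projectId": 42, "newUpstreamField": {"nested": true}}"#;
    let config: ProjectConfig = serde_json::from_str(input)?;
    // The unknown field survives the round trip untouched.
    let output = serde_json::to_string(&config)?;
    assert!(output.contains("newUpstreamField"));
    Ok(())
}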

///
/// See module-level docs for complete protocol details, implementation strategy,
/// and request/response flow diagrams.
pub struct RelayProjectConfigsHandler {
Member

Since there will be a number of handlers that follow this pattern, could this be a trait to enforce that they all follow similar patterns?

e.g. something like

trait Handler<Req: Serialize + DeserializeOwned, Res: Serialize + DeserializeOwned> {
    fn split_requests(request: Req) -> Vec<Req>;
    fn merge_results(results: Vec<Res>) -> Res;
}

where Req and Res can be anything that implements Serialize + Deserialize. I think this forces the implementation of each endpoint to conform to some basic patterns and look the same, which is nice for readability. It could also enable more of the common parts - spawning multiple requests and waiting for them, timeouts, error handling/backpressure - to be done in a central place outside of the specific handler, as sketched below.
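To make that concrete, a rough sketch of what the shared driver could look like; the fan_out function and the Cell and Error types are hypothetical and not part of this PR, and the Handler trait just repeats the shape sketched above:

use serde::{de::DeserializeOwned, Serialize};

// Hypothetical placeholder types for the sketch; not part of this PR.
struct Cell;
struct Error;

trait Handler<Req: Serialize + DeserializeOwned, Res: Serialize + DeserializeOwned> {
    fn split_requests(request: Req) -> Vec<Req>;
    fn merge_results(results: Vec<Res>) -> Res;
}

// The common machinery (spawning one request per cell, timeouts, error
// handling) would then live once in a generic driver instead of in every handler.
async fn fan_out<Req, Res, H>(request: Req, _cells: &[Cell]) -> Result<Res, Error>
where
    Req: Serialize + DeserializeOwned,
    Res: Serialize + DeserializeOwned,
    H: Handler<Req, Res>,
{
    let sub_requests = H::split_requests(request);
    let mut responses: Vec<Res> = Vec::new();
    for _sub_request in sub_requests {
        // Send the sub-request to its cell, wait with the shared timeout
        // policy, and push the decoded response (elided in this sketch).
    }
    Ok(H::merge_results(responses))
}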

Contributor Author

I do not know of more use cases where this may be needed right now. If we see that use case pop up, then I can make this a generic trait and have multiple implementations. But I would rather do it when I see the need for it.

Member

@lynnagara lynnagara Nov 25, 2025

The other parts of the handler code look like they are already built with the assumption that there would be many handlers, though.

e.g. https://github.com/getsentry/synapse/blob/816dfb3b85e06b1ad1e2a2ffaa5479e615619950/ingest-router/src/config.rs#L50C10-L53 -- wouldn't there be many members on this enum, each of which would map 1:1 with a separate handler?

As for other use cases -- there is this list of endpoints in getsentry, each of which I would assume would be a separate handler implementation here:
https://github.com/getsentry/sentry/blob/c3e99bba9310b1b8385ac04dae75baaaadfe61fe/src/sentry/api/urls.py#L1087-L1128

Motivation for this: you can see in the diff of this PR the complexity of what you have implemented in the relay_project_config_handler module, as well as all the conversations here around timeout handling and the like. It seems easy to get wrong, and I don't think we'd want people to re-implement it for every endpoint that needs to be added to synapse now and in the future.

    mut handle,
    public_keys,
} = task_with_keys;
let task_result = timeout(Duration::from_secs(30), &mut handle).await;
Member

I sort of wonder if we need more than 1 timeout value for various scenarios. The specific cases I'm thinking about are:

  • I think a really long value like 30 secs is fine if we have no responses from any cell -- without any data, relay is making no progress anyway, so it's not any worse to wait longer for a response
  • If one cell returns in, say, 200 milliseconds, and the other cell is down, waiting for the full 30 seconds for the other cell may be a serious problem. Since the requests are chunked, this becomes a very large number of 30-second waits back to back. If we returned sooner, the cell that was functional would be able to make progress much sooner. I'm worried that such a long wait for an unresponsive cell may risk effectively taking down all other cells as well.

Contributor Author

I have added the capability to do an adaptive timeout. By default, the slower tasks get 5 more seconds than the fastest task to complete their work. That said, I would imagine that when we roll this out in production we would need a higher value, since we want to avoid the scenario of an empty cell always returning first and never letting the existing cells respond.
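A minimal sketch of that adaptive behaviour, assuming the per-cell tasks have already been spawned; the OVERALL and GRACE constants are illustrative values, not the actual configuration, and note that this still waits in cell order, which is the concern raised further down:

use std::time::{Duration, Instant};
use tokio::time::timeout;

const OVERALL: Duration = Duration::from_secs(30); // budget until the first task completes
const GRACE: Duration = Duration::from_secs(5);    // extra time granted to the slower tasks

async fn collect(handles: Vec<tokio::task::JoinHandle<String>>) -> Vec<String> {
    let mut results = Vec::new();
    let mut deadline: Option<Instant> = None;

    for handle in handles {
        // Until the first task completes, allow the full overall timeout;
        // afterwards, only wait until the adaptive deadline.
        let budget = match deadline {
            None => OVERALL,
            Some(d) => d.saturating_duration_since(Instant::now()),
        };
        if let Ok(Ok(value)) = timeout(budget, handle).await {
            results.push(value);
            // The first completed task establishes the deadline for the rest.
            deadline.get_or_insert_with(|| Instant::now() + GRACE);
        }
    }
    results
}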

Member

I sort of wonder if we need more than 1 timeout value for various scenarios.

Having multiple configuration points for different timeouts seems wise. Waiting here would be sequential, which isn't ideal. Perhaps we could combine a JoinSet, to wait on the task futures concurrently, with a timeout, so that we get as many responses as possible within the 30s budget?

Contributor Author

I have a naive implementation right now. Working on making it robust.

/// Failed keys are added to the pending array for retry.
///
/// Global config is selected from the highest priority cell based on cells.cell_list order.
async fn collect_and_merge(&self, tasks: Vec<TaskWithKeys>, cells: &Cells) -> MergedResults {
Member

If I'm reading correctly, doesn't this mean that we are waiting for each task and applying the modified timeout in the order that cells are defined?

  • So if cell 1 returns quickly and cell 2 is down -> we correctly apply the shorter timeout to the second request and return to relay faster.
  • If cell 2 returns quickly and cell 1 does not -> we still wait the full 30 seconds (the longer timeout) for cell 1, even though there was a cell 2 response much sooner.

let now = Instant::now();
if now >= deadline {
    // Deadline already passed, use minimal timeout
    Duration::from_millis(1)
Member

Is there something special about 1 millisecond here? Curious why this isn't Duration::ZERO.

})?;

// Split publicKeys across upstreams
let split_requests = self.split_keys_by_upstream(&request_data, cells);
Member

@lynnagara lynnagara Nov 25, 2025

Since it's not needed afterward, you could pass the owned request_data into this function and save yourself a clone.
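i.e., roughly this shape; RequestData, SplitRequest, and Cells are stand-ins for whatever the real types are, and the handler is reduced to a unit struct for the sketch:

struct RequestData;
struct SplitRequest;
struct Cells;
struct RelayProjectConfigsHandler;

impl RelayProjectConfigsHandler {
    // Taking request_data by value moves it into the function, so the caller
    // no longer needs a .clone() once it stops using the value afterwards.
    fn split_keys_by_upstream(&self, request_data: RequestData, cells: &Cells) -> Vec<SplitRequest> {
        let _ = (request_data, cells);
        Vec::new()
    }
}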

Comment on lines 451 to 453
for (name, value) in base_request.headers() {
    req_builder = req_builder.header(name, value);
}
Member

Were you planning to include any of the header transformations that the proxy does here? E.g. adding the Via header and stripping hop-by-hop headers.
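For reference, a hedged sketch of what those transformations usually look like with the http crate; the hop-by-hop set is the RFC 7230 list, the Via value is illustrative, and headers named in the incoming Connection header would also need to be dropped, which is elided here:

use http::header::{
    HeaderMap, HeaderName, HeaderValue, CONNECTION, PROXY_AUTHENTICATE, PROXY_AUTHORIZATION,
    TE, TRAILER, TRANSFER_ENCODING, UPGRADE, VIA,
};

// Sketch only: strip the RFC 7230 hop-by-hop headers and append a Via entry
// before forwarding a request upstream.
fn forward_headers(incoming: &HeaderMap) -> HeaderMap {
    let hop_by_hop = [
        CONNECTION,
        HeaderName::from_static("keep-alive"),
        PROXY_AUTHENTICATE,
        PROXY_AUTHORIZATION,
        TE,
        TRAILER,
        TRANSFER_ENCODING,
        UPGRADE,
    ];

    let mut out = incoming.clone();
    for name in &hop_by_hop {
        out.remove(name);
    }
    // The protocol version and pseudonym in the Via value are illustrative.
    out.append(VIA, HeaderValue::from_static("1.1 ingest-router"));
    out
}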

let mut headers_by_cell: HashMap<String, HeaderMap> = HashMap::new();
let mut additional_deadline: Option<Instant> = None;

for task_with_keys in tasks {
Member

How about using a tokio::JoinSet here?

It allows you to easily get the first result within the global timeout, regardless of the cell order, e.g.

let first = timeout(global_timeout, task_set.join_next()).await;

Then you can loop through and keep calling join_next with the smaller timeout

let next = timeout(reduced_timeout, task_set.join_next()).await;

This way the tasks return in the order they are completed
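Putting those two lines together, a sketch of the full drain loop; the function and variable names are illustrative, and error cases such as panicked tasks are simply skipped here:

use std::time::Duration;
use tokio::task::JoinSet;
use tokio::time::timeout;

// Drain a JoinSet in completion order: the first result gets the full global
// timeout, and every subsequent result only gets the reduced timeout.
async fn collect_in_completion_order(
    mut task_set: JoinSet<String>,
    global_timeout: Duration,
    reduced_timeout: Duration,
) -> Vec<String> {
    let mut results = Vec::new();

    // Wait up to the global timeout for whichever task finishes first.
    if let Ok(Some(Ok(first))) = timeout(global_timeout, task_set.join_next()).await {
        results.push(first);
    }

    // The remaining tasks share the smaller per-result timeout; they are still
    // returned in whatever order they actually complete, not cell order.
    while let Ok(Some(Ok(next))) = timeout(reduced_timeout, task_set.join_next()).await {
        results.push(next);
    }
    results
}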
