Conversation

@alexfrs69 commented Apr 29, 2025

This PR introduces a feature to display a configurable "fake online" MOTD when a backend server is offline and auto-scale-up is enabled. When a status ping is received and the backend is not available, the router responds with a custom MOTD instead of appearing offline.
It also "hangs" the client while retrying to connect to the backend, until a defined timeout is reached, at which point the connection is closed.
Overall, this improves the user experience during backend startup.

Related: #36 #21

Key Changes

For fake online:

  • Adds configuration options for enabling/disabling the fake online MOTD and customizing its text.
  • Ensures the custom MOTD is only sent if auto-scale-up is enabled for the backend.

For status requests:

  • Tries to connect to the backend once with a short timeout.
  • If the backend is unavailable, returns the fake online MOTD.

For connect requests:

  • If auto-scaler is enabled, the client connection is kept open ("hangs") while retrying the backend connection until a configurable timeout is reached (see the sketch after this list).
  • If auto-scaler is disabled, only a single backend connection attempt is made.
  • Lays the groundwork for further enhancements, such as caching backend status and dynamic MOTD updates.
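
A minimal sketch of that retry behavior, assuming a hypothetical helper; dialBackendWithRetry and its parameters are illustrative names, not the PR's actual identifiers:

package sketch

import (
    "context"
    "fmt"
    "net"
    "time"
)

// dialBackendWithRetry keeps the client "hanging" by retrying the backend until
// it answers or the overall timeout expires. The retry interval doubles as the
// per-attempt dial timeout for simplicity.
func dialBackendWithRetry(ctx context.Context, addr string, retryInterval, timeout time.Duration) (net.Conn, error) {
    deadline := time.Now().Add(timeout)
    for {
        conn, err := net.DialTimeout("tcp", addr, retryInterval)
        if err == nil {
            return conn, nil // backend is up; hand the connection back to the router
        }
        if time.Now().After(deadline) {
            return nil, fmt.Errorf("backend %s not reachable within %s: %w", addr, timeout, err)
        }
        select {
        case <-ctx.Done():
            return nil, ctx.Err() // client disconnected or the router is shutting down
        case <-time.After(retryInterval):
            // pause before the next attempt so the backend isn't hammered
        }
    }
}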

Questions

Version Response

Currently, the status response uses:

Version: StatusVersion{
    Name:     "1.21.5",
    Protocol: 770,
}

Should the version and protocol be configurable, or is it fine to hardcode the latest supported version? What version/protocol should we default to for best compatibility?
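
For reference, protocol 770 is the value used by Java Edition 1.21.5, matching the hardcoded name above. If the pair were made configurable, a minimal sketch could look like the following; the FakeStatusConfig type and its field names are assumptions, not mc-router's actual configuration:

package sketch

// FakeStatusConfig is a hypothetical config section; the names are
// illustrative, not mc-router's actual flags.
type FakeStatusConfig struct {
    VersionName     string // e.g. "1.21.5"
    VersionProtocol int    // e.g. 770, the protocol number used by 1.21.5
}

// withDefaults falls back to the values the PR currently hardcodes, so
// behavior is unchanged unless the user overrides them.
func (c FakeStatusConfig) withDefaults() FakeStatusConfig {
    if c.VersionName == "" {
        c.VersionName = "1.21.5"
    }
    if c.VersionProtocol == 0 {
        c.VersionProtocol = 770
    }
    return c
}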

Next Steps

  • The ultimate goal is to cache backend server status and possibly enhance the MOTD to indicate when a server is sleeping or waking up.

@alexfrs69 alexfrs69 marked this pull request as draft April 29, 2025 23:30
@itzg (Owner) commented Apr 29, 2025

Regarding questions:

Version Response

It would be ideal to ping the backend once at startup and retain its version/protocol; however, in the spirit of incremental PRs I'm fine with it being hardcoded at first. I'll leave it up to you for this initial PR.

@alexfrs69 alexfrs69 changed the title feat: add a fake status response when auto-scale-up is enabled feat: add a fake status response and don't terminate the client request while auto-scale up Apr 29, 2025
@alexfrs69 (Author) commented Apr 29, 2025

Thanks for your input @itzg

It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
It would be like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.

@itzg (Owner) commented Apr 30, 2025

It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
It would be like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.

Yeah, it now looks like it would be good to bundle up the config parameters to avoid the argument/field growth. Since waker is always non-nil, are you pondering shifting the new fields into that? If so, that sounds good to me.
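
A minimal sketch of the bundling idea, with made-up field names; the real Connector and waker types in mc-router differ:

package sketch

import "time"

// FakeOnlineConfig groups the new knobs so the Connector (or the waker) takes
// one field instead of a growing list of individual parameters. Field names
// are illustrative.
type FakeOnlineConfig struct {
    Enabled             bool
    MOTD                string
    AutoScaleUp         bool
    BackendRetryTimeout time.Duration
}

type Connector struct {
    // ...existing fields elided...
    FakeOnline FakeOnlineConfig
}

Whether the bundle lives on the waker or on the Connector is a design choice; the point is that call sites pass one value instead of several.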

@alexfrs69 (Author) commented Apr 30, 2025

I'm pretty new to Golang, so sorry if the code is a bit messy.
I think my time.Sleep(backendRetryInterval) isn't the right approach, since it blocks the whole main thread?
There is a lot of repeated code, so plenty of room for improvement.
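
For what it's worth, time.Sleep only blocks the goroutine it runs in, so the concern really depends on where the retry loop executes: if it runs inside the shared accept loop it stalls every other client, whereas a per-connection goroutine can sleep freely. A sketch of that common pattern, with hypothetical names (acceptLoop, handleClient) rather than mc-router's actual functions:

package sketch

import (
    "context"
    "log"
    "net"
)

// acceptLoop hands each client off to its own goroutine, so any time.Sleep or
// retry wait inside handleClient only blocks that one connection.
func acceptLoop(ctx context.Context, listener net.Listener, handleClient func(context.Context, net.Conn)) {
    for {
        clientConn, err := listener.Accept()
        if err != nil {
            log.Printf("accept loop stopping: %v", err)
            return
        }
        go func(c net.Conn) {
            defer c.Close()
            handleClient(ctx, c)
        }(clientConn)
    }
}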

@itzg (Owner) commented Apr 30, 2025

I'm pretty new to Golang, so sorry if the code is a bit messy. I think my time.Sleep(backendRetryInterval) isn't the right approach, since it blocks the whole main thread? There is a lot of repeated code, so plenty of room for improvement.

That's totally fine. I'll point out comments where I can, and I can push changes to the PR if there are any that are too hard to explain.

@itzg (Owner) commented May 1, 2025

Sorry for the delay getting back to your PR. I'm focusing on #405, and after that's merged I can shift to yours.

@alexfrs69 (Author)

Sorry for the delay getting back to your PR. I'm focusing on #405, and after that's merged I can shift to yours.

No problem 😃
I'll work on the refactor of the config in the meantime

sc.cache[serverAddress] = status
}

func (sc *StatusCache) Delete(serverAddress string) {
@alexfrs69 (Author) commented May 1, 2025

Do we need to delete the cached status of servers that are no longer in the routes mappings?

@itzg (Owner)

Ideally, yes. Otherwise it would be a (very, very small) memory leak, right?

LastUpdated time.Time
}

type StatusCache struct {
@alexfrs69 (Author) commented May 1, 2025

Heavily inspired by the routes mappings; maybe we can simplify a bit by removing the named mu field, like routesImpl?

@itzg (Owner)

Yeah, that would make it consistent. I go back and forth about the unnamed/inheritance style like I used in routesImpl 😄
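
For reference, the two styles being discussed: a named mu field versus embedding the mutex the way routesImpl does. The types below are a simplified sketch, not the PR's exact code:

package sketch

import (
    "sync"
    "time"
)

type cachedStatus struct {
    LastUpdated time.Time
    // ...status fields elided...
}

// Named-field style: every lock call goes through sc.mu.
type StatusCacheNamed struct {
    mu    sync.RWMutex
    cache map[string]cachedStatus
}

// Embedded style, as in routesImpl: the cache value itself is lockable.
type StatusCacheEmbedded struct {
    sync.RWMutex
    cache map[string]cachedStatus
}

func (sc *StatusCacheEmbedded) Get(addr string) (cachedStatus, bool) {
    sc.RLock()
    defer sc.RUnlock()
    status, ok := sc.cache[addr]
    return status, ok
}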

@alexfrs69 alexfrs69 changed the title feat: add a fake status response and don't terminate the client request while auto-scale up feat(auto-scale-up): add fake online, server status cache, and seamless connection during scale-up May 1, 2025
@itzg (Owner) commented May 2, 2025

The previous PR is now merged, so I can focus on yours once you've resolved the merge conflicts.

@alexfrs69 alexfrs69 force-pushed the feat/autoscale-status-response branch from 35e42c7 to 9f3262c Compare May 3, 2025 01:42
@alexfrs69 (Author)

Conflicts should be fixed @itzg

@itzg (Owner) commented May 3, 2025

Thanks @alexfrs69

@Sammcb, since you've been in some of this code recently, I was wondering if you could help look at this PR as well?

@itzg (Owner) commented May 3, 2025

@alexfrs69 just noticed the PR is still in draft. Want to mark it ready for review?

@alexfrs69 alexfrs69 marked this pull request as ready for review May 3, 2025 14:22
logrus.WithField("cachedStatus", cachedStatus).Debug("Using cached status")
status = &mcproto.StatusResponse{
    Players:     cachedStatus.Players,
    Description: cachedStatus.Description,
@alexfrs69 (Author)

Currently, FakeOnlineMOTD is displayed only if we don't have a cached status for the backend.
Would it be nice to add an option to join the Description with FakeOnlineMOTD?

}
}

writeStatusErr := mcproto.WriteStatusResponse(
@Sammcb (Contributor)

Just curious: if this occurs, does this logic need to be changed (amount, err := io.Copy(backendConn, preReadContent))? I'm wondering how this would work if the router has already returned a response to the client.

@alexfrs69 (Author) commented May 3, 2025

If we write a status to the client, we return right after, so I don't think io.Copy(...) will be reached.

@alexfrs69 (Author) commented May 3, 2025

Thanks for the reviews @itzg @Sammcb!

I've addressed your requested changes, please take another look 🙂
I haven't tested thoroughly yet, but I noticed an issue:

If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?

On the cache side, I think it would be better to:

  • Cache all backends at startup
  • On each player status request, refresh the cache only if the TTL has expired
  • When a downscale is triggered, cache the status before setting replicas to 0

This should reduce backend overhead (a rough sketch of the TTL check follows). What do you think?
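
A rough sketch of the TTL check from that plan; StatusEntry, the ttl field, and GetFresh are illustrative names only:

package sketch

import (
    "sync"
    "time"
)

type StatusEntry struct {
    LastUpdated time.Time
    // ...cached status fields elided...
}

type StatusCache struct {
    sync.RWMutex
    cache map[string]StatusEntry
    ttl   time.Duration
}

// GetFresh returns the cached entry and whether it is still within the TTL;
// the caller refreshes the backend's status only when fresh is false.
func (sc *StatusCache) GetFresh(serverAddress string) (entry StatusEntry, fresh bool) {
    sc.RLock()
    defer sc.RUnlock()
    entry, ok := sc.cache[serverAddress]
    if !ok {
        return StatusEntry{}, false
    }
    return entry, time.Since(entry.LastUpdated) < sc.ttl
}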

@Sammcb (Contributor) commented May 4, 2025

@alexfrs69 So, to make sure I understand the high-level goals: the fake online status (and associated caching) and fake online MOTD really only come into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?

If that is the case, my understanding is that the final goal is to capture information about a backend server's state to display in status requests when that server is down. One case I don't think this solves is when all backend servers are scaled down at the time mc-router starts up: unless it wakes up all backend servers, it would not be able to get info to show to players in the fake statuses. There are also similar cases where a user might take down a backend server to make updates (version upgrades, config changes, etc.) and then leave it down until a player tries to connect; in that case, those updates might not match the status shown to players.

I'm wondering what your thoughts would be on, instead of trying to get all this info from the server, simply adding a config file the user could supply with fallback status info? Obviously, this could drift from the running server's status too, and it would be up to the user running the servers/router to keep things in sync.

Thank you for doing so much extensive testing and research on these changes so far! I just wanted to suggest the config-option route because I recently ran into this issue when wanting to set up allow/deny lists for server autoscalers. I realized it would be very difficult to get the information from the servers directly, so I opted to simply let users enter it as config for the mc-router. I don't mean to intrude on your use case, though; it's just something I thought of as another approach.

@alexfrs69 (Author) commented May 4, 2025

So, to make sure I understand the high-level goals: the fake online status (and associated caching) and fake online MOTD really only come into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?

Exactly

I like the idea! I definitely considered the limitation that mc-router can't get the status if the backend is stopped. Relying only on configuration for status at scale could be a bit heavy to manage, though.

We could accept this limitation for now, or, as you suggested, add support for config files so users can provide fallback status info for backends that aren't in the cache. This way, if a status isn't cached, we can return the configured fallback. It offers flexibility for those who need it, without requiring config management for every backend by default.
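
A sketch of what such a fallback file could carry, purely illustrative; the field names and the idea of keying entries by server address are assumptions, not an agreed format:

package sketch

// FallbackStatus is what the router could return for a backend that is scaled
// down and has no cached status yet.
type FallbackStatus struct {
    MOTD            string `json:"motd"`
    VersionName     string `json:"versionName"`
    VersionProtocol int    `json:"versionProtocol"`
    MaxPlayers      int    `json:"maxPlayers"`
}

// FallbackConfig maps a server address (as used in the route mappings) to its
// fallback status; entries are optional and only consulted on a cache miss.
type FallbackConfig struct {
    Servers map[string]FallbackStatus `json:"servers"`
}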

@Sammcb (Contributor) commented May 4, 2025

Yeah, I think either accepting that as a limitation or adding the config file input could work!

In terms of getting the data dynamically, I really like this approach (made some slight modifications to your initial suggestion just to clarify):

  • Cache all (available) backends at startup
  • On each player status request, refresh the cache only if the TTL has expired or an entry in the cache does not exist
  • When a downscale is triggered, cache the status before setting replicas to 0

@alexfrs69 (Author) commented May 4, 2025

I can work on implementing the config files tomorrow 🙂

It would also be great to automatically detect any new backends added via k8s labels and cache their status right away.
Likewise, if a backend is deleted from the mappings, we should remove its cache entry as well.

@Sammcb (Contributor) commented May 4, 2025

If you hook that into the router's CreateMapping, etc., then that would be totally doable! (I did something very similar for the downscaler in #405.)
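
A sketch of that hook idea; the method shapes below are made up, since the real CreateMapping/DeleteMapping signatures in mc-router may differ:

package sketch

// statusCache is the minimal surface the routes need; the real StatusCache in
// the PR has more methods.
type statusCache interface {
    Update(serverAddress string)
    Delete(serverAddress string)
}

type routes struct {
    mappings map[string]string
    cache    statusCache
}

func (r *routes) CreateMapping(serverAddress, backend string) {
    r.mappings[serverAddress] = backend
    r.cache.Update(serverAddress) // warm the cache as soon as the route exists
}

func (r *routes) DeleteMapping(serverAddress string) {
    delete(r.mappings, serverAddress)
    r.cache.Delete(serverAddress) // drop the stale entry so it can't be served
}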


var cacheInterval time.Duration
if config.AutoScale.CacheStatus {
    cacheInterval, err = time.ParseDuration(config.AutoScale.CacheStatusInterval)
@itzg (Owner)

It seems like this config processing, and ultimately cacheInterval, should be fed into server.ConnectorConfig.

Comment on lines +271 to +275
connector.StatusCache.StartUpdater(connector, cacheInterval, func() map[string]string {
    mappings := server.Routes.GetMappings()
    logrus.WithField("mappings", mappings).Debug("Updating status cache with mappings")
    return mappings
})
@itzg (Owner)

...and that starting of the cache updater should be encapsulated inside of the connector.WaitForConnections method.

As it is, it's leaking a bit too much of runtime behavior into the main method.
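
A sketch of that suggestion, with illustrative names; the real ConnectorConfig and WaitForConnections differ:

package sketch

import "time"

// ConnectorConfig carries the cache settings alongside the other connector
// options, so main only builds the config and hands it over.
type ConnectorConfig struct {
    CacheStatus         bool
    CacheStatusInterval time.Duration
    // ...other connector settings elided...
}

type Connector struct {
    config ConnectorConfig
    // ...other fields elided...
}

// WaitForConnections owns the runtime behavior: it starts the cache updater
// itself instead of leaving that to main.
func (c *Connector) WaitForConnections() {
    if c.config.CacheStatus {
        go c.runStatusCacheUpdater(c.config.CacheStatusInterval)
    }
    // ...accept loop elided...
}

func (c *Connector) runStatusCacheUpdater(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        // refresh cached statuses for all current mappings
    }
}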

Players: mcproto.StatusPlayers{
    Max:    0,
    Online: 0,
    Sample: []mcproto.PlayerEntry{},
@itzg (Owner)

Seems to be a compile error here. Probably meant StatusPlayerEntry?

const (
    handshakeTimeout = 5 * time.Second
    backendTimeout   = 30 * time.Second
@itzg (Owner)

Can you double check this is less than the timeout that the Minecraft client seems to run when attempting to connect?

Comment on lines +710 to +711
Name: "UNKNOWN",
Protocol: 0,
@itzg (Owner)

Have you been able to see what the Minecraft client thinks of these version and protocol values?

delete(sc.cache, serverAddress)
}

func (sc *StatusCache) updateAll(getBackends func() map[string]string) {
@itzg (Owner)

Nice encapsulation of passing in the getter function! Since it gets passed in a couple of places, I wonder if it could be provided once to NewStatusCache?
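
A sketch of providing the getter once at construction; the signature and the simplified value type are assumptions:

package sketch

import "sync"

// StatusCache stores the backend getter once, so updateAll (and the periodic
// updater) no longer needs it passed on every call. The string value type is a
// simplification for the sketch.
type StatusCache struct {
    sync.RWMutex
    cache       map[string]string
    getBackends func() map[string]string
}

func NewStatusCache(getBackends func() map[string]string) *StatusCache {
    return &StatusCache{
        cache:       make(map[string]string),
        getBackends: getBackends,
    }
}

func (sc *StatusCache) updateAll() {
    for serverAddress, backend := range sc.getBackends() {
        // the real code would ping the backend and store its status here
        sc.Lock()
        sc.cache[serverAddress] = backend
        sc.Unlock()
    }
}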

@itzg (Owner) commented May 4, 2025

If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?

That's somewhat by design, since Connector.acceptConnections runs its accept loop in a single goroutine. I couldn't find where the accept queue length can be specified in the Go stdlib docs, but in any case I believe the OS layer constrains the number of pending connections to accept, so I think it's safe enough not to handle it explicitly.

@alexfrs69 (Author)

Hey guys, just letting you know that I don't have time to invest in this PR this week.
I'll work on it next week for sure 🙂

@L-Wehmschulte

Hi everyone. I hate to be that guy, but I was wondering about the state of this PR. Do you still intend to finish this feature, @alexfrs69?
I'd love to see this implemented. I recently moved from a docker-compose setup with lazymc, which showed a nice MOTD and kept the user hanging while the server was starting up, over to Kubernetes, and the experience of the server showing up as offline and players getting kicked immediately while it was starting up led to a lot of confusion amongst my friends.
Thank you for all the effort you all have put in so far!

@itzg (Owner) commented Oct 10, 2025

Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:

showed a nice MOTD and kept the user hanging

I had forgotten a lot of this enhancement, and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.

I'll look at this more this weekend myself to refresh my memory.

@L-Wehmschulte

Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:

showed a nice MOTD and kept the user hanging

I had forgotten a lot of this enhancement, and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.

I'll look at this more this weekend myself to refresh my memory.

That would be awesome. Thank you so much!

@alexfrs69 (Author)

Hi everyone, sorry I haven’t been able to finish this PR yet. I’ll try to get back to it whenever I have some free time, maybe this weekend.

Most of the features are working fine, but it still needs some polishing, especially around implementing backend status config, a status cache according to our plan, and thoroughly testing everything out (Minecraft timeouts, status protocol, etc.).
