Conversation

@alexfrs69 commented Apr 29, 2025

This PR introduces a feature to display a configurable "fake online" MOTD when a backend server is offline and auto-scale-up is enabled. When a status ping is received and the backend is not available, the router responds with a custom MOTD instead of appearing offline.
It also "hangs" the client while retrying to connect to the backend, until a defined timeout is reached, at which point the connection is closed.
Overall, this improves the user experience during backend startup.

Related: #36 #21

Key Changes

For fake online:

  • Adds configuration options for enabling/disabling the fake online MOTD and customizing its text.
  • Ensures the custom MOTD is only sent if auto-scale-up is enabled for the backend.

For status requests:

  • Tries to connect to the backend once with a short timeout.
  • If the backend is unavailable, returns the fake online MOTD.

For connect requests:

  • If auto-scaler is enabled, the client connection is kept open ("hangs") while retrying the backend connection until a configurable timeout is reached (see the sketch after this list).
  • If auto-scaler is disabled, only a single backend connection attempt is made.
  • Lays the groundwork for further enhancements, such as caching backend status and dynamic MOTD updates.
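
A minimal sketch of that retry behavior, assuming a hypothetical helper; dialBackendWithRetry and its parameters are illustrative names, not the PR's actual identifiers:

package sketch

import (
    "context"
    "fmt"
    "net"
    "time"
)

// dialBackendWithRetry keeps the client "hanging" by retrying the backend until
// it answers or the overall timeout expires. The retry interval doubles as the
// per-attempt dial timeout for simplicity.
func dialBackendWithRetry(ctx context.Context, addr string, retryInterval, timeout time.Duration) (net.Conn, error) {
    deadline := time.Now().Add(timeout)
    for {
        conn, err := net.DialTimeout("tcp", addr, retryInterval)
        if err == nil {
            return conn, nil // backend is up; hand the connection back to the router
        }
        if time.Now().After(deadline) {
            return nil, fmt.Errorf("backend %s not reachable within %s: %w", addr, timeout, err)
        }
        select {
        case <-ctx.Done():
            return nil, ctx.Err() // client disconnected or the router is shutting down
        case <-time.After(retryInterval):
            // pause before the next attempt so the backend isn't hammered
        }
    }
}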

Questions

Version Response

Currently, the status response uses:

Version: StatusVersion{
    Name:     "1.21.5",
    Protocol: 770,
}

Should the version and protocol be configurable, or is it fine to hardcode the latest supported version? What version/protocol should we default to for best compatibility?
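
For reference, protocol 770 is the value used by Java Edition 1.21.5, matching the hardcoded name above. If the pair were made configurable, a minimal sketch could look like the following; the FakeStatusConfig type and its field names are assumptions, not mc-router's actual configuration:

package sketch

// FakeStatusConfig is a hypothetical config section; the names are
// illustrative, not mc-router's actual flags.
type FakeStatusConfig struct {
    VersionName     string // e.g. "1.21.5"
    VersionProtocol int    // e.g. 770, the protocol number used by 1.21.5
}

// withDefaults falls back to the values the PR currently hardcodes, so
// behavior is unchanged unless the user overrides them.
func (c FakeStatusConfig) withDefaults() FakeStatusConfig {
    if c.VersionName == "" {
        c.VersionName = "1.21.5"
    }
    if c.VersionProtocol == 0 {
        c.VersionProtocol = 770
    }
    return c
}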

Next Steps

  • The ultimate goal is to cache backend server status and possibly enhance the MOTD to indicate when a server is sleeping or waking up.

@alexfrs69 alexfrs69 marked this pull request as draft April 29, 2025 23:30
@itzg (Owner) commented Apr 29, 2025

Regarding questions:

Version Response

It would be ideal to ping the backend once at startup and retain its version/protocol; however, in the spirit of incremental PRs I'm fine with it being hardcoded at first. I'll leave it up to you for this initial PR.

@alexfrs69 alexfrs69 changed the title feat: add a fake status response when auto-scale-up is enabled feat: add a fake status response and don't terminate the client request while auto-scale up Apr 29, 2025
@alexfrs69 (Author) commented Apr 29, 2025

Thanks for your input @itzg

It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
It would be like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.

@itzg (Owner) commented Apr 30, 2025

It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
It would be like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.

Yeah, it now looks like it would be good to bundle up the config parameters to avoid the argument/field growth. Since waker is always non-nil, are you pondering shifting the new fields into that? If so, that sounds good to me.
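
A minimal sketch of the bundling idea, with made-up field names; the real Connector and waker types in mc-router differ:

package sketch

import "time"

// FakeOnlineConfig groups the new knobs so the Connector (or the waker) takes
// one field instead of a growing list of individual parameters. Field names
// are illustrative.
type FakeOnlineConfig struct {
    Enabled             bool
    MOTD                string
    AutoScaleUp         bool
    BackendRetryTimeout time.Duration
}

type Connector struct {
    // ...existing fields elided...
    FakeOnline FakeOnlineConfig
}

Whether the bundle lives on the waker or on the Connector is a design choice; the point is that call sites pass one value instead of several.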

@alexfrs69 (Author) commented Apr 30, 2025

I'm pretty new to Golang, so sorry if the code is a bit messy.
I think my time.Sleep(backendRetryInterval) isn't the right approach, since it blocks the whole main thread?
There is a lot of repeated code, so plenty of room for improvement.
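
For what it's worth, time.Sleep only blocks the goroutine it runs in, so the concern really depends on where the retry loop executes: if it runs inside the shared accept loop it stalls every other client, whereas a per-connection goroutine can sleep freely. A sketch of that common pattern, with hypothetical names (acceptLoop, handleClient) rather than mc-router's actual functions:

package sketch

import (
    "context"
    "log"
    "net"
)

// acceptLoop hands each client off to its own goroutine, so any time.Sleep or
// retry wait inside handleClient only blocks that one connection.
func acceptLoop(ctx context.Context, listener net.Listener, handleClient func(context.Context, net.Conn)) {
    for {
        clientConn, err := listener.Accept()
        if err != nil {
            log.Printf("accept loop stopping: %v", err)
            return
        }
        go func(c net.Conn) {
            defer c.Close()
            handleClient(ctx, c)
        }(clientConn)
    }
}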

@itzg (Owner) commented Apr 30, 2025

I'm pretty new to Golang, so sorry if the code is a bit messy. I think my time.Sleep(backendRetryInterval) isn't the right approach, since it blocks the whole main thread? There is a lot of repeated code, so plenty of room for improvement.

That's totally fine. I'll point out comments where I can, and I can push changes to the PR if there are any that are too hard to explain.

@itzg (Owner) commented May 1, 2025

Sorry for the delay getting back to your PR. I'm focusing on #405, and after that's merged I can shift to yours.

@alexfrs69 (Author)

Sorry for the delay getting back to your PR. I'm focusing on #405, and after that's merged I can shift to yours.

No problem 😃
I'll work on the refactor of the config in the meantime

sc.cache[serverAddress] = status
}

func (sc *StatusCache) Delete(serverAddress string) {
@alexfrs69 (Author) commented May 1, 2025

Do we need to delete the cached status of servers that are no longer in the routes mappings?

@itzg (Owner)

Ideally, yes. Otherwise it would be a (very, very small) memory leak, right?

LastUpdated time.Time
}

type StatusCache struct {
@alexfrs69 (Author) commented May 1, 2025

Heavily inspired by the routes mappings; maybe we can simplify a bit by removing the named mu field, like routesImpl?

@itzg (Owner)

Yeah, that would make it consistent. I go back and forth about the unnamed/inheritance style like I used in routesImpl 😄
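
For reference, the two styles being discussed: a named mu field versus embedding the mutex the way routesImpl does. The types below are a simplified sketch, not the PR's exact code:

package sketch

import (
    "sync"
    "time"
)

type cachedStatus struct {
    LastUpdated time.Time
    // ...status fields elided...
}

// Named-field style: every lock call goes through sc.mu.
type StatusCacheNamed struct {
    mu    sync.RWMutex
    cache map[string]cachedStatus
}

// Embedded style, as in routesImpl: the cache value itself is lockable.
type StatusCacheEmbedded struct {
    sync.RWMutex
    cache map[string]cachedStatus
}

func (sc *StatusCacheEmbedded) Get(addr string) (cachedStatus, bool) {
    sc.RLock()
    defer sc.RUnlock()
    status, ok := sc.cache[addr]
    return status, ok
}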

@alexfrs69 alexfrs69 changed the title feat: add a fake status response and don't terminate the client request while auto-scale up feat(auto-scale-up): add fake online, server status cache, and seamless connection during scale-up May 1, 2025
@itzg (Owner) commented May 2, 2025

The previous PR is now merged, so I can focus on yours once you've resolved the merge conflicts.

@alexfrs69 alexfrs69 force-pushed the feat/autoscale-status-response branch from 35e42c7 to 9f3262c Compare May 3, 2025 01:42
@alexfrs69 (Author)

Conflicts should be fixed @itzg

@itzg (Owner) commented May 3, 2025

Thanks @alexfrs69

@Sammcb, since you've been in some of this code recently, I was wondering if you could help look at this PR as well?

@itzg (Owner) commented May 3, 2025

@alexfrs69 just noticed the PR is still in draft. Want to mark it ready for review?

@alexfrs69 alexfrs69 marked this pull request as ready for review May 3, 2025 14:22
logrus.WithField("cachedStatus", cachedStatus).Debug("Using cached status")
status = &mcproto.StatusResponse{
    Players:     cachedStatus.Players,
    Description: cachedStatus.Description,
@alexfrs69 (Author)

Currently, FakeOnlineMOTD is displayed only if we don't have a cached status for the backend.
Would it be nice to add an option to join the Description with FakeOnlineMOTD?

}
}

writeStatusErr := mcproto.WriteStatusResponse(
@Sammcb (Contributor)

Just curious: if this occurs, does this logic need to be changed (amount, err := io.Copy(backendConn, preReadContent))? I'm wondering how this would work if the router has already returned a response to the client.

@alexfrs69 (Author) commented May 3, 2025

If we write a status to the client, we return right after, so I don't think io.Copy(...) will be reached.

@alexfrs69 (Author) commented May 3, 2025

Thanks for the reviews @itzg @Sammcb!

I've addressed your requested changes, please take another look 🙂
I haven't tested thoroughly yet, but I noticed an issue:

If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?

On the cache side, I think it would be better to:

  • Cache all backends at startup
  • On each player status request, refresh the cache only if the TTL has expired
  • When a downscale is triggered, cache the status before setting replicas to 0

This should reduce backend overhead (a rough sketch of the TTL check follows). What do you think?
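
A rough sketch of the TTL check from that plan; StatusEntry, the ttl field, and GetFresh are illustrative names only:

package sketch

import (
    "sync"
    "time"
)

type StatusEntry struct {
    LastUpdated time.Time
    // ...cached status fields elided...
}

type StatusCache struct {
    sync.RWMutex
    cache map[string]StatusEntry
    ttl   time.Duration
}

// GetFresh returns the cached entry and whether it is still within the TTL;
// the caller refreshes the backend's status only when fresh is false.
func (sc *StatusCache) GetFresh(serverAddress string) (entry StatusEntry, fresh bool) {
    sc.RLock()
    defer sc.RUnlock()
    entry, ok := sc.cache[serverAddress]
    if !ok {
        return StatusEntry{}, false
    }
    return entry, time.Since(entry.LastUpdated) < sc.ttl
}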

@Sammcb (Contributor) commented May 4, 2025

@alexfrs69 So, to make sure I understand the high-level goals: the fake online status (and associated caching) and fake online MOTD really only come into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?

If that is the case, my understanding is that the final goal is to capture information about a backend server's state to display in status requests when that server is down. One case I don't think this solves is when all backend servers are scaled down at the time mc-router starts up: unless it wakes up all backend servers, it would not be able to get info to show to players in the fake statuses. There are also similar cases where a user might take down a backend server to make updates (version upgrades, config changes, etc.) and then leave it down until a player tries to connect; in that case, those updates might not match the status shown to players.

I'm wondering what your thoughts would be on, instead of trying to get all this info from the server, simply adding a config file the user could supply with fallback status info? Obviously, this could drift from the running server's status too, and it would be up to the user running the servers/router to keep things in sync.

Thank you for doing so much extensive testing and research on these changes so far! I just wanted to suggest the config-option route because I recently ran into this issue when wanting to set up allow/deny lists for server autoscalers. I realized it would be very difficult to get the information from the servers directly, so I opted to simply let users enter it as config for the mc-router. I don't mean to intrude on your use case, though; it's just something I thought of as another approach.

@alexfrs69 (Author) commented May 4, 2025

So, to make sure I understand the high-level goals: the fake online status (and associated caching) and fake online MOTD really only come into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?

Exactly

I like the idea! I definitely considered the limitation that mc-router can't get the status if the backend is stopped. Relying only on configuration for status at scale could be a bit heavy to manage, though.

We could accept this limitation for now, or, as you suggested, add support for config files so users can provide fallback status info for backends that aren't in the cache. This way, if a status isn't cached, we can return the configured fallback. It offers flexibility for those who need it, without requiring config management for every backend by default.
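
A sketch of what such a fallback file could carry, purely illustrative; the field names and the idea of keying entries by server address are assumptions, not an agreed format:

package sketch

// FallbackStatus is what the router could return for a backend that is scaled
// down and has no cached status yet.
type FallbackStatus struct {
    MOTD            string `json:"motd"`
    VersionName     string `json:"versionName"`
    VersionProtocol int    `json:"versionProtocol"`
    MaxPlayers      int    `json:"maxPlayers"`
}

// FallbackConfig maps a server address (as used in the route mappings) to its
// fallback status; entries are optional and only consulted on a cache miss.
type FallbackConfig struct {
    Servers map[string]FallbackStatus `json:"servers"`
}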

@Sammcb (Contributor) commented May 4, 2025

Yeah, I think either accepting that as a limitation or adding the config file input could work!

In terms of getting the data dynamically, I really like this approach (made some slight modifications to your initial suggestion just to clarify):

  • Cache all (available) backends at startup
  • On each player status request, refresh the cache only if the TTL has expired or an entry in the cache does not exist
  • When a downscale is triggered, cache the status before setting replicas to 0

@alexfrs69 (Author) commented May 4, 2025

I can work on implementing the config files tomorrow 🙂

It would also be great to automatically detect any new backends added via k8s labels and cache their status right away.
Likewise, if a backend is deleted from the mappings, we should remove its cache entry as well.

@Sammcb (Contributor) commented May 4, 2025

If you hook that into the router's CreateMapping, etc., then that would be totally doable! (I did something very similar for the downscaler in #405.)
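
A sketch of that hook idea; the method shapes below are made up, since the real CreateMapping/DeleteMapping signatures in mc-router may differ:

package sketch

// statusCache is the minimal surface the routes need; the real StatusCache in
// the PR has more methods.
type statusCache interface {
    Update(serverAddress string)
    Delete(serverAddress string)
}

type routes struct {
    mappings map[string]string
    cache    statusCache
}

func (r *routes) CreateMapping(serverAddress, backend string) {
    r.mappings[serverAddress] = backend
    r.cache.Update(serverAddress) // warm the cache as soon as the route exists
}

func (r *routes) DeleteMapping(serverAddress string) {
    delete(r.mappings, serverAddress)
    r.cache.Delete(serverAddress) // drop the stale entry so it can't be served
}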


var cacheInterval time.Duration
if config.AutoScale.CacheStatus {
    cacheInterval, err = time.ParseDuration(config.AutoScale.CacheStatusInterval)
@itzg (Owner)

It seems like this config processing, and ultimately cacheInterval, should be fed into server.ConnectorConfig.

Comment on lines +271 to +275
connector.StatusCache.StartUpdater(connector, cacheInterval, func() map[string]string {
    mappings := server.Routes.GetMappings()
    logrus.WithField("mappings", mappings).Debug("Updating status cache with mappings")
    return mappings
})
@itzg (Owner)

...and that starting of the cache updater should be encapsulated inside of the connector.WaitForConnections method.

As it is, it's leaking a bit too much of runtime behavior into the main method.
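
A sketch of that suggestion, with illustrative names; the real ConnectorConfig and WaitForConnections differ:

package sketch

import "time"

// ConnectorConfig carries the cache settings alongside the other connector
// options, so main only builds the config and hands it over.
type ConnectorConfig struct {
    CacheStatus         bool
    CacheStatusInterval time.Duration
    // ...other connector settings elided...
}

type Connector struct {
    config ConnectorConfig
    // ...other fields elided...
}

// WaitForConnections owns the runtime behavior: it starts the cache updater
// itself instead of leaving that to main.
func (c *Connector) WaitForConnections() {
    if c.config.CacheStatus {
        go c.runStatusCacheUpdater(c.config.CacheStatusInterval)
    }
    // ...accept loop elided...
}

func (c *Connector) runStatusCacheUpdater(interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        // refresh cached statuses for all current mappings
    }
}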

Players: mcproto.StatusPlayers{
    Max:    0,
    Online: 0,
    Sample: []mcproto.PlayerEntry{},
@itzg (Owner)

Seems to be a compile error here. Probably meant StatusPlayerEntry?

const (
    handshakeTimeout = 5 * time.Second
    backendTimeout   = 30 * time.Second
@itzg (Owner)

Can you double check this is less than the timeout that the Minecraft client seems to run when attempting to connect?

Comment on lines +710 to +711
Name: "UNKNOWN",
Protocol: 0,
@itzg (Owner)

Have you been able to see what the Minecraft client thinks of these version and protocol values?

delete(sc.cache, serverAddress)
}

func (sc *StatusCache) updateAll(getBackends func() map[string]string) {
@itzg (Owner)

Nice encapsulation of passing in the getter function! Since it gets passed in a couple of places, I wonder if it could be provided once to NewStatusCache?
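
A sketch of providing the getter once at construction; the signature and the simplified value type are assumptions:

package sketch

import "sync"

// StatusCache stores the backend getter once, so updateAll (and the periodic
// updater) no longer needs it passed on every call. The string value type is a
// simplification for the sketch.
type StatusCache struct {
    sync.RWMutex
    cache       map[string]string
    getBackends func() map[string]string
}

func NewStatusCache(getBackends func() map[string]string) *StatusCache {
    return &StatusCache{
        cache:       make(map[string]string),
        getBackends: getBackends,
    }
}

func (sc *StatusCache) updateAll() {
    for serverAddress, backend := range sc.getBackends() {
        // the real code would ping the backend and store its status here
        sc.Lock()
        sc.cache[serverAddress] = backend
        sc.Unlock()
    }
}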

@itzg (Owner) commented May 4, 2025

If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?

That's somewhat by design, since Connector.acceptConnections runs its accept loop in a single goroutine. I couldn't find where the accept queue length can be specified in the Go stdlib docs, but in any case I believe the OS layer constrains the number of pending connections to accept, so I think it's safe enough not to handle it explicitly.

@alexfrs69 (Author)

Hey guys, just letting you know that I don't have time to invest in this PR this week.
I'll work on it next week for sure 🙂

@L-Wehmschulte

Hi everyone. I hate to be that guy, but I was wondering about the state of this PR. Do you still intend to finish this feature, @alexfrs69?
I'd love to see this implemented. I recently moved from a docker-compose setup with lazymc, which showed a nice MOTD and kept the user hanging while the server was starting up, over to Kubernetes, and the experience of the server showing up as offline and players getting kicked immediately while it was starting up led to a lot of confusion amongst my friends.
Thank you for all the effort you all have put in so far!

@itzg (Owner) commented Oct 10, 2025

Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:

showed a nice MOTD and kept the user hanging

I had forgotten a lot of this enhancement, and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.

I'll look at this more this weekend myself to refresh my memory.

@L-Wehmschulte

Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:

showed a nice MOTD and kept the user hanging

I had forgotten a lot of this enhancement, and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.

I'll look at this more this weekend myself to refresh my memory.

That would be awesome. Thank you so much!

@alexfrs69 (Author)

Hi everyone, sorry I haven’t been able to finish this PR yet. I’ll try to get back to it whenever I have some free time, maybe this weekend.

Most of the features are working fine, but it still needs some polishing, especially around implementing backend status config, a status cache according to our plan, and thoroughly testing everything out (Minecraft timeouts, status protocol, etc.).
