
Conversation

msellinger

This patch handles failures of Redis operations in a more robust way. In particular, it is meant to let the failure of an Amazon ElastiCache node trigger immediate logging and reconnection of all Redis connections MSM currently uses. The current code has several deficiencies in this respect:

(a) It would not detect the failure in all cases, since not all failures lead to a JedisConnectionException (the new code catches Exception instead of only JedisConnectionException).

(b) The new code reconnects all Redis connections even when only a single failure occurs, to make sure that once a failure is detected no old, defunct connections are left lying around (these might lead to hard-to-track-down errors later).

(c) It does proper INFO/ERROR logging so the problem can be debugged from the logs later, instead of only DEBUG logging, which is likely switched off on a production system.

Note that this patch assumes that every Redis error indicates that the node has failed, which should be a safe assumption in production systems.
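
As a rough illustration of the pattern described above, a minimal sketch (class and method names here are hypothetical, not the actual MSM / RedisStorageClient code):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hypothetical sketch: any failure of a Redis command discards the current pool and
// creates a fresh one, so all connections are re-established against the (new) node.
public class ReconnectingRedisClient {

    private volatile JedisPool _pool;
    private final String _host;
    private final int _port;

    public ReconnectingRedisClient(final String host, final int port) {
        _host = host;
        _port = port;
        _pool = new JedisPool(host, port);
    }

    public byte[] get(final byte[] key) {
        final Jedis jedis = _pool.getResource();
        try {
            final byte[] value = jedis.get(key);
            jedis.close();          // healthy connection goes back to the pool
            return value;
        } catch (Exception e) {     // any failure, not only JedisConnectionException
            // the defunct connection is not returned to the pool; the whole pool is rebuilt instead
            reconnectAll();
            throw new RuntimeException("Redis operation failed, pool was rebuilt", e);
        }
    }

    private synchronized void reconnectAll() {
        _pool.destroy();                      // close all pooled connections
        _pool = new JedisPool(_host, _port);  // subsequent calls use fresh connections
    }
}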

Owner

@magro magro left a comment


Please let me know what you think. If you'd like to keep things as they are, I'd also merge and release.

// The old connection is not given back to the pool since it is defunct anyway
if (_log.isDebugEnabled())
_log.debug("Connection error occurred, discarding Jedis connection: " + e.getMessage());
} catch (Exception e) {
Owner


As you wrote, catching only JedisConnectionException was too restrictive. I'm not sure about the various jedis failure scenarios / reasons, but maybe there are some where closing the client does not help? I wonder if the exception handling should be a bit more elaborate here, or if the client should only be closed in certain cases.
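
To make the question concrete, a more selective handler might look roughly like this (just a sketch of the alternative being discussed; discardAndReconnect and returnToPool are hypothetical helpers, not code from the patch):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;
import redis.clients.jedis.exceptions.JedisDataException;

// Sketch only: close the client for connection problems, keep it for plain Redis error replies.
byte[] get(final Jedis jedis, final byte[] key) {
    try {
        return jedis.get(key);
    } catch (JedisConnectionException e) {
        // clearly a broken socket: discard this connection and reconnect
        discardAndReconnect(jedis, e);
        throw e;
    } catch (JedisDataException e) {
        // a Redis error reply (e.g. READONLY during a failover): the socket itself may
        // still be usable, so the connection could in principle go back to the pool
        returnToPool(jedis);
        throw e;
    } catch (Exception e) {
        // anything else coming out of jedis (ClassCastException, NullPointerException):
        // the connection state is unclear, so the patch treats this like a connection failure
        discardAndReconnect(jedis, e);
        throw new RuntimeException(e);
    }
}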

Author


Well, I had stack traces like this:

loadFromMemcached Could not load session with id 40EE1DDD6CF31EF4959FE0800E3F2BE1.2f87b4cb-8aaf-4f0a-b5f2-35ff18cbe685 from memcached.
java.lang.ClassCastException: java.lang.Long cannot be cast to [B
	at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:259)
	at redis.clients.jedis.BinaryJedis.get(BinaryJedis.java:244)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:129)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:127)
	at de.javakaffee.web.msm.storage.RedisStorageClient$RedisCommandCallable.call(RedisStorageClient.java:181)
	at de.javakaffee.web.msm.storage.RedisStorageClient.get(RedisStorageClient.java:135)
	at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1142)
	at de.javakaffee.web.msm.MemcachedSessionService.findSession(MemcachedSessionService.java:597)
	at de.javakaffee.web.msm.MemcachedBackupSessionManager.findSession(MemcachedBackupSessionManager.java:218)
	at org.apache.catalina.connector.Request.doGetSession(Request.java:3048)
	at org.apache.catalina.connector.Request.getSessionInternal(Request.java:2662)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:467)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
	at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:156)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:683)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1115)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)

And other exceptions emitted from inside the Jedis library, like NullPointerException. Of course, these might ultimately be considered bugs in Jedis, but it seems safest for us not to rely on Jedis being bug-free and instead catch every Exception thrown by this single, very specific command. The intention is: if this command fails, close the connection.

Owner

@magro magro Jan 20, 2018


Isn't the mentioned exception an indicator that the jedis instance is used by multiple threads or, more generally, used in a wrong way, as discussed in redis/jedis#359 and redis/jedis#1184?
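
For context, this exception is the classic symptom of one Jedis instance being shared by several threads: the reply stream gets out of sync, and a GET may read the Long reply belonging to another command. A generic illustration (not a claim about the MSM pool code):

import redis.clients.jedis.Jedis;

// Generic illustration: two threads sharing one Jedis instance write commands on the same
// socket and then read whichever reply arrives first, so a GET (expecting a bulk [B reply)
// can consume the Long reply of an EXPIRE, failing with "java.lang.Long cannot be cast to [B".
public class SharedJedisDemo {
    public static void main(final String[] args) {
        final Jedis shared = new Jedis("localhost", 6379);
        final Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                shared.get("some-key".getBytes());  // expects a bulk ([B) reply
                shared.expire("some-key", 60);      // expects an integer (Long) reply
            }
        };
        new Thread(worker).start();
        new Thread(worker).start();                 // interleaved replies, eventually a ClassCastException
    }
}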

Owner


Regarding NPEs or other exceptions, how does closing the connection help? Was the connection unusable after such an exception occurred?
Even if we handle jedis bugs here as a workaround, I think these issues should definitely be reported to the jedis developers.

Author


I would have a hard time reproducing this (and sending detailed reports to the Jedis developers), since it only happened once, during an ElastiCache failover (we had several successful ElastiCache failovers before). I don't think we are using connections from multiple threads concurrently, unless our pool code has errors, which I don't see right now.

Anyway, in the incident I reported, after the failover it seemed that "something" was wrong with the connections, and after restarting the web workers (which of course triggered a reconnection of all Jedis instances) everything was fine again. That leads me to the conclusion that whenever "something" is wrong with the connections, reconnecting is a safe bet, especially as in regular operation there should be no reason for a connection to ever fail.

Owner


Ok, got it. Then let's keep it as it is.

// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isErrorEnabled())
_log.error("Redis connection error occurred, discarding Jedis instance", e);
Owner


The exception is handled and no manual intervention should be needed. Maybe warn would be sufficient?

Author


Yes, I can change this if you like.
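
For reference, the change being discussed would roughly amount to this (a sketch assuming the existing _log field offers the usual warn counterparts):

// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isWarnEnabled())
    _log.warn("Redis connection error occurred, discarding Jedis instance", e);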

Owner


👍

// Note that there is a chance that connections which are currently in use are still returned to the
// pool after the shutdown because of a race condition, but since we create a new instance below
// unconditionally we can at least be sure no operation will fail a second time
shutdown();
Owner


There's also another race condition:

  • 2 Redis requests R1 and R2 are in flight
  • R1 fails and creates a new, healthy connection C, which is handed to a third request R3
  • R3 completes and returns the healthy connection C to the pool
  • R2 fails
  • This will also close the healthy connection C

Most probably this is just inefficient, but in combination with the broad exception handling I fear it might have worse consequences and result in continuous connection creation / shutdown.
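
One hypothetical way to limit the effect of this race (not part of the patch, purely a sketch) would be to remember which pool the failing connection was borrowed from and only rebuild if that pool is still the current one:

import redis.clients.jedis.JedisPool;

// Hypothetical sketch: a failure on a connection borrowed from an already-replaced pool
// does not tear down the new, healthy pool again.
public class GuardedReconnect {

    private volatile JedisPool _currentPool = new JedisPool("localhost", 6379);

    JedisPool currentPool() {
        return _currentPool;
    }

    synchronized void reconnectAll(final JedisPool poolTheFailedConnectionCameFrom) {
        if (poolTheFailedConnectionCameFrom != _currentPool) {
            return;   // the pool was already replaced by another failing request
        }
        _currentPool.destroy();
        _currentPool = new JedisPool("localhost", 6379);
    }
}

In the scenario above, R2's connection came from the old pool, so its failure would no longer close the healthy connection C.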

Author


This is a race, but it will not lead to an endless loop. Once there are no "unhealthy" connections left in the pool, the pool will just build up healthy connections again. I am also not sure for what reason a healthy connection would ever fail if there is no real "connection problem". I would be willing to test this code on our live systems for a few days to make sure this does not happen too frequently.

Owner


I wouldn't see this as a prerequisite to merging this, as I'd assume that you'd submit a follow up PR if this causes issues :-)

@magro
Owner

magro commented Mar 14, 2018

@msellinger what's the status here, are you waiting for me to merge this or are you going to change anything?

@msellinger
Author

@magro Just to let you know: we are currently testing this patch (based on the current MSM version), and we will also switch to Tomcat 8.5 in the process. Once everything is tested, I will report back and ask again for this to be included in mainline.

@msellinger
Author

We have had the current master plus this pull request running in production for several weeks now and did not notice any regressions. From my side, this pull request is ready to be merged.
