Better handle Redis failure #375
base: master
Conversation
… of JedisConnectionException)
- Reconnect all Redis connections when a single failure occurs (to make sure a new master is used as soon as possible)
Please let me know what you think. If you'd like to keep things as they are, I'd also merge and release.
// The old connection is not given back to the pool since it is defunct anyway
if (_log.isDebugEnabled())
    _log.debug("Connection error occurred, discarding Jedis connection: " + e.getMessage());
} catch (Exception e) {
As you wrote, catching only JedisConnectionException was too restrictive. I'm not sure about the various Jedis failure scenarios / reasons, but maybe there are some where closing the client does not help? I wonder if the exception handling should be a bit more elaborate here, or if the client should only be closed in certain cases.
Well, I had stack traces like this:
loadFromMemcached Could not load session with id 40EE1DDD6CF31EF4959FE0800E3F2BE1.2f87b4cb-8aaf-4f0a-b5f2-35ff18cbe685 from memcached.
java.lang.ClassCastException: java.lang.Long cannot be cast to [B
at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:259)
at redis.clients.jedis.BinaryJedis.get(BinaryJedis.java:244)
at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:129)
at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:127)
at de.javakaffee.web.msm.storage.RedisStorageClient$RedisCommandCallable.call(RedisStorageClient.java:181)
at de.javakaffee.web.msm.storage.RedisStorageClient.get(RedisStorageClient.java:135)
at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1142)
at de.javakaffee.web.msm.MemcachedSessionService.findSession(MemcachedSessionService.java:597)
at de.javakaffee.web.msm.MemcachedBackupSessionManager.findSession(MemcachedBackupSessionManager.java:218)
at org.apache.catalina.connector.Request.doGetSession(Request.java:3048)
at org.apache.catalina.connector.Request.getSessionInternal(Request.java:2662)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:467)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:156)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:683)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1115)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
And there were other exceptions emitted from inside the Jedis library, like NullPointerException. Of course, these might ultimately be considered bugs in Jedis, but it seems to me that it is safest for us not to rely on Jedis being bug-free and to just catch every Exception thrown by this single, very specific command. The intention being: if this command fails, close the connection.
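For illustration, a minimal sketch of that pattern. The class and the connection-handling methods are made up for this example (the real storage client uses its own pool and its RedisCommandCallable); the point is only the catch-all around a single command:

```java
import java.util.concurrent.Callable;

import redis.clients.jedis.Jedis;

// Sketch only: wrap exactly one Redis command and treat any Exception it
// throws, not only JedisConnectionException, as a sign that the underlying
// connection is broken and must be discarded.
abstract class SingleRedisCommand<T> implements Callable<T> {

    // How connections are obtained and discarded is an assumption of this
    // sketch; the real storage client manages this through its own pool.
    protected abstract Jedis acquireConnection();
    protected abstract void discardConnection(Jedis jedis);

    // The single, very specific command this callable wraps.
    protected abstract T execute(Jedis jedis) throws Exception;

    @Override
    public T call() throws Exception {
        final Jedis jedis = acquireConnection();
        try {
            return execute(jedis);
        } catch (Exception e) {
            // Catch-all: Jedis-internal errors (ClassCastException, NPE, ...)
            // are treated the same way as plain connection failures.
            discardConnection(jedis);
            throw e;
        }
    }
}
```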
Isn't the mentioned exception an indicator that the jedis instance is used by multiple threads, or - more generally - used in a wrong way, according to redis/jedis#359 or redis/jedis#1184?
Regarding NPEs or other exceptions, how does closing the connection help? Was the connection unusable after such an exception occurred?
Even if we handle Jedis bugs here as a workaround, I think these issues should definitely be reported to the Jedis developers.
I would have a hard time reproducing this (and sending detailed reports to the Jedis developers) since it only happened once, during an ElastiCache failover (we had several successful ElastiCache failovers before). I don't think we are using connections in multiple threads concurrently, except if our pool code had errors, which I don't see right now.
Anyway, in the incident I reported, it seemed that after the failover "something" was wrong with the connections, and that after restarting the web workers (which of course triggered a reconnection of all Jedis instances) everything was fine again. That leads to the conclusion that whenever "something" is wrong with the connections, reconnecting is a safe bet, especially as in regular operation there should be no reason for a connection to ever fail.
Ok, got it. Then let's keep it as it is.
// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isErrorEnabled())
    _log.error("Redis connection error occurred, discarding Jedis instance", e);
The exception is handled and no manual intervention should be needed. Maybe warn would be sufficient?
Yes, I can change this if you like.
👍
// Note that there is a chance that connections which are currently in use are still returned to the
// pool after the shutdown because of a race condition, but since we create a new instance below
// unconditionally we can at least be sure no operation will fail a second time
shutdown();
There's also another race condition:
- 2 redis requests R1 and R2 are in flight
- R1 fails and a new, healthy connection C is created, which is then used by a third request R3
- R3 completes and returns the healthy connection to the pool
- R2 fails
- This will also close the healthy connection
Most probably this is just inefficient, but in combination with the broad exception handling I fear that it might have worse consequences and might result in continuous connection creation / shutdown.
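For illustration, a sketch of one way this redundant teardown could be narrowed. This is not part of the PR; the class and method names are invented and JedisPool merely stands in for the client's own pool. The idea is to replace the pool only if the failing request's connection came from the pool instance that is still current:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

import redis.clients.jedis.JedisPool;

// Sketch only: guard the pool rebuild so a pool that was already rebuilt by
// another failing request is not torn down again.
class GuardedPoolRebuild {

    private final AtomicReference<JedisPool> current;

    GuardedPoolRebuild(final JedisPool initial) {
        this.current = new AtomicReference<>(initial);
    }

    JedisPool pool() {
        return current.get();
    }

    void onConnectionFailure(final JedisPool failedPool, final Supplier<JedisPool> poolFactory) {
        final JedisPool replacement = poolFactory.get();
        if (current.compareAndSet(failedPool, replacement)) {
            // We won the race: drop all connections of the broken pool.
            failedPool.destroy();
        } else {
            // Another thread already replaced the pool; keep that one and
            // discard the extra pool we just created.
            replacement.destroy();
        }
    }
}
```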
This is a race, but it will not lead to an endless loop. Once there are no "unhealthy" connections left in the pool, the pool will just build up healthy connections again. I am not sure why a healthy connection would ever fail if there is no real "connection problem". I would be willing to test this code even on our live systems for a few days to make sure this does not happen too frequently.
I wouldn't see this as a prerequisite to merging this, as I'd assume that you'd submit a follow up PR if this causes issues :-)
@msellinger what's the status here, are you waiting for me to merge this, or are you going to change anything?
@magro This is to inform you that we are currently testing this patch (based on the current MSM version); we will also switch to Tomcat 8.5 in the process. Once everything is tested, I will report back and then ask again for this to be included in mainline.
We have had the current master plus this pull request running in production for several weeks now and have not noticed any regressions. From my side, this pull request is ready to be merged.
This patch handles failures of Redis operations in a better way. It is especially aimed at letting the failure of an Amazon ElastiCache node trigger immediate logging and a reconnection of all Redis connections MSM currently uses. The current code has multiple deficiencies in this respect, which the patch addresses as follows:
(a) The current code does not detect the failure in all cases, since not all failures lead to a JedisConnectionException; the new code catches Exception instead of JedisConnectionException.
(b) The patch reconnects all Redis connections whenever a single failure occurs, to make sure that once a failure is detected no old, defunct connections are lying around (these might lead to hard-to-track-down errors later).
(c) The patch does proper INFO / ERROR logging so the problem can be debugged later from the logs, instead of only DEBUG logging, which is likely switched off on a live system.
Note that this patch assumes that every Redis error is an indication that the node failed, which, however, should be safe to assume in production systems.
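For reference, a hedged, self-contained sketch of the behaviour described in (a) to (c). The class name is invented, and JedisPool plus Tomcat's juli logging stand in for the actual storage client, pool and logger used by MSM; it illustrates the idea, not the patch itself:

```java
import org.apache.juli.logging.Log;
import org.apache.juli.logging.LogFactory;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Sketch only: any Exception from a command is logged at a visible level and
// causes the whole pool to be rebuilt, so no stale connections to the failed
// node survive.
class ReconnectingRedisClient {

    private static final Log LOG = LogFactory.getLog(ReconnectingRedisClient.class);

    private final String host;
    private final int port;
    private JedisPool pool;

    ReconnectingRedisClient(final String host, final int port) {
        this.host = host;
        this.port = port;
        this.pool = new JedisPool(host, port);
    }

    synchronized byte[] get(final byte[] key) {
        try (Jedis jedis = pool.getResource()) {
            return jedis.get(key);
        } catch (Exception e) {
            // (a) Catch Exception, not only JedisConnectionException.
            // (c) Log at a level that is visible on a live system.
            LOG.error("Redis connection error occurred, rebuilding connection pool", e);
            // (b) Drop all connections so a new master is picked up as soon
            // as possible, then retry the command once on a fresh pool.
            pool.destroy();
            pool = new JedisPool(host, port);
            try (Jedis jedis = pool.getResource()) {
                return jedis.get(key);
            }
        }
    }
}
```

In this sketch the whole pool is torn down and rebuilt on any failure and the command is retried once on a fresh connection; the actual patch may structure the retry and pool handling differently.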