
Conversation

msellinger

This patch handles failures of Redis operations in a more robust way. In particular, it is meant to let the failure of an Amazon ElastiCache node trigger immediate logging and reconnection of all Redis connections MSM currently uses. The current code has several deficiencies in this respect:

(a) It would not detect the failure in all cases, since not all failures lead to a JedisConnectionException (the new code catches Exception instead of only JedisConnectionException).

(b) The new code reconnects all Redis connections even when only a single failure occurs, to make sure that once a failure is detected no old, defunct connections are left lying around (these might lead to hard-to-track-down errors later).

(c) It does proper INFO/ERROR logging so the problem can be debugged from the logs later, instead of only DEBUG logging, which is likely switched off on a production system.

Note that this patch assumes that every Redis error indicates that the node has failed, which should be a safe assumption in production systems.
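
As a rough illustration of the pattern described above, a minimal sketch (class and method names here are hypothetical, not the actual MSM / RedisStorageClient code):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hypothetical sketch: any failure of a Redis command discards the current pool and
// creates a fresh one, so all connections are re-established against the (new) node.
public class ReconnectingRedisClient {

    private volatile JedisPool _pool;
    private final String _host;
    private final int _port;

    public ReconnectingRedisClient(final String host, final int port) {
        _host = host;
        _port = port;
        _pool = new JedisPool(host, port);
    }

    public byte[] get(final byte[] key) {
        final Jedis jedis = _pool.getResource();
        try {
            final byte[] value = jedis.get(key);
            jedis.close();          // healthy connection goes back to the pool
            return value;
        } catch (Exception e) {     // any failure, not only JedisConnectionException
            // the defunct connection is not returned to the pool; the whole pool is rebuilt instead
            reconnectAll();
            throw new RuntimeException("Redis operation failed, pool was rebuilt", e);
        }
    }

    private synchronized void reconnectAll() {
        _pool.destroy();                      // close all pooled connections
        _pool = new JedisPool(_host, _port);  // subsequent calls use fresh connections
    }
}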

Owner

@magro magro left a comment


Please let me know what you think. If you'd like to keep things as they are, I'd also merge and release.

// The old connection is not given back to the pool since it is defunct anyway
if (_log.isDebugEnabled())
_log.debug("Connection error occurred, discarding Jedis connection: " + e.getMessage());
} catch (Exception e) {
Owner


As you wrote, catching only JedisConnectionException was too restrictive. I'm not sure about the various jedis failure scenarios / reasons, but maybe there are some where closing the client does not help? I wonder if the exception handling should be a bit more elaborate here, or if the client should only be closed in certain cases.
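
To make the question concrete, a more selective handler might look roughly like this (just a sketch of the alternative being discussed; discardAndReconnect and returnToPool are hypothetical helpers, not code from the patch):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;
import redis.clients.jedis.exceptions.JedisDataException;

// Sketch only: close the client for connection problems, keep it for plain Redis error replies.
byte[] get(final Jedis jedis, final byte[] key) {
    try {
        return jedis.get(key);
    } catch (JedisConnectionException e) {
        // clearly a broken socket: discard this connection and reconnect
        discardAndReconnect(jedis, e);
        throw e;
    } catch (JedisDataException e) {
        // a Redis error reply (e.g. READONLY during a failover): the socket itself may
        // still be usable, so the connection could in principle go back to the pool
        returnToPool(jedis);
        throw e;
    } catch (Exception e) {
        // anything else coming out of jedis (ClassCastException, NullPointerException):
        // the connection state is unclear, so the patch treats this like a connection failure
        discardAndReconnect(jedis, e);
        throw new RuntimeException(e);
    }
}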

Author


Well, I had stack traces like this:

loadFromMemcached Could not load session with id 40EE1DDD6CF31EF4959FE0800E3F2BE1.2f87b4cb-8aaf-4f0a-b5f2-35ff18cbe685 from memcached.
java.lang.ClassCastException: java.lang.Long cannot be cast to [B
	at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:259)
	at redis.clients.jedis.BinaryJedis.get(BinaryJedis.java:244)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:129)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:127)
	at de.javakaffee.web.msm.storage.RedisStorageClient$RedisCommandCallable.call(RedisStorageClient.java:181)
	at de.javakaffee.web.msm.storage.RedisStorageClient.get(RedisStorageClient.java:135)
	at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1142)
	at de.javakaffee.web.msm.MemcachedSessionService.findSession(MemcachedSessionService.java:597)
	at de.javakaffee.web.msm.MemcachedBackupSessionManager.findSession(MemcachedBackupSessionManager.java:218)
	at org.apache.catalina.connector.Request.doGetSession(Request.java:3048)
	at org.apache.catalina.connector.Request.getSessionInternal(Request.java:2662)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:467)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
	at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:156)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:683)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1115)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)

And other exceptions emitted from inside the Jedis library, like NullPointerException. Of course, these might ultimately be considered bugs in Jedis, but it seems safest for us not to rely on Jedis being bug-free and instead catch every Exception thrown by this single, very specific command. The intention is: if this command fails, close the connection.

Owner

@magro magro Jan 20, 2018


Isn't the mentioned exception an indicator that the jedis instance is used by multiple threads or, more generally, used in a wrong way, as discussed in redis/jedis#359 and redis/jedis#1184?
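
For context, this exception is the classic symptom of one Jedis instance being shared by several threads: the reply stream gets out of sync, and a GET may read the Long reply belonging to another command. A generic illustration (not a claim about the MSM pool code):

import redis.clients.jedis.Jedis;

// Generic illustration: two threads sharing one Jedis instance write commands on the same
// socket and then read whichever reply arrives first, so a GET (expecting a bulk [B reply)
// can consume the Long reply of an EXPIRE, failing with "java.lang.Long cannot be cast to [B".
public class SharedJedisDemo {
    public static void main(final String[] args) {
        final Jedis shared = new Jedis("localhost", 6379);
        final Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                shared.get("some-key".getBytes());  // expects a bulk ([B) reply
                shared.expire("some-key", 60);      // expects an integer (Long) reply
            }
        };
        new Thread(worker).start();
        new Thread(worker).start();                 // interleaved replies, eventually a ClassCastException
    }
}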

Owner


Regarding NPEs or other exceptions, how does closing the connection help? Was the connection unusable after such an exception occurred?
Even if we handle jedis bugs here as a workaround, I think these issues should definitely be reported to the jedis developers.

Author


I would have a hard time reproducing this (and sending detailed reports to the Jedis developers), since it only happened once, during an ElastiCache failover (we had several successful ElastiCache failovers before). I don't think we are using connections from multiple threads concurrently, unless our pool code has errors, which I don't see right now.

Anyway, in the incident I reported, after the failover it seemed that "something" was wrong with the connections, and after restarting the web workers (which of course triggered a reconnection of all Jedis instances) everything was fine again. That leads me to the conclusion that whenever "something" is wrong with the connections, reconnecting is a safe bet, especially as in regular operation there should be no reason for a connection to ever fail.

Owner


Ok, got it. Then let's keep it as it is.

// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isErrorEnabled())
_log.error("Redis connection error occurred, discarding Jedis instance", e);
Owner


The exception is handled and no manual intervention should be needed. Maybe warn would be sufficient?

Author


Yes, I can change this if you like.
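
For reference, the change being discussed would roughly amount to this (a sketch assuming the existing _log field offers the usual warn counterparts):

// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isWarnEnabled())
    _log.warn("Redis connection error occurred, discarding Jedis instance", e);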

Owner


👍

// Note that there is a chance that connections which are currently in use are still returned to the
// pool after the shutdown because of a race condition, but since we create a new instance below
// unconditionally we can at least be sure no operation will fail a second time
shutdown();
Owner


There's also another race condition:

  • 2 Redis requests R1 and R2 are in flight
  • R1 fails and creates a new, healthy connection C, which is handed to a third request R3
  • R3 completes and returns the healthy connection C to the pool
  • R2 fails
  • This will also close the healthy connection C

Most probably this is just inefficient, but in combination with the broad exception handling I fear it might have worse consequences and result in continuous connection creation / shutdown.
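
One hypothetical way to limit the effect of this race (not part of the patch, purely a sketch) would be to remember which pool the failing connection was borrowed from and only rebuild if that pool is still the current one:

import redis.clients.jedis.JedisPool;

// Hypothetical sketch: a failure on a connection borrowed from an already-replaced pool
// does not tear down the new, healthy pool again.
public class GuardedReconnect {

    private volatile JedisPool _currentPool = new JedisPool("localhost", 6379);

    JedisPool currentPool() {
        return _currentPool;
    }

    synchronized void reconnectAll(final JedisPool poolTheFailedConnectionCameFrom) {
        if (poolTheFailedConnectionCameFrom != _currentPool) {
            return;   // the pool was already replaced by another failing request
        }
        _currentPool.destroy();
        _currentPool = new JedisPool("localhost", 6379);
    }
}

In the scenario above, R2's connection came from the old pool, so its failure would no longer close the healthy connection C.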

Author


This is a race, but it will not lead to an endless loop. Once there are no "unhealthy" connections left in the pool, the pool will just build up healthy connections again. I am also not sure for what reason a healthy connection would ever fail if there is no real "connection problem". I would be willing to test this code on our live systems for a few days to make sure this does not happen too frequently.

Owner


I wouldn't see this as a prerequisite to merging this, as I'd assume that you'd submit a follow up PR if this causes issues :-)

@magro
Owner

magro commented Mar 14, 2018

@msellinger what's the status here, are you waiting for me to merge this or are you going to change anything?

@msellinger
Author

@magro Just to let you know: we are currently testing this patch (based on the current MSM version), and we will also switch to Tomcat 8.5 in the process. Once everything is tested, I will report back and ask again for this to be included in mainline.

@msellinger
Author

We have had the current master plus this pull request running in production for several weeks now and did not notice any regressions. From my side, this pull request is ready to be merged.
