@@ -20,7 +20,6 @@
import org.apache.juli.logging.Log;
import org.apache.juli.logging.LogFactory;
import redis.clients.jedis.BinaryJedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

import java.net.URI;
import java.net.URISyntaxException;
@@ -178,11 +177,11 @@ public T call() throws Exception {
try {
jedis = _pool.borrowInstance(false);
return execute(jedis);
} catch (JedisConnectionException e) {
// Connection error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is defunct anyway
if (_log.isDebugEnabled())
_log.debug("Connection error occurred, discarding Jedis connection: " + e.getMessage());
} catch (Exception e) {
Owner:

As you wrote, catching only JedisConnectionException was too restrictive. I'm not sure about the various Jedis failure scenarios, but maybe there are some where closing the client does not help? I wonder whether the exception handling should be a bit more elaborate here, or whether the client should only be closed in certain cases.

Author:

Well, I had stack traces like this:

loadFromMemcached Could not load session with id 40EE1DDD6CF31EF4959FE0800E3F2BE1.2f87b4cb-8aaf-4f0a-b5f2-35ff18cbe685 from memcached.
java.lang.ClassCastException: java.lang.Long cannot be cast to [B
	at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:259)
	at redis.clients.jedis.BinaryJedis.get(BinaryJedis.java:244)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:129)
	at de.javakaffee.web.msm.storage.RedisStorageClient$3.execute(RedisStorageClient.java:127)
	at de.javakaffee.web.msm.storage.RedisStorageClient$RedisCommandCallable.call(RedisStorageClient.java:181)
	at de.javakaffee.web.msm.storage.RedisStorageClient.get(RedisStorageClient.java:135)
	at de.javakaffee.web.msm.MemcachedSessionService.loadFromMemcached(MemcachedSessionService.java:1142)
	at de.javakaffee.web.msm.MemcachedSessionService.findSession(MemcachedSessionService.java:597)
	at de.javakaffee.web.msm.MemcachedBackupSessionManager.findSession(MemcachedBackupSessionManager.java:218)
	at org.apache.catalina.connector.Request.doGetSession(Request.java:3048)
	at org.apache.catalina.connector.Request.getSessionInternal(Request.java:2662)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:467)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
	at de.javakaffee.web.msm.RequestTrackingHostValve.invoke(RequestTrackingHostValve.java:156)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
	at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:683)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
	at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1115)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:318)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)

And other exceptions were emitted from inside the Jedis library, like NullPointerException. These might ultimately be considered bugs in Jedis, but it seems safest for us not to rely on Jedis being bug-free and instead to catch every Exception thrown by this single, very specific command. The intention being: if this command fails, close the connection.
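
A condensed sketch of the pattern being argued for here (not the exact PR code): catch any failure of this single command, discard the possibly defunct connection, and retry once on a connection from a reinitialized pool. The retry via borrowInstance(true) is an assumption based on the pool API shown further down in this diff.

    public T call() throws Exception {
        BinaryJedis jedis = null;
        try {
            jedis = _pool.borrowInstance(false);
            T result = execute(jedis);
            _pool.returnInstance(jedis);
            return result;
        } catch (Exception e) {
            // Whatever went wrong, assume the connection is defunct: log, close, retry once.
            _log.warn("Redis command failed, discarding Jedis instance and retrying", e);
            if (jedis != null)
                try { jedis.close(); } catch (Exception e2) { /* ignore */ }
        }

        // Second and last attempt, on a fresh connection from a reinitialized pool;
        // any exception from here on propagates to the caller.
        BinaryJedis retry = _pool.borrowInstance(true);
        try {
            T result = execute(retry);
            _pool.returnInstance(retry);
            return result;
        } catch (Exception e) {
            try { retry.close(); } catch (Exception e2) { /* ignore */ }
            throw e;
        }
    }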

Owner (@magro, Jan 20, 2018):

Isn't the mentioned exception an indicator that the Jedis instance is used by multiple threads or, more generally, used incorrectly, as described in redis/jedis#359 or redis/jedis#1184?

Owner:

Regarding NPEs or other exceptions, how does closing the connection help? Was the connection unusable after such an exception occurred?
Even if we handle Jedis bugs here as a workaround, I think these issues should definitely be reported to the Jedis maintainers.

Author:

I would have a hard time reproducing this (and sending detailed reports to the Jedis developers) since it only happened once, during an ElastiCache failover (we had had several successful ElastiCache failovers before). I don't think we are using connections from multiple threads concurrently, unless our pool code has errors that I don't see right now.

Anyway, in the incident I reported, "something" seemed to be wrong with the connections after the failover, and after restarting the web workers (which of course triggered a reconnection of all Jedis instances) everything was fine again. That leads to the conclusion that whenever "something" is wrong with the connections, reconnecting is a safe bet, especially since in regular operation there should be no reason for a connection to ever fail.

Owner:

Ok, got it. Then let's keep it as it is.

// An error occurred with this Jedis connection, so now make sure to get a known-good one
// The old connection is not given back to the pool since it is assumed to be defunct anyway
if (_log.isErrorEnabled())
_log.error("Redis connection error occurred, discarding Jedis instance", e);
Owner:

The exception is handled and no manual intervention should be needed. Maybe warn would be sufficient?

Author:

Yes, I can change this if you like.

Owner:

👍


if (jedis != null)
try { jedis.close(); } catch (Exception e2) { /* ignore */ }
@@ -209,47 +208,58 @@ public T call() throws Exception {
private class JedisPool {
private Queue<BinaryJedis> _queue = new ConcurrentLinkedQueue<BinaryJedis>();

public BinaryJedis borrowInstance(boolean knownGood) {
public BinaryJedis borrowInstance(boolean reinitializePool) {
if (reinitializePool) {
if (_log.isInfoEnabled())
_log.info("Reinitializing Redis connection pool");

// Shut down pool, effectively closing all connections
// Note that there is a chance that connections which are currently in use are still returned to the
// pool after the shutdown because of a race condition, but since we create a new instance below
// unconditionally we can at least be sure no operation will fail a second time
shutdown();
Owner:

There's also another race condition:

  • 2 redis requests R1 and R2 are in flight
  • R1 fails and, via the pool reinitialization, creates a new, healthy connection C that a later request R3 picks up
  • R3 completes and returns the healthy connection to the pool
  • R2 fails
  • Its pool reinitialization also closes the healthy connection C

Most probably this is just inefficient, but in combination with the broad exception handling I fear that it might have worse consequences and might result in continuous connection creation / shutdown.
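
For illustration, a minimal single-threaded replay of that interleaving, with plain strings standing in for pooled BinaryJedis connections (the class and names are purely illustrative, not part of the PR):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class PoolRaceReplay {
        public static void main(String[] args) {
            Queue<String> pool = new ConcurrentLinkedQueue<String>();

            // R1 and R2 are in flight on stale connections. R1 fails first and
            // reinitializes the (currently empty) pool, creating a fresh connection C.
            String c = "C (healthy)";

            // A later request R3 borrows C, completes and returns it to the pool.
            pool.add(c);

            // Now R2 fails as well and reinitializes the pool again, closing
            // everything in it, including the healthy connection C.
            pool.clear(); // stands in for shutdown(): close and drop all pooled connections

            System.out.println("pool size after second reinitialization: " + pool.size()); // prints 0
        }
    }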

Author:

This is a race, but it will not lead to an endless loop: once there are no "unhealthy" connections left in the pool, it will simply build up healthy connections again. I am not sure why a healthy connection would ever fail if there is no real "connection problem". I would be willing to test this code on our live systems for a few days to make sure this does not happen too frequently.

Owner:

I wouldn't see this as a prerequisite to merging, as I'd assume you'd submit a follow-up PR if this causes issues :-)


if (_log.isDebugEnabled()) {
_log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) after reinitializing pool",
_uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss")));
}
return createJedisInstance();
}

BinaryJedis res;
if((res = _queue.poll()) == null) {

if ((res = _queue.poll()) == null) {
if (_log.isDebugEnabled()) {
_log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s)",
_uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss")));
}
return createJedisInstance();
}

if (knownGood) {
// Check all existing connections until we find a good one
do {
try {
res.ping();
// Check all existing connections until we find a good one
do {
try {
res.ping();

if (_log.isTraceEnabled())
_log.trace(format("Using known-good connection #%d", _queue.size()));
if (_log.isTraceEnabled())
_log.trace(format("Using known-good connection #%d", _queue.size()));

return res;
} catch (Exception e) {
if (_log.isDebugEnabled())
_log.debug(format("Removing connection #%d since it cannot be pinged", _queue.size()));
return res;
} catch (Exception e) {
if (_log.isDebugEnabled())
_log.debug(format("Removing connection #%d since it cannot be pinged", _queue.size()));

try { res.close(); } catch (Exception e2) { /* ignore */ }
}
} while ((res = _queue.poll()) != null);

// No existing connections are good, so create new one
if (_log.isDebugEnabled()) {
_log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) since all existing connections were bad",
_uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss")));
try { res.close(); } catch (Exception e2) { /* ignore */ }
}
} while ((res = _queue.poll()) != null);

return createJedisInstance();
} else {
if (_log.isTraceEnabled())
_log.trace(format("Using connection #%d", _queue.size()));

return res;
// No existing connections are good, so create new one
if (_log.isDebugEnabled()) {
_log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) since all existing connections were bad",
_uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss")));
}

return createJedisInstance();
}

public void returnInstance(BinaryJedis instance) {