-
Notifications
You must be signed in to change notification settings - Fork 349
Better handle Redis failure #375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,7 +20,6 @@ | |
| import org.apache.juli.logging.Log; | ||
| import org.apache.juli.logging.LogFactory; | ||
| import redis.clients.jedis.BinaryJedis; | ||
| import redis.clients.jedis.exceptions.JedisConnectionException; | ||
|
|
||
| import java.net.URI; | ||
| import java.net.URISyntaxException; | ||
|
|
@@ -178,11 +177,11 @@ public T call() throws Exception { | |
| try { | ||
| jedis = _pool.borrowInstance(false); | ||
| return execute(jedis); | ||
| } catch (JedisConnectionException e) { | ||
| // Connection error occurred with this Jedis connection, so now make sure to get a known-good one | ||
| // The old connection is not given back to the pool since it is defunct anyway | ||
| if (_log.isDebugEnabled()) | ||
| _log.debug("Connection error occurred, discarding Jedis connection: " + e.getMessage()); | ||
| } catch (Exception e) { | ||
| // An error occurred with this Jedis connection, so now make sure to get a known-good one | ||
| // The old connection is not given back to the pool since it is assumed to be defunct anyway | ||
| if (_log.isErrorEnabled()) | ||
| _log.error("Redis connection error occurred, discarding Jedis instance", e); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The exception is handled and no manual intervention should be needed. Maybe warn would be sufficient? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I can change this if you like. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||
|
|
||
| if (jedis != null) | ||
| try { jedis.close(); } catch (Exception e2) { /* ignore */ } | ||
|
|
@@ -209,47 +208,58 @@ public T call() throws Exception { | |
| private class JedisPool { | ||
| private Queue<BinaryJedis> _queue = new ConcurrentLinkedQueue<BinaryJedis>(); | ||
|
|
||
| public BinaryJedis borrowInstance(boolean knownGood) { | ||
| public BinaryJedis borrowInstance(boolean reinitializePool) { | ||
| if (reinitializePool) { | ||
| if (_log.isInfoEnabled()) | ||
| _log.info("Reinitializing Redis connection pool"); | ||
|
|
||
| // Shut down pool, effectively closing all connections | ||
| // Note that there is a chance that connections which are currently in use are still returned to the | ||
| // pool after the shutdown because of a race condition, but since we create a new instance below | ||
| // unconditionally we can at least be sure no operation will fail a second time | ||
| shutdown(); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's also another race condition:
Most probably this is just inefficient, but in combination with the broad exception handling I fear that it might have worse consequences and might result in continuous connection creation / shutdown. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a race, but will not lead to an endless loop. After there are no "unhealthy" connections in the pool anymore, the pool will just build up healthy connections again. I am not sure of the reasons a healthy connection might ever fail if there is no real "connection problem". I would be willing to test this code even on our live systems for a few days to make sure it does not happen too frequently. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't see this as a prerequisite to merging this, as I'd assume that you'd submit a follow up PR if this causes issues :-) |
||
|
|
||
| if (_log.isDebugEnabled()) { | ||
| _log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) after reinitializing pool", | ||
| _uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss"))); | ||
| } | ||
| return createJedisInstance(); | ||
| } | ||
|
|
||
| BinaryJedis res; | ||
| if((res = _queue.poll()) == null) { | ||
|
|
||
| if ((res = _queue.poll()) == null) { | ||
| if (_log.isDebugEnabled()) { | ||
| _log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s)", | ||
| _uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss"))); | ||
| } | ||
| return createJedisInstance(); | ||
| } | ||
|
|
||
| if (knownGood) { | ||
| // Check all existing connections until we find a good one | ||
| do { | ||
| try { | ||
| res.ping(); | ||
| // Check all existing connections until we find a good one | ||
| do { | ||
| try { | ||
| res.ping(); | ||
|
|
||
| if (_log.isTraceEnabled()) | ||
| _log.trace(format("Using known-good connection #%d", _queue.size())); | ||
| if (_log.isTraceEnabled()) | ||
| _log.trace(format("Using known-good connection #%d", _queue.size())); | ||
|
|
||
| return res; | ||
| } catch (Exception e) { | ||
| if (_log.isDebugEnabled()) | ||
| _log.debug(format("Removing connection #%d since it cannot be pinged", _queue.size())); | ||
| return res; | ||
| } catch (Exception e) { | ||
| if (_log.isDebugEnabled()) | ||
| _log.debug(format("Removing connection #%d since it cannot be pinged", _queue.size())); | ||
|
|
||
| try { res.close(); } catch (Exception e2) { /* ignore */ } | ||
| } | ||
| } while ((res = _queue.poll()) != null); | ||
|
|
||
| // No existing connections are good, so create new one | ||
| if (_log.isDebugEnabled()) { | ||
| _log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) since all existing connections were bad", | ||
| _uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss"))); | ||
| try { res.close(); } catch (Exception e2) { /* ignore */ } | ||
| } | ||
| } while ((res = _queue.poll()) != null); | ||
|
|
||
| return createJedisInstance(); | ||
| } else { | ||
| if (_log.isTraceEnabled()) | ||
| _log.trace(format("Using connection #%d", _queue.size())); | ||
|
|
||
| return res; | ||
| // No existing connections are good, so create new one | ||
| if (_log.isDebugEnabled()) { | ||
| _log.debug(format("Creating new Jedis instance (host=%s, port=%s, ssl=%s) since all existing connections were bad", | ||
| _uri.getHost(), _uri.getPort(), _uri.getScheme().startsWith("rediss"))); | ||
| } | ||
|
|
||
| return createJedisInstance(); | ||
| } | ||
|
|
||
| public void returnInstance(BinaryJedis instance) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you wrote catching only JedisConnectionException was too restrictive. I'm not sure about the various jedis failure scenarios / reasons, but maybe there are ones where closing the client does not help? I wonder if the exception handling should be a bit more elaborate here / if the client should only be closed in certain cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I had stack traces like this:
And other exceptions being emitted from inside the Jedis library, like NullPointerException. Of course, these might ultimately be considered bugs in Jedis, but it seems to me that it is just the safest for us not to rely on Jedis being bugfree but just catch every Exception of this single, very specific command. The intention being: If this command fails, then close the connection.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the mentioned exception an indicator that the jedis instance is used by multiple threads, or - more generally - used in a wrong way, according to redis/jedis#359 or redis/jedis#1184?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding NPEs or other exceptions, how does closing the connection help? Was the connection unusable after such an exception occurred?
Even if we would handle jedis bugs here as a workaround I think these issues should definitely be addressed at the jedis guys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have a hard time reproducing this (and sending detailed reports to the Jedis developers) since it only happened once during an ElastiCache failover (we had several successful ElastiCache failovers before). I don't think we are using connections in multiple threads concurrently, except if our pool code had errors which I don't see right now.
Anyway, in the incident I reported, after the failover, it seemed that "something" was wrong with the connections and that after restarting the webworkers (which triggered a reconnection of all Jedis instances of course) all was good. That leads to the conclusion that whenever "something" is wrong with the connections, reconnecting is a safe bet, especially as in regular operation there should be no reason for a connection to ever fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, got it. Then let's keep it as it is.