Improvement of JedisClusterInfoCache#renewClusterSlots
Recently I noticed the number of threads of a Java application running in production increased significantly and then recovered in a short time (~1 minute). Many threads had the stack trace below:
"thrift-worker-715" Id=11815 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@4fb62711 owned by "thrift-worker-718" Id=11818
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@4fb62711
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at redis.clients.jedis.JedisClusterInfoCache.getSlotPool(JedisClusterInfoCache.java:234)
at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:62)
at redis.clients.jedis.JedisClusterConnectionHandlerWraper.getConnectionFromSlot(JedisClusterConnectionHandlerWraper.java:103)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:116)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:31)
and the stack trace of thread thrift-worker-718:
"thrift-worker-718" Id=11818 RUNNABLE (in native)
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:127)
at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:196)
at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
at redis.clients.jedis.Protocol.process(Protocol.java:151)
at redis.clients.jedis.Protocol.read(Protocol.java:215)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:340)
at redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:239)
at redis.clients.jedis.BinaryJedis.quit(BinaryJedis.java:253)
at redis.clients.jedis.JedisFactory.destroyObject(JedisFactory.java:88)
at org.apache.commons.pool2.impl.GenericObjectPool.destroy(GenericObjectPool.java:921)
at org.apache.commons.pool2.impl.GenericObjectPool.invalidateObject(GenericObjectPool.java:626)
at redis.clients.util.Pool.returnBrokenResourceObject(Pool.java:101)
at redis.clients.jedis.JedisPool.returnBrokenResource(JedisPool.java:239)
at redis.clients.jedis.JedisPool.returnBrokenResource(JedisPool.java:16)
at redis.clients.jedis.Jedis.close(Jedis.java:3407)
at redis.clients.jedis.JedisClusterInfoCache.renewClusterSlots(JedisClusterInfoCache.java:110)
at redis.clients.jedis.JedisClusterConnectionHandler.renewSlotCache(JedisClusterConnectionHandler.java:52)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:135)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:141)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:141)
at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:141)
at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:31)
It seems many threads were waiting for a write lock held by another thread, but that thread's I/O operation was quite slow; the client may have had a network issue with one of the Redis nodes. In our production environment the renewClusterSlots operation takes 500+ ms (connectTimeout 200 ms + soTimeout 300 ms), which caused lots of operations to time out.
So I looked into the source code of redis.clients.jedis.JedisClusterInfoCache#renewClusterSlots.
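To make the contention easier to see, here is a deliberately simplified sketch of the pattern, not the actual Jedis source; the class name, fields, and the rebuildSlotCache helper are invented for illustration. The point is that the write lock is held across the whole slot-discovery I/O, so every reader of getSlotPool blocks for up to connectTimeout + soTimeout:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hypothetical, simplified slot cache; only loosely mirrors JedisClusterInfoCache.
class SlotCacheSketch {
  private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
  private final Lock r = rwl.readLock();
  private final Lock w = rwl.writeLock();
  private final Map<Integer, JedisPool> slotPools = new HashMap<>();

  void renewClusterSlots(Jedis jedis) {
    w.lock();                                    // writer takes the lock...
    try {
      List<Object> slots = jedis.clusterSlots(); // ...and holds it across slow network I/O
      rebuildSlotCache(slots);                   // plus closing of broken connections
    } finally {
      w.unlock();
    }
  }

  JedisPool getSlotPool(int slot) {
    r.lock();   // readers pile up here until the writer finishes its I/O
    try {
      return slotPools.get(slot);
    } finally {
      r.unlock();
    }
  }

  private void rebuildSlotCache(List<Object> slots) {
    // omitted: parse the CLUSTER SLOTS reply and repopulate slotPools
  }
}
```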
There are two points I think could be improved:
1. Reduce lock granularity: move the I/O operation out of the lock block (see the sketch below).
2. Call renewClusterSlots while explicitly excluding the Redis node that caused the IOException (it may have a network issue with the client, or may simply be down).
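A minimal sketch of point 1, reusing the hypothetical SlotCacheSketch fields above (again, this is not the actual change in #2514): query CLUSTER SLOTS before taking the write lock, and hold the lock only for the in-memory swap, so readers are blocked for microseconds rather than for the full network round trip.

```java
  // Point 1: network I/O happens with no lock held; the write lock only
  // protects the cheap in-memory rebuild of the slot-to-pool map.
  void renewClusterSlots(Jedis jedis) {
    List<Object> slots = jedis.clusterSlots();  // slow network I/O, no lock held
    w.lock();
    try {
      rebuildSlotCache(slots);                  // in-memory swap only
    } finally {
      w.unlock();
    }
  }
```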
I would like to open a PR to do this. Any thoughts?
Resolved by #2514
@sazzad16 Could you please take a look at #2514?