Lease expiration managed by the PersistentLease?
Hello! I'm seeing spurious expiration of leases managed by PersistentLease, even though there is no network partition or hardware overload.
The problem is that, sometimes, all leases managed by PersistentLease instances expire on the etcd server side and never become active again (or get re-granted) until the etcd server restarts. When the issue hits, the persistent lease instance is not notified of the LeaseState.EXPIRED state even though the lease has actually expired on the etcd server. Interestingly, an EXPIRED event is fired, immediately followed by an ACTIVE event, once the client reconnects to the restarted etcd server.
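To check whether an EXPIRED event is ever delivered before the server restart, the state callbacks can be recorded as they arrive. The following is only a minimal sketch: LeaseStateLog and its nested State enum are hypothetical stand-ins declared locally (only the three state names mentioned in this issue are modeled), not etcd-java types, and onState would have to be wired to whatever callback delivers the lease state changes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of etcd-java. */
final class LeaseStateLog {

    /** Local stand-in for the lease states mentioned in this issue. */
    enum State { ACTIVE, EXPIRED, CLOSED }

    private final List<String> transitions = new ArrayList<>();
    private State last; // null until the first state is observed

    /** Records a transition whenever the observed state actually changes. */
    synchronized void onState(State next) {
        if (next != last) {
            String from = (last == null) ? "START" : last.name();
            transitions.add(from + "->" + next.name());
            last = next;
        }
    }

    /** True only while the most recently observed state is ACTIVE. */
    synchronized boolean usable() {
        return last == State.ACTIVE;
    }

    /** Returns a copy of the recorded transitions, oldest first. */
    synchronized List<String> transitions() {
        return new ArrayList<>(transitions);
    }
}
```

If the recorded transitions show no EXPIRED entry between the server-side expiry and the restart, that would support the observation that the client never sees the expiration.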
I believe (from some observation and code inspection) that the persistent lease monitors the lease state, re-creates an expired (not closed) lease, and exposes its id through PersistentLease.getLeaseId() once the lease id is renewed. So I send a TTL request to make sure the lease is OK to be attached to an entity (the "if ttl > 0" part). Here's roughly what I'm doing to create/refresh a PersistentLease-tied entity.
    long getValidLease(PersistentLease lease) {
        long validLease = -1;
        // lightly spin until I get a valid ttl response and id.
        // normally the body gets executed exactly once.
        do {
            // omitted: throw if the lease is CLOSED
            // since lease.getLeaseId() does not guarantee that the lease id is valid,
            // I chose to use a direct TTL request to query its state.
            ttlResp = etcdLease.ttl(lease.getLeaseId()); // lease id is updated by the event loop
            if (ttlResp.getTTL() > 0)
                validLease = ttlResp.getID();
        } while (lease.getCurrentTtlSecs() < 1); // also gets updated by the event loop
        return validLease;
    }

    long count(ByteString key) {
        return etcdKV.get(key).countOnly().async()
                .get(1000ms).getCount(); // 1 second timed wait-and-get
    }

    // operation PUT
    long validLease = getValidLease(persistentLease);
    etcdKV.put(key, data, validLease);

    // operation REFRESH
    if (count(key) == 0) {
        PUT_OPERATION(key, data); // put operation right above
    }
All the entities (not many, < 20) get refreshed every 5 seconds. But after the spurious lease expiration, all operations either hang at the do-while loop in getValidLease or get expired lease ids through getValidLease, and the following operations fail because the given lease id has already expired.
The etcd server looks OK. At that moment the etcd debug log shows that TTL requests from the do-while loop arrive and get answered at a very high rate (due to the do-while loop), and further requests from other clients (like the etcdctl shipped with the server distribution) are handled properly. Even granting a new lease from the same etcd-java client and making it persistent succeeds! It seems that the internal gRPC client and event loop assigned to the persistent lease fail to handle responses from the server for some reason.
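The TTL polling described above could be bounded so that it neither floods the server nor hangs forever. The following is a minimal sketch under stated assumptions, not etcd-java API: BoundedLeaseWait is a hypothetical helper, the leaseId supplier stands in for reading the current lease id, the ttlProbe function stands in for a blocking TTL request, and a lease id of 0 is assumed to mean "no lease yet".

```java
import java.time.Duration;
import java.util.OptionalLong;
import java.util.function.LongSupplier;
import java.util.function.LongUnaryOperator;

/** Hypothetical helper; not part of etcd-java. */
final class BoundedLeaseWait {

    /**
     * Polls until ttlProbe reports a positive TTL for the current lease id,
     * doubling the pause between attempts (capped at 1 second) and giving up
     * once the timeout has elapsed.
     *
     * @param leaseId  supplies the current lease id (assumed: 0 means none yet)
     * @param ttlProbe maps a lease id to its remaining TTL in seconds
     * @return the lease id that had a positive TTL, or empty on timeout/interrupt
     */
    static OptionalLong awaitValidLease(LongSupplier leaseId, LongUnaryOperator ttlProbe,
                                        Duration timeout, Duration initialBackoff) {
        long deadline = System.nanoTime() + timeout.toNanos();
        long sleepMs = Math.max(1, initialBackoff.toMillis());
        while (true) {
            long id = leaseId.getAsLong();
            if (id != 0 && ttlProbe.applyAsLong(id) > 0) {
                return OptionalLong.of(id);
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMs <= 0) {
                return OptionalLong.empty(); // caller decides how to recover
            }
            try {
                Thread.sleep(Math.min(sleepMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return OptionalLong.empty();
            }
            sleepMs = Math.min(sleepMs * 2, 1_000);
        }
    }
}
```

In the setup from this issue, the two functional parameters would be backed by the persistent lease's id accessor and a blocking TTL request. An empty result would signal that the lease did not recover within the timeout, which the caller could treat as a reason to create a fresh persistent lease instead of spinning.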
The issue appears randomly, regardless of the server load. As mentioned earlier, one simple workaround is to restart the etcd server. After etcd-java reconnects to the restarted server, all operations work as expected again.
The etcd server (single-instance configuration) is deployed in a small testbed, and a Spring Boot application using etcd-java runs on the same host, which means the client connects to the etcd server using localhost as the address.
Is there any recommended way of dealing with the validity of a persistent lease, or am I missing something crucial?
Thanks @hsyhsw, no need to include a jar, just (preferably minimal) source code, e.g. just a class with main method would be great.
Great, thanks @hsyhsw! (though I know it took a long time to show up last time you tried so maybe it’s not definite yet…)