Undefined behavior for ZSCAN using readonly cluster client.

When using a ClusterReadOnlyConnectionPool, zscan and zscan_iter do not uphold the SCAN guarantees. I believe the same applies to sscan and hscan, though I have not tested them.

The issue arises from the overridden get_node_by_slot on ClusterReadOnlyConnectionPool, which randomly chooses among the master and slave nodes that serve the slot for the key. As a result, the successive scan commands of a single iteration are sent to random nodes. The cursor values Redis returns are stateless; they are effectively offsets. However, they are not guaranteed to be consistent between master and slave.
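To make the failure mode concrete, here is a toy simulation. It is not the library's code: ToyNode, scan_all, and the plain-offset cursor are all assumptions made for illustration, and real Redis cursors are more involved. It only shows what happens when one cursor is carried across two nodes that hold the same members in different internal layouts.

```python
from itertools import cycle


class ToyNode:
    """Hypothetical stand-in for one Redis node: the cursor is a plain
    offset into this node's private ordering of the members."""

    def __init__(self, ordering, batch=10):
        self.ordering = list(ordering)
        self.batch = batch

    def scan(self, cursor):
        chunk = self.ordering[cursor:cursor + self.batch]
        nxt = cursor + self.batch
        # A returned cursor of 0 signals the end of the iteration.
        return (0 if nxt >= len(self.ordering) else nxt), chunk


def scan_all(pick_node):
    """Run one full scan, asking pick_node() which node serves each call."""
    seen, cursor = set(), 0
    while True:
        cursor, chunk = pick_node().scan(cursor)
        seen.update(chunk)
        if cursor == 0:
            return seen


members = range(40)
master = ToyNode(members)
replica = ToyNode(reversed(members))  # same data, different internal layout

# Every call goes to the same node: the scan is complete.
pinned = scan_all(lambda: master)

# Calls alternate between nodes while reusing one cursor: members are missed.
rotation = cycle([master, replica])
mixed = scan_all(lambda: next(rotation))

print(len(pinned), len(mixed))
```

In this toy setup the pinned scan sees all 40 members, while the alternating scan terminates normally after seeing only 20 of them, with no error to indicate that anything was skipped.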

An example I used to debug this was on a sorted set that currently contains 164 values, but has fluctuated significantly over its lifetime. If I issue an initial ZSCAN request to the master that owns it, I get the following response:

Command: ZSCAN mysortedset 0

1) "136"
2)  1) "2836834"
    2) "0.4802068521853653"
    3) "3599906"
    4) "0.4656334842469258"
    5) "3490931"
    6) "0.22173393426291121"
    7) "82109"
    8) "0.48914307544693797"
    9) "1244405"
   10) "0.53974599172088655"
   11) "2199081"
   12) "0.34818095160929963"
   13) "3967992"
   14) "0.49706414372896185"
   15) "1390822"
   16) "0.20662256529819331"
   17) "540680"
   18) "0.53718780580831582"
   19) "2840317"
   20) "0.20240259687939222"
   21) "812937"
   22) "0.16832229396956788"
   23) "2181749"
   24) "0.23085035776582155"

Note: cursor value of 136.

If I run this same command on the slave node that has this slot, I get the following results.

1) "240"
2)  1) "1078146"
    2) "0.19975230285488776"
    3) "3788365"
    4) "0.59107889142186887"
    5) "3195524"
    6) "0.30029325316059524"
    7) "1325801"
    8) "0.42741925550104209"
    9) "769388"
   10) "0.19214136348401703"
   11) "3718988"
   12) "0.22575183419216338"
   13) "3580511"
   14) "0.3962135436706839"
   15) "380687"
   16) "0.3458031319795174"
   17) "1999627"
   18) "0.60407053063340199"
   19) "1274471"
   20) "0.37309465665899166"

Note: Different values, fewer values, and different cursor offset.

As long as the underlying set doesn’t change, reissuing these commands on the same server provides consistent results, but switching between servers means you end up with an arbitrary subset of what is actually in the sorted set. This can be demonstrated through the Python library with the following (called on the same set as above, with no underlying changes in between):

Using the readonly cluster client, called in quick succession:

In [41]: len(set([i for i, _ in readonly_cluster.zscan_iter("mysortedset")]))
Out[41]: 120

In [42]: len(set([i for i, _ in readonly_cluster.zscan_iter("mysortedset")]))
Out[42]: 124

In [43]: len(set([i for i, _ in readonly_cluster.zscan_iter("mysortedset")]))
Out[43]: 122

In [44]: len(set([i for i, _ in readonly_client.zscan_iter("mysortedset")]))
Out[44]: 125

Using a non-readonly cluster client, called in quick succession:

In [45]: len(set([i for i, _ in cluster.zscan_iter("mysortedset")]))
Out[45]: 164

In [46]: len(set([i for i, _ in cluster.zscan_iter("mysortedset")]))
Out[46]: 164

In [47]: len(set([i for i, _ in cluster.zscan_iter("mysortedset")]))
Out[47]: 164
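One way to sidestep the problem while it is unresolved is to drive the cursor loop yourself against a client that is bound to a single node. The sketch below is an assumption-laden illustration, not a fix from the library: zscan_pinned and FakeNode are hypothetical names, and the only thing assumed about the client is a redis-py style zscan(name, cursor=..., count=...) that returns (next_cursor, [(member, score), ...]).

```python
def zscan_pinned(client, key, count=None):
    """Collect every member/score pair of a sorted set from ONE client.

    `client` is assumed to talk to a single node for the whole loop, so
    each cursor is handed back to the node that produced it.
    """
    cursor, seen = 0, {}
    while True:
        cursor, pairs = client.zscan(key, cursor=cursor, count=count)
        seen.update(pairs)  # the dict de-duplicates members returned twice
        if int(cursor) == 0:  # cursor 0 marks the end of the iteration
            return seen


class FakeNode:
    """Hypothetical in-memory stand-in so the sketch runs without Redis."""

    def __init__(self, pairs, batch=3):
        self.pairs = list(pairs)
        self.batch = batch

    def zscan(self, name, cursor=0, count=None):
        chunk = self.pairs[cursor:cursor + self.batch]
        nxt = cursor + self.batch
        return (0 if nxt >= len(self.pairs) else nxt), chunk


data = [("member%d" % i, i / 10.0) for i in range(8)]
result = zscan_pinned(FakeNode(data), "mysortedset")
print(len(result))
```

Against a real cluster, the client passed in would have to be a plain connection to one node that serves the key's slot (an assumption about your deployment). Keep in mind that a replica only answers reads on such a connection after READONLY has been sent.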

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

Grokzen commented, Mar 6, 2018 (1 reaction)

@theanti9 I am working on a fix. It should not be that difficult to solve; I just have to identify all methods that require this change and patch them inside the ReadonlyConnectionPool. Will update when I am done with it.

Grokzen commented, Sep 20, 2020 (0 reactions)

Just a note: all xSCAN-style methods are either broken or do not work properly, in both normal and failover cases. They need another pass to make them work across multiple nodes in the cluster and with the cursor mechanism. I will put this into the 3.0.0 backlog to take another big look at all xSCAN methods and sort them out.