_version does not uniquely identify a particular version of a row

Hi there!

I’ve been looking into crate.io recently, and I’ve found something that feels a little bit surprising. If you perform a number of concurrent blind updates to a 0.54.9-1~jessie cluster that experiences a network partition, it’s possible for two reads of a single row to have the same _version but different values.

For instance, here are two primary-key reads (select (value, _version) from registers where id = ?) which returned different values for the same version of the row:

      [{:value 8631, :_version 8622} {:value 8625, :_version 8622}]

Or repeated reads which do not agree:

      [{:value 8570, :_version 8568}
       {:value 8570, :_version 8568}
       {:value 8572, :_version 8568}
       {:value 8570, :_version 8568}
       {:value 8572, :_version 8568}
       {:value 8570, :_version 8568}
       {:value 8572, :_version 8568}]
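
For anyone who wants to check their own read histories for this anomaly, here is a minimal Python sketch. It is not part of the Jepsen test; the function name `divergent_versions` and the dict-based record shape are assumptions made for illustration, and the sample reads are copied from the outputs above.

```python
from collections import defaultdict


def divergent_versions(reads):
    """Return {version: sorted distinct values} for every _version
    that was observed with more than one value."""
    values_by_version = defaultdict(set)
    for read in reads:
        values_by_version[read["_version"]].add(read["value"])
    return {
        version: sorted(values)
        for version, values in values_by_version.items()
        if len(values) > 1
    }


# Sample reads taken from the outputs quoted in this issue.
reads = [
    {"value": 8570, "_version": 8568},
    {"value": 8570, "_version": 8568},
    {"value": 8572, "_version": 8568},
    {"value": 8631, "_version": 8622},
    {"value": 8625, "_version": 8622},
]

print(divergent_versions(reads))
# {8568: [8570, 8572], 8622: [8625, 8631]}
```

An empty result means every observed _version mapped to exactly one value in that history.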

Crate’s optimistic concurrency control docs say things like “This [version] is increased by 1 on every update” and “Querying for the correct _version ensures that no concurrent update has taken place.” This suggests to me that _version should uniquely identify a version of a row. The behavior shown above undermines the safety of Crate’s concurrency model: even if conditional updates based on _version are safe, if clients can’t agree on what value a particular _version identifies, there’s no way to avoid concurrency anomalies. I imagine lost updates might be possible. This might also affect the safety of SQL update statements that do not rely explicitly on _version, for instance UPDATE foo SET visits = visits + 1, but I haven’t tested those yet.
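
To illustrate how a lost update could follow from an ambiguous _version, here is a small Python simulation. It is only a sketch under assumptions: the `Replica` class is a hypothetical in-memory stand-in, not Crate’s storage or API, and the scenario (a client reading from one diverged copy and writing to another) is a guess at one possible interleaving rather than something observed in the test.

```python
class Replica:
    """Hypothetical in-memory copy of one row (not Crate's real storage)."""

    def __init__(self, value, version):
        self.value = value
        self.version = version

    def read(self):
        return self.value, self.version

    def conditional_update(self, new_value, expected_version):
        """Apply the write only if the stored version matches,
        mimicking UPDATE ... WHERE _version = ?."""
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


# Two copies that diverged: same _version, different values
# (numbers taken from the reads quoted in this issue).
stale = Replica(value=8570, version=8568)
primary = Replica(value=8572, version=8568)

# The client reads from the stale copy, then writes to the other one.
seen_value, seen_version = stale.read()
applied = primary.conditional_update(seen_value + 1, seen_version)

print(applied, primary.read())
# True (8571, 8569)
```

The version check passes even though the client never saw 8572, so the write that produced 8572 is silently overwritten.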

You can reproduce this behavior by cloning Jepsen at b25e636f and running lein test in crate/, with the standard five-node setup; see Jepsen’s docs for details. Or, you should be able to reproduce it yourself by having five clients, each bound to one host of a five-node cluster, perform concurrent writes to a single row, and having five more clients perform concurrent reads, recording their _versions. A ~200 second network partition isolating each node from two of its neighbors, forming an overlapping ring topology, appears to be sufficient to induce this behavior, but that’s literally the first failure mode I tried, so there may be simpler ones.

As advised, I’m using an explicit expected-nodes count, majority values for all minimum-master-limits in the config file, and I’ve lowered some timeouts to speed up the testing process. The table is a simple (pkey, value) table, replicated to all nodes.

I suspect this issue stems from (and also affects) whatever underlying Elasticsearch version you’re using, but it’s possible those problems have been resolved in 5.0.0. As a courtesy to your customers, may I recommend you adopt their resiliency status as a part of your documentation, so users know what behaviors they can expect?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
aphyr commented, Jun 28, 2016

Hiya! I’ve written up a more thorough description of the test and results here: https://aphyr.com/posts/332-jepsen-crate-0-54-9-version-divergence, and the latest Jepsen commits include a more reliable failure schedule. I can also confirm that this fault manifests with simple network partitions, in addition to overlapping majorities.

0 reactions
seut commented, Jul 5, 2022

This issue has been fixed in CrateDB > 4.0.0 with the introduction of _seq_no and _primary_term, which the documentation now recommends for optimistic concurrency control instead of the deprecated _version column. See also https://crate.io/docs/crate/reference/en/4.8/appendices/resiliency.html#version-number-representing-ambiguous-row-versions.
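
As a rough sketch of what a conditional update keyed on _seq_no and _primary_term might look like from a client, the Python helper below only builds a parameterized statement. It assumes the registers table with id and value columns from the original report; the function name `conditional_update_statement` is hypothetical, no database connection is made, and the seq_no and primary_term arguments are expected to come from a prior SELECT of the same row.

```python
def conditional_update_statement(row_id, new_value, seq_no, primary_term):
    """Build an UPDATE that only applies if the row still has the
    _seq_no and _primary_term the client previously read.
    Table and column names are assumptions taken from the issue."""
    sql = (
        "UPDATE registers SET value = ? "
        "WHERE id = ? AND _seq_no = ? AND _primary_term = ?"
    )
    return sql, (new_value, row_id, seq_no, primary_term)


# Example values are made up for illustration only.
sql, params = conditional_update_statement(
    row_id=1, new_value=8571, seq_no=42, primary_term=3
)
print(sql)
print(params)
```

If the statement reports zero affected rows, the row changed since it was read, and the client should re-read and retry.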
