
Discussion: Sequence ID validation race condition when requesting snapshot via REST


For exchanges where the l2 snapshot is requested over REST, there is a fairly common race condition that can lead to missing order book data.

The problem, with an example

I’ll use Bittrex as an example since the issue happens often for me with Bittrex. The culprit code looks like this:

_sendSubLevel2Updates(remote_id, market) {
  this._requestLevel2Snapshot(market);
  this._wss.send(
    JSON.stringify({
      H: "c3",
      M: "Subscribe",
      A: [[`orderbook_${remote_id}_${this.orderBookDepth}`]],
      I: ++this._messageId,
    })
  );
}

The issue is as follows:

  1. I call subscribeLevel2Updates().
  2. It first makes a REST request for the snapshot via this._requestLevel2Snapshot(market).
  3. While that REST request is in flight, the ws "Subscribe" message is sent to subscribe to l2 updates.
  4. The REST call gets its response first and emits an l2snapshot event with a sequenceId of "100".
  5. The ws stream of l2 updates starts arriving, and the first event has a sequenceId of "105".

The snapshot has a sequenceId of "100", while the first update has a sequenceId of "105". That means updates 101 through 104 were missed, so your order book will never be valid.
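The gap check itself is simple arithmetic. The sketch below assumes, as in this example, that sequence IDs are integers that increase by exactly one per update; the helper name countMissedUpdates is hypothetical and not part of ccxws.

```javascript
// Hypothetical helper: how many updates fall between a snapshot and the first
// streamed update? Assumes sequence IDs increase by exactly 1 per update.
function countMissedUpdates(snapshotSeq, firstUpdateSeq) {
  // Updates snapshotSeq + 1 .. firstUpdateSeq - 1 were never received.
  return Math.max(0, firstUpdateSeq - snapshotSeq - 1);
}

console.log(countMissedUpdates(100, 105)); // 4: updates 101-104 are missing
console.log(countMissedUpdates(100, 101)); // 0: contiguous, the book can be built
console.log(countMissedUpdates(100, 98));  // 0: overlap, drop updates <= 100 instead
```

A result of 0 does not by itself mean the book is valid: when the first update is at or below the snapshot's sequenceId, those overlapping updates still have to be discarded.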

Solutions

  • Subscribing to l2 updates first and waiting for the first l2 update before requesting the snapshot over REST would work, and I believe this may be the best backwards-compatible fix. You would have to ignore any early l2 updates with a sequenceId at or below the snapshot's sequenceId, but that issue already exists with snapshots over REST in the current implementation. I imagine something like the following (not run or tested):
 _sendSubLevel2Updates(remote_id, market) {
   // When the first l2update for this market arrives, request the snapshot so it reflects
   // a point in time >= the first update, avoiding a gap between the snapshot and the stream
   const onFirstUpdate = (update, updateMarket) => {
     if (updateMarket.id !== market.id) return;
     // Remove the listener so the snapshot is requested only once and handlers don't pile up
     this.removeListener("l2update", onFirstUpdate);
     this._requestLevel2Snapshot(market);
   };
   this.on("l2update", onFirstUpdate);
   this._wss.send(
     JSON.stringify({
       H: "c3",
       M: "Subscribe",
       A: [[`orderbook_${remote_id}_${this.orderBookDepth}`]],
       I: ++this._messageId,
     })
   );
 }
  • As a solution outside of this library, code using ccxws could detect this scenario and call client._requestLevel2Snapshot(market) again to get a newer snapshot, then ignore any updates with a sequenceId at or below the newer snapshot's sequenceId. This works and is what I'm doing right now, but I'm wondering if there's a better, more generic way to handle it.
  • If there is no solution then I think this scenario should at least be documented prominently somewhere.
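The second option (detect a stale snapshot and ask again) can be sketched as a small, client-agnostic synchronizer. This is a minimal sketch, not ccxws API: L2Resync, onUpdate, onSnapshot and the requestSnapshot callback are hypothetical names, and it assumes integer sequence IDs that increase by exactly one per update and arrive in order. Updates are buffered until a usable snapshot arrives. Buffered updates at or below the snapshot's sequenceId are dropped, and if the oldest remaining update does not directly follow the snapshot, a fresh snapshot is requested.

```javascript
// Hypothetical helper (not part of ccxws): keeps an L2 book consistent when the
// snapshot comes from REST and updates come from the socket.
// Assumes integer sequence IDs that increase by exactly 1 per update, in order.
class L2Resync {
  constructor(requestSnapshot) {
    this.requestSnapshot = requestSnapshot; // e.g. () => client._requestLevel2Snapshot(market)
    this.buffer = [];    // updates received while no usable snapshot is held
    this.lastSeq = null; // sequenceId of the last applied snapshot or update
    this.applied = [];   // sequence IDs applied so far, kept for illustration
  }

  onUpdate(update) {
    if (this.lastSeq === null) {
      this.buffer.push(update); // no usable snapshot yet: queue it
      return;
    }
    if (update.sequenceId <= this.lastSeq) return; // already covered
    if (update.sequenceId !== this.lastSeq + 1) {
      // Gap in the live stream: start over from a new snapshot
      this.lastSeq = null;
      this.buffer = [update];
      this.requestSnapshot();
      return;
    }
    this.apply(update);
  }

  onSnapshot(snapshot) {
    // Drop queued updates that the snapshot already includes
    const pending = this.buffer.filter(u => u.sequenceId > snapshot.sequenceId);
    if (pending.length > 0 && pending[0].sequenceId !== snapshot.sequenceId + 1) {
      // Snapshot is older than the stream (the race above): keep queueing, ask again
      this.buffer = pending;
      this.requestSnapshot();
      return;
    }
    this.lastSeq = snapshot.sequenceId;
    this.buffer = [];
    for (const u of pending) this.onUpdate(u); // replay queued updates in order
  }

  apply(update) {
    // A real implementation would merge the update's asks/bids into the book here
    this.lastSeq = update.sequenceId;
    this.applied.push(update.sequenceId);
  }
}

// Usage: the scenario from the issue
const requests = [];
const sync = new L2Resync(() => requests.push("snapshot"));
[105, 106, 107].forEach(seq => sync.onUpdate({ sequenceId: seq }));
sync.onSnapshot({ sequenceId: 100 }); // stale: 101-104 missing, so a new snapshot is requested
sync.onSnapshot({ sequenceId: 106 }); // 105 and 106 dropped, 107 applied
// requests.length === 1, sync.applied is [107]
```

With ccxws, onSnapshot and onUpdate would presumably be called from the client's l2snapshot and l2update event handlers; the wiring is left out because the exact payload shape may differ per exchange.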

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
evan-coygo commented, Jan 14, 2021

I’ll be honest, I didn’t know that was a thing. Great idea though to avoid clutter in Issues. Discussion created here https://github.com/altangent/ccxws/discussions/255, I’m closing this issue

1 reaction
bmancini55 commented, Jan 14, 2021

@evan-coygo, thanks for submitting the issue! Yeah, this is sort of a known issue that rears its ugly head when you try to build certain books.

To rehash, the issue is that the socket stream starts after the sequenceId returned by the REST API. In many cases I think this is due to request caching on the REST API, so you end up with a snapshot that is slightly stale (lower sequenceId) compared with what the socket stream delivers. It could also just be that establishing socket subscriptions takes longer than the REST request for some exchanges. I digress on the cause.

From a code perspective, to avoid this issue, you need to perform the snapshot after the subscription has been confirmed and you’ve queued a sufficient number of messages (intentionally ambiguous as the sufficient depth is likely exchange and latency specific).
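A minimal sketch of that ordering, assuming a hypothetical minQueued threshold (the comment deliberately leaves the right depth open, so it is a parameter here) and a requestSnapshot callback such as () => client._requestLevel2Snapshot(market). Receiving the first update is used as the signal that the subscription is live; updates are queued from then on, and the snapshot is requested exactly once when the queue reaches the threshold. The queued updates still have to be reconciled against the snapshot by sequenceId, since a cached REST response can lag regardless.

```javascript
// Hypothetical helper: request the REST snapshot only after `minQueued` socket
// updates have been queued, so the snapshot is more likely to be at or past the
// start of the queued stream. `minQueued` is exchange- and latency-specific.
function createSnapshotGate(minQueued, requestSnapshot) {
  const queue = [];
  let requested = false;
  return {
    queue,
    onUpdate(update) {
      queue.push(update);
      if (!requested && queue.length >= minQueued) {
        requested = true; // request once; later updates keep queueing
        requestSnapshot();
      }
    },
  };
}

// Usage
let calls = 0;
const gate = createSnapshotGate(3, () => calls++);
[1, 2, 3, 4].forEach(seq => gate.onUpdate({ sequenceId: seq }));
// calls === 1 (fired on the 3rd update), gate.queue.length === 4
```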

In the past I've recommended monkey patching a time delay onto _requestLevel2Snapshot(market) for the client:

// Delay every REST snapshot request so the socket stream has time to start first
const REST_DELAY_MS = 500;
client._originalRequestLevel2Snapshot = client._requestLevel2Snapshot;
client._requestLevel2Snapshot = market =>
  setTimeout(() => client._originalRequestLevel2Snapshot(market), REST_DELAY_MS);

However, this solution isn’t foolproof and may introduce some funkiness on reconnections.

Ideally (as usual), the best fix would come after the refactor (#149), which will include subscription success/failure (#103): the snapshot request could then be fired after a delay following subscription success.

As a fallback, though, I think you need to be prepared to call _requestLevel2Snapshot again if you receive a snapshot older than your update stream. That should certainly be documented!
