
MongoDB cluster issues: retry on disconnect does not fail over, and write errors do not fail over

See original GitHub issue

In the event that a member of the replica set becomes unable to respond during a find(), insert(), or similar operation, there is no automatic failover: the operation fails with an error. Our preference would be for it to be retried automatically.

According to notes by Thomas Chraibi based on input from Charles Sarrazin of MongoDB, one way to achieve this is simply to re-attempt the failed find() or insert() call. The retry causes the MongoDB driver to perform server selection again and discover a functioning replica set node, as in the following pseudocode:

const client = ...; // an already-connected MongoClient
const coll = client.db('cool').collection('llamas');

function attemptInsert(doc, callback) {
  coll.insert(doc, (err, result) => {
    if (err) {
      // You might want to check for certain errors first, as some are
      // unrecoverable and should not be retried:
      // if (err.msg === 'SomeImportantErrorType') return callback(new Error('unrecoverable!'));

      // Recursively call `attemptInsert`; the retry makes the driver perform
      // server selection again, picking a healthy node if one exists.
      attemptInsert(doc, callback);
      return;
    }

    callback(null, result); // success
  });
}
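
Usage would look something like this (the document payload and log messages are placeholders):

attemptInsert({ name: 'Fluffy' }, (err, result) => {
  if (err) return console.error('giving up:', err);
  console.log('inserted, possibly after retries');
});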

In addition, there is apparently some sort of issue with our autoReconnect configuration:

        autoReconnect: true,
        // retry forever
        reconnectTries: Number.MAX_VALUE,
        reconnectInterval: 1000

Apparently this configuration keeps retrying the connection to the downed node for as long as it is down, rather than failing over to another node, which is not ideal.
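
For context, here is a hedged sketch of where these options are typically passed (assuming a 3.x Node.js driver with the legacy topology; the URI and the database/collection names are placeholders):

const { MongoClient } = require('mongodb');

// NOTE: autoReconnect / reconnectTries / reconnectInterval are legacy
// single-server options of the pre-unified-topology driver; they do not
// provide replica set failover, only reconnection to the same server.
MongoClient.connect('mongodb://localhost:27017', {
  autoReconnect: true,
  // retry forever
  reconnectTries: Number.MAX_VALUE,
  reconnectInterval: 1000
}, (err, client) => {
  if (err) throw err;
  const coll = client.db('cool').collection('llamas');
  // ... use coll as usual
});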

However, it is unclear to me why this should occur when find() and insert() operations will apparently make new connections to other nodes as needed, per the pseudocode above.

So, more clarification is needed on the following points before implementation can be completed:

  • In what situation does the autoReconnect behavior come into play?
  • If it is undesirable, what approach would ensure we eventually get connected again to an appropriate node?
  • If new find() and insert() operations already reconnect as needed, is there any value in using autoReconnect at all? What value would that be?
  • What MongoDB errors can be safely classed as “this node is ill or unavailable,” as opposed to “you did something you should not have” (examples: oversize document, illegal operation, unique key violation)? A hypothetical classification is sketched below.
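
As a starting point on that last question, here is a hedged sketch of such a classification; the helper name isRetryableError and the exact code list are assumptions drawn from well-known MongoDB error codes, not an official driver API:

// Hypothetical helper: classify an error as "node is ill/unavailable"
// (retryable) vs. "you did something you should not have" (not retryable).
function isRetryableError(err) {
  // Network-level failures usually surface without a server error code
  if (err.name === 'MongoNetworkError') return true;

  const retryableCodes = new Set([
    6,     // HostUnreachable
    7,     // HostNotFound
    89,    // NetworkTimeout
    91,    // ShutdownInProgress
    10107, // NotMaster (NotWritablePrimary on newer servers)
    11600, // InterruptedAtShutdown
    13436  // NotMasterOrSecondary
  ]);
  if (retryableCodes.has(err.code)) return true;

  // Errors such as a duplicate key (11000) or an oversize document are
  // "you did something wrong" errors and must not be retried.
  return false;
}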

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
boutell commented, Feb 16, 2019

We already fixed the retry-on-disconnect issue, and with the 3.x driver retryable writes are available via a MongoDB URI option, so this can be closed.
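
For reference, a minimal sketch of enabling retryable writes through the URI (host names, the replica set name, and the document are placeholders; retryWrites requires MongoDB 3.6+ and a 3.x driver):

const { MongoClient } = require('mongodb');

// retryWrites=true asks the driver to transparently retry eligible write
// operations once on transient errors, using a server-side transaction
// number so the write reaches the oplog at most once.
const uri = 'mongodb://localhost:27018,localhost:27019,localhost:27020/?retryWrites=true&replicaSet=rs0';

MongoClient.connect(uri, (err, client) => {
  if (err) throw err;
  client.db('cool').collection('llamas').insertOne({ name: 'Fluffy' }, (err2, res) => {
    // a transient failover during this insert is retried automatically
  });
});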

0 reactions
boutell commented, Aug 17, 2018

That helps a lot! Thank you. We’ll discuss and explore.

We are not sharding, so it sounds like the mongos caveat doesn’t apply.

I take your point that it’s possible the write could make it to the oplog twice (i.e. in some scenario be carried out twice) with a retry strategy that doesn’t rely on your new retryable writes. Our client may be OK with using retryable writes and the required driver and server versions, given that the apostrophe-db-mongo-3-driver module has shipped.

On Fri, Aug 17, 2018 at 9:17 AM, Matt Broadstone notifications@github.com wrote:

Ah, that’s a huge help, thank you. So some developers think the driver does more than it does (the autoreconnect misunderstanding), and others think it does less than it does (:

Indeed! Unfortunately, it’s proven somewhat difficult to find time to properly document all of this in my short tenure here. My improved SDAM layer should also alleviate much of this, and will include design documentation - more on that later.

So a reasonable strategy would be:

  • Don’t use autoreconnect.

IMHO, if your goal is resiliency then I would simply never use autoreconnect. There is a marginal case for its use with a single server connection, but only just so.

  • If an individual request fails, and the error code smells like it’s network-y or broken-node-y, simply try that request again (up to some limit of our choosing).

Yep! But I want to make some things very clear: this implementation of retryability (all client side) is subject to errors for writes specifically. The design of retryable writes requires a server-side transaction number, which allows us to verify that a write was made to the oplog at most one time. If you implement retryability on the client side using the pseudo-code I provided above, you run the risk of having multiple writes reach the oplog.

We can do that.
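
To make the “retry up to some limit” idea concrete, here is a hedged sketch of a bounded client-side retry, reusing the hypothetical isRetryableError helper sketched earlier; per the caveat above, without server-side retryable writes a write can still reach the oplog twice:

// A bounded client-side retry (sketch). WARNING: a write may be applied
// twice if the first attempt succeeded but its acknowledgement was lost;
// only server-side retryable writes avoid that.
function insertWithRetry(coll, doc, retriesLeft, callback) {
  coll.insert(doc, (err, result) => {
    if (!err) return callback(null, result);
    if (!isRetryableError(err) || retriesLeft <= 0) return callback(err);
    // back off briefly, then let the driver do server selection again
    setTimeout(() => insertWithRetry(coll, doc, retriesLeft - 1, callback), 500);
  });
}

// usage: insertWithRetry(coll, { name: 'Fluffy' }, 3, done);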

Just to confirm, this means that as long as you’re connected to a replica set, the driver is capable of eventually getting “back on the air” even if the connection to all of the nodes is completely lost for a period of time? Even to the point where the TCP connections close?

Yes. If you have a spare afternoon (ha ha), you might want to peruse our Server Discovery and Monitoring specification https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst. The node driver presently implements most of this specification, and will maintain an active monitoring thread for each seed in a seedlist for the duration of your application. During this time, it will continuously update its internal knowledge of the topology, and use up-to-date information each time an operation is executed. This provides high availability, and because the isMaster responses from each node in the replica set contain knowledge about new and removed members, the internal state of the driver keeps up even with added nodes that were not in the initial seed list.
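
For the curious, the Node driver surfaces this monitoring through SDAM events; a small sketch (assuming a driver version with SDAM monitoring events; the URI is a placeholder):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27018,localhost:27019,localhost:27020');

// The driver emits SDAM events as its internal view of the topology changes.
client.on('serverDescriptionChanged', event => {
  console.log(`server ${event.address}: ${event.previousDescription.type} -> ${event.newDescription.type}`);
});

client.on('topologyDescriptionChanged', event => {
  console.log(`topology type is now ${event.newDescription.type}`);
});

client.connect(err => {
  if (err) throw err;
  // operations here use the driver's up-to-date view of the replica set
});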

That is, if we connect with this URI, the driver can figure it out and try other nodes for as long as it has to, and eventually even notice the original node is back:

mongodb://localhost:27018,localhost:27019,localhost:27020

(Or mongodb+srv, of course)

Unfortunately, mongos instances do not naturally monitor their topology so the plasticity described above does not apply to them. The initially provided seedlist, if they are mongos instances, will be the static list of monitored seeds in the driver for the duration of the application.

But with a single-server URI like this, we would have to use the autoreconnect option, or else reconnect ourselves:

mongodb://localhost:27017

A little more background on the work that is presently going into the driver. The mongo driver right now has a sort of broken concept of “connecting” to a topology (e.g. MongoClient.connect). What’s really going on when you provide your connection string is that the driver is parsing it, finding individual seeds, handshaking and starting monitors for each of them, building an internal view of the topology and creating a connection pool for each known node. When you execute an operation, it first does server selection based on your read preference (by default this is primary), selects that server, then requests one connection from the pool associated with the server and sends the request along. The new SDAM layer actually allows us to drop MongoClient.connect completely:

const client = new MongoClient('mongodb://localhost:27017');

client.db('foo').collection('bar').insertOne({ some: 'document' }, (err, res) => {
  // do something
});

And this is a purer form of what we were talking about above - there is no real concept of “connecting”. You are simply always asking to execute some operation, against some preference of server, and it’s up to the driver to figure it out for you.

Finally, I mentioned retryable writes above, but in the next two quarters we will be specifying and implementing retryable reads. This will boil down to the same thing you are likely to be implementing soon for Apostrophe (if that’s the path you choose), in that it will retry a read operation (only the initial operation, not a getMore) if some sort of “retryable error” occurs (network blip, not master, etc.).

Hope that helps


