MongoObserver possible race condition for multiple runs

Hi and thanks for your awesome work,

It seems MongoObserver has some race conditions when logging metrics.

Context:

  • We use a scheduler to run our experiments on a cluster
  • Each experiment runs in its own Docker container
  • Multiple experiments may be running at the same time (each in its own container) and accessing the same DB

The issue: We ran ~100 experiments (same code, just different hyperparams for the architecture). When looking at the results in Omniboard, some experiments seem to have overlapping metric plots: [image: exp_overlap]

while some are OK: [image: exp_nooverlap]

When digging a bit with pymongo, it appears that the runs with weird plots have an empty run['info'] dict:

{}

while the ones with 'ok' plots have a populated info dict:

{'metrics': [{'id': '5ca213657ed6d385b4b7c0aa', 'name': 'epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0ac', 'name': 'top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0ae', 'name': 'top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b0', 'name': 'val_epoch_loss'},
  {'id': '5ca213657ed6d385b4b7c0b2', 'name': 'val_top1_acc'},
  {'id': '5ca213657ed6d385b4b7c0b4', 'name': 'val_top5_acc'},
  {'id': '5ca213657ed6d385b4b7c0b6', 'name': 'lr'},
  {'id': '5ca216687ed6d385b4b7e1dd', 'name': 'test_top1_acc'},
  {'id': '5ca216687ed6d385b4b7e1df', 'name': 'test_top5_acc'}]}

  • The code is the same between all experiments
  • The issue appears to be totally random, so it is really hard to reproduce
  • The run['info'] dict is never modified in my code
  • I don't know if the run['info'] dict is erased at the end of training or if it is simply never created at the beginning of the experiment
  • The issue only appears when we run a lot of experiments in parallel
  • The weird thing is that all the experiments have different ids, so there is no 'overlap' strictly speaking
  • So the overlapping display may come from Omniboard, which uses random metrics for display (because how the hell does it find metrics when the fs ids are not present?!)
  • Still, the erasing of the metrics is an issue
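The pymongo check described above can be sketched roughly as follows. This is only an illustration: the helper name `runs_without_metric_refs` is made up here, the connection URL and database name in the usage comment are placeholders, and the collection is assumed to be the `runs` collection written by MongoObserver. The filtering is done in Python rather than in the query to keep the sketch easy to follow.

```python
def runs_without_metric_refs(runs):
    """Return the _id of every run whose info dict lists no metrics.

    `runs` is assumed to be a pymongo Collection holding the run
    documents (hypothetical helper, not part of Sacred).
    """
    missing = []
    # Only fetch the `info` field (plus _id, which is always returned).
    for doc in runs.find({}, {"info": 1}):
        info = doc.get("info") or {}
        # Covers both info == {} and info without a 'metrics' entry.
        if not info.get("metrics"):
            missing.append(doc["_id"])
    return missing


# Hypothetical usage against a live database (URL and db name are placeholders):
#   import pymongo
#   db = pymongo.MongoClient("mongodb://localhost:27017")["sacred"]
#   print(runs_without_metric_refs(db["runs"]))
```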

Possibly related issues: #309, #345, #317

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

F-Barto commented, May 7, 2019 (1 reaction)

Okay finally found it,

The person having the issue was using pymongo and not Omniboard to delete the runs. At the same time, he did not delete the corresponding documents in the metrics collection. Those leftover metrics documents kept their old ids, hence the overlap of ids on the metrics at some point.
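For anyone deleting runs by hand with pymongo, a rough sketch of a cleanup that removes the metrics documents together with the run could look like this. It assumes the metrics documents point at their run through a `run_id` field, as in the Sacred docs on metrics records linked in this thread; both helper names are hypothetical, and `runs` and `metrics` are assumed to be pymongo Collection objects.

```python
def delete_run_with_metrics(runs, metrics, run_id):
    """Delete one run and every metrics document that points at it.

    Returns (number of runs removed, number of metrics documents removed).
    """
    # Remove the metrics first so a failure cannot leave orphans behind
    # a run that has already disappeared.
    removed_metrics = metrics.delete_many({"run_id": run_id}).deleted_count
    removed_runs = runs.delete_one({"_id": run_id}).deleted_count
    return removed_runs, removed_metrics


def orphaned_metric_run_ids(runs, metrics):
    """Return the run ids referenced by metrics documents but absent from runs."""
    existing = set(runs.distinct("_id"))
    return set(metrics.distinct("run_id")) - existing
```

Running `orphaned_metric_run_ids` before starting new experiments shows whether earlier manual deletions left metrics documents behind.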

Still, the fact that run['info'] is empty when the metrics documents already exist is weird.

Thx for your help all

flukeskywalker commented, Apr 12, 2019 (0 reactions)

@F-Barto: I doubt that that’s the issue. The metrics don’t have much to do with the info. See here: https://sacred.readthedocs.io/en/latest/collected_information.html#metrics-records

You can directly inspect the metrics for the problematic runs using pymongo. If you still see the same issue, the problem is not coming from Omniboard, and it is likely an ID issue (make sure that you’re using the latest Sacred). If your problem is indeed coming from Omniboard, it is probably best to open an issue in the corresponding repo.
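One way to do that inspection, as a rough sketch: fetch the metrics documents of a run and flag those whose step counter goes backwards, which is what appended data from two different runs would look like. This assumes the metrics documents carry `run_id`, `name` and `steps` fields as shown in the docs linked above, and that the experiment logs each metric with increasing steps; the helper name is hypothetical and `metrics` is assumed to be a pymongo Collection.

```python
def metrics_with_restarting_steps(metrics, run_id):
    """Return the names of this run's metrics whose steps are not strictly increasing."""
    suspicious = []
    for doc in metrics.find({"run_id": run_id}):
        steps = doc.get("steps", [])
        # Compare each step with its successor; a non-increasing pair
        # suggests values from more than one run in the same document.
        if any(b <= a for a, b in zip(steps, steps[1:])):
            suspicious.append(doc["name"])
    return suspicious
```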

@vnmabus: Yes that’s been on my TODO list, and I’ll check soon.
