MongoObserver possible race condition for multiple runs
Hi and thanks for your awesome work,
It seems MongoObserver has some race conditions when logging metrics.
Context:
- We use a scheduler to run our experiments on a cluster
- Each experiment runs in its own Docker container
- Multiple experiments may run at the same time (each in its own container) and access the same DB
The issue:
We ran ~100 experiments (same code, just different hyperparams for the architecture). When looking at the results in Omniboard some experiments seem to have overlapped metrics plots:
while some are ok:
When digging a bit with pymongo, it appears that the runs with weird plots have an empty run['info'] dict:
{}
while the ones with 'ok plots' have a populated info dict:
{'metrics': [{'id': '5ca213657ed6d385b4b7c0aa', 'name': 'epoch_loss'},
{'id': '5ca213657ed6d385b4b7c0ac', 'name': 'top1_acc'},
{'id': '5ca213657ed6d385b4b7c0ae', 'name': 'top5_acc'},
{'id': '5ca213657ed6d385b4b7c0b0', 'name': 'val_epoch_loss'},
{'id': '5ca213657ed6d385b4b7c0b2', 'name': 'val_top1_acc'},
{'id': '5ca213657ed6d385b4b7c0b4', 'name': 'val_top5_acc'},
{'id': '5ca213657ed6d385b4b7c0b6', 'name': 'lr'},
{'id': '5ca216687ed6d385b4b7e1dd', 'name': 'test_top1_acc'},
{'id': '5ca216687ed6d385b4b7e1df', 'name': 'test_top5_acc'}]}
- The code is the same between all experiments
- The issue appears to be totally random, so really hard to reproduce
- The run['info'] dict is never modified in my code
- I don't know whether the run['info'] dict is erased at the end of training or whether it is simply never created at the beginning of the experiment
- The issue only appears when we run a lot of experiments in parallel
- The weird thing is that all the experiments have different ids, so there is no 'overlap' strictly speaking
- So the overlapping display may come from Omniboard, which seems to pick up unrelated metrics for display (because how the hell does it find metrics when the metric ids are not present in info?!)
- Still, the erasing of the metrics is an issue
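To see how many runs are affected, the check done by hand above can be scripted. This is only a minimal sketch: the helper names are made up for illustration, and the connection URI, the database name `sacred` and the collection name `runs` are assumptions that should be replaced with whatever your MongoObserver is configured to use.

```python
from typing import Any, Dict, Iterable, List


def runs_missing_metric_refs(runs: Iterable[Dict[str, Any]]) -> List[Any]:
    """Return the _id of every run whose info dict holds no metric references."""
    missing = []
    for run in runs:
        info = run.get("info") or {}  # treat a missing or None info like {}
        if not info.get("metrics"):
            missing.append(run.get("_id"))
    return missing


def load_runs(uri: str = "mongodb://localhost:27017", db_name: str = "sacred"):
    """Fetch run documents (URI, db name and collection name are assumptions)."""
    # Imported lazily so the pure-Python helper above works without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri)
    # Only project the fields needed for the check.
    return list(client[db_name]["runs"].find({}, {"info": 1, "status": 1}))


# In-memory documents shaped like the ones shown in this report.
sample_runs = [
    {"_id": 1, "info": {}},
    {"_id": 2, "info": {"metrics": [{"id": "5ca213657ed6d385b4b7c0aa",
                                     "name": "epoch_loss"}]}},
    {"_id": 3},
]
print(runs_missing_metric_refs(sample_runs))
```

Against a live database, `runs_missing_metric_refs(load_runs())` would list the runs worth looking at more closely.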
Issue Analytics
- State:
- Created 4 years ago
- Comments:5 (1 by maintainers)
Okay, finally found it.
The person having the issue was using pymongo, not Omniboard, to delete the runs, and did not delete the corresponding documents in the metrics collection. Hence the run ids of old metrics documents ended up overlapping with those of newer runs at some point.
Still, the fact that run['info'] is empty when the metrics documents already exist is weird.
Thanks for your help, all
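For anyone else who deleted runs by hand, here is a rough sketch for spotting metrics documents left behind. It assumes each document in the metrics collection carries a `run_id` field pointing at the `_id` of its run; the helper name and the sample data are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def orphaned_metric_ids(runs: Iterable[Dict[str, Any]],
                        metrics: Iterable[Dict[str, Any]]) -> List[Any]:
    """Return the _id of every metrics document whose run no longer exists."""
    run_ids = {run["_id"] for run in runs}
    return [m["_id"] for m in metrics if m.get("run_id") not in run_ids]


# Run 1 was deleted by hand, but its metrics document "m1" was left behind.
sample_runs = [{"_id": 2}, {"_id": 3}]
sample_metrics = [
    {"_id": "m1", "run_id": 1, "name": "epoch_loss"},
    {"_id": "m2", "run_id": 2, "name": "epoch_loss"},
]
print(orphaned_metric_ids(sample_runs, sample_metrics))

# With pymongo, assuming `db` is the Sacred database handle:
#   orphans = orphaned_metric_ids(db.runs.find({}, {"_id": 1}),
#                                 db.metrics.find({}, {"run_id": 1}))
#   db.metrics.delete_many({"_id": {"$in": orphans}})  # only after a backup
```

Note that this only finds leftovers whose run id has not been reused yet; once a new run takes the same id, the old documents look like they belong to it.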
@F-Barto: I doubt that that's the issue. The metrics don't have much to do with the info. See here: https://sacred.readthedocs.io/en/latest/collected_information.html#metrics-records
You can directly inspect the metrics for the problematic runs using pymongo. If you still see the same issue, the problem is not coming from Omniboard, and it is likely an ID issue (make sure that you're using the latest Sacred). If your problem is indeed coming from Omniboard, it is probably best to open an issue in the corresponding repo.
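One way to do that inspection is sketched below: for a single run, compare the metric ids referenced in its info dict with the metrics documents stored for its id. It assumes metrics documents have `run_id` and `name` fields and that info['metrics'] stores each id as a string, as in the dump shown in the report; the function name and the sample ids are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def compare_metrics(run: Dict[str, Any],
                    metrics: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Cross-check the metric references of one run against stored documents."""
    refs = (run.get("info") or {}).get("metrics", [])
    referenced = {ref["id"] for ref in refs}
    # Map stringified document id -> metric name for documents of this run.
    stored = {str(m["_id"]): m["name"]
              for m in metrics if m.get("run_id") == run["_id"]}
    return {
        # Stored for this run id but not listed in info (possibly stale).
        "unreferenced": sorted(n for i, n in stored.items() if i not in referenced),
        # Listed in info but with no matching document.
        "dangling": sorted(referenced - set(stored)),
    }


sample_run = {"_id": 7, "info": {"metrics": [{"id": "a1", "name": "epoch_loss"}]}}
sample_metrics = [
    {"_id": "a1", "run_id": 7, "name": "epoch_loss"},
    {"_id": "b2", "run_id": 7, "name": "epoch_loss"},  # leftover from an older run 7
    {"_id": "c3", "run_id": 8, "name": "lr"},
]
print(compare_metrics(sample_run, sample_metrics))
```

With pymongo, the second argument could be something like `db.metrics.find({"run_id": run["_id"]})`, where `db` is assumed to be the Sacred database handle.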
@vnmabus: Yes that’s been on my TODO list, and I’ll check soon.