How to scale an Aim repo to support simultaneous writes from multiple training runs
❓ Question
In order to compare multiple training runs side by side, I tried to make them write to the same Aim repo located on a shared EFS volume. However, when I do this, I see a huge number of error logs like the one below printed in some of those runs:
```
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
```
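For context, each training job opens the repo roughly like this; a minimal sketch assuming the standard Aim Python SDK (`aim.Run`) and the repo path visible in the traceback — the experiment name, hyperparameters, and metric values are illustrative:

```python
# Sketch of the write pattern: several training jobs, each in its own process
# or node, logging into the same Aim repo on a shared EFS mount.
from aim import Run

REPO_PATH = "/mnt/model_factory_pipeline_data/experiment_tracking/aim"

run = Run(repo=REPO_PATH, experiment="shared-repo-test")  # one Run per training job
run["hparams"] = {"activation": "relu", "lr": 1e-3}       # hypothetical hyperparameters

for step in range(1000):
    loss = 1.0 / (step + 1)  # placeholder for the real training loss
    run.track(loss, name="loss", step=step, context={"subset": "train"})
```

Running three or more such jobs at once is what triggers the `RocksIOError` above, raised by aimrocks while appending to a chunk `.log` file under `.aim/seqs/chunks/` on EFS.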
My questions are:
- Does an Aim repo currently support simultaneous writes from multiple training runs?
- If yes, how scalable is that? In my case I encountered these errors with just 3 simultaneous jobs.
- If not, is there a plan to support it?
- Could this be related to the fact that I’m using EFS to store the repo? If so, are there better alternatives?
Top GitHub Comments
To share more information:
I think what I can try on my end is to see whether switching to another EFS volume with more network bandwidth would help. At the same time, I also hope Aim can provide a robust way of writing experiment logs.
I’m running experiments for hyperparameter tuning, so all 7 runs are similar to the failed ones, differing only in some hyperparameter values (e.g. the activation function).
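Since the errors come from each process appending to RocksDB files directly on shared storage, one pattern worth trying is Aim's remote tracking server, which funnels all writes through a single process that owns the repo. A minimal sketch, assuming a recent Aim 3.x release with remote tracking available, the default port 53800, and a hypothetical server hostname:

```python
# Server side (one machine that owns the repo directory), started from a shell:
#   aim server --repo /mnt/model_factory_pipeline_data/experiment_tracking/aim
#
# Client side: each training job logs over the network instead of opening
# RocksDB files on EFS itself. The hostname below is hypothetical.
from aim import Run

run = Run(repo="aim://tracking-server.internal:53800", experiment="hparam-sweep")
run["hparams"] = {"activation": "gelu"}  # illustrative hyperparameter

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric value
    run.track(loss, name="loss", step=step)
```

Whether this removes the EFS bottleneck depends on where the server's repo lives; the idea is simply that only one process performs the RocksDB writes.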