
How to scale aim repo to support simultaneous writes from multiple training runs


❓Question

In order to compare multiple training runs side by side, I tried making them write to the same Aim repo located on a shared EFS. However, when I do this, I see a huge number of error logs like the one below printed in some of those runs:

Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
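For context on the error itself: "Bad file descriptor" is errno 9 (EBADF), meaning the OS rejected the write because the file descriptor was no longer valid — here, RocksDB's descriptor for its append-only log file. A minimal, stdlib-only reproduction of that failure mode (not Aim- or RocksDB-specific):

```python
import errno
import os

# Open a descriptor, invalidate it, then try to append — the same shape
# of failure RocksDB hits when its log-file descriptor goes stale.
r, w = os.pipe()
os.close(w)  # descriptor is now closed/invalid
try:
    os.write(w, b"log record")
except OSError as e:
    assert e.errno == errno.EBADF
    print(os.strerror(e.errno))  # "Bad file descriptor" on Linux
os.close(r)
```

On NFS-style filesystems such as EFS, a descriptor can become stale even without the process closing it, which is one way this error can surface under concurrent access.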

My questions are:

  • Does an Aim repo currently support simultaneous writes from multiple training runs?
  • If yes, how scalable is that? In my case I encountered these errors with just 3 simultaneous jobs.
  • If not, is there a plan to support it?
  • Could this be related to the fact that I’m using EFS to store the repo? If so, are there better alternatives?
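The setup in question boils down to several processes appending records under one shared directory. A stdlib-only sketch of that write pattern (not Aim code — Aim writes through RocksDB, but the concurrency shape is the same): on a local POSIX filesystem, `O_APPEND` writes from separate processes interleave safely, whereas NFS-style filesystems such as EFS do not guarantee atomic appends or full POSIX locking semantics, which embedded stores like RocksDB depend on.

```python
import os
import tempfile
from multiprocessing import Process

def writer(path, tag, count):
    """Append `count` records to a shared file, as a stand-in for one training run."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    for i in range(count):
        os.write(fd, f"{tag} step {i}\n".encode())
    os.close(fd)

if __name__ == "__main__":
    # Three "runs" appending concurrently to one file in a shared directory.
    log = os.path.join(tempfile.mkdtemp(), "000012.log")
    procs = [Process(target=writer, args=(log, f"run{r}", 100)) for r in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    with open(log) as f:
        print(len(f.readlines()))  # 300 on a local filesystem: no records lost
```

The same three writers pointed at an EFS mount get no such guarantee, which is consistent with the errors appearing only with multiple simultaneous jobs.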

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
jiyuanq commented, Aug 29, 2022

To share more information:

  1. The two jobs started encountering the issue at around the same time.
  2. While the training logs stopped updating at around minibatch 24,500, the Aim metrics were updated up to 40k minibatches, with a weird pattern (see the attached screenshot), so I guess training actually continued for a while and it’s just that no more logs were printed. [Screenshot: 2022-08-29, 11:13 AM]
  3. The EFS I’m using for Aim is also shared by other use cases, and I found that its network I/O usage can sometimes reach almost 100%, but I’m not sure how that may impact Aim writes.

What I can try on my end is switching to another EFS with more network bandwidth to see if that helps. At the same time, I also hope Aim can provide a robust way of writing experiment logs.

0 reactions
jiyuanq commented, Aug 30, 2022

Update: I switched to a new EFS with no other users and tried 7 runs today. 2 of those runs still hit the same issue… I’ll try the remote tracker as well.

@jiyuanq, were those two runs started from scratch, or are they the same ones that were failing before?

I’m running experiments for hyperparameter tuning, so all 7 runs are similar to the failed ones except for some hyperparameter values (e.g., the activation function).
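The "remote tracker" mentioned above refers to Aim's remote tracking server, which routes all writes through a single server process, so only one process ever opens the RocksDB repo — sidestepping the multi-writer problem on shared storage. A rough sketch of that setup, assuming a host name `tracker-host` (hypothetical) and Aim's default server port; consult the Aim remote tracking docs for the exact flags in your version:

```shell
# On one machine, ideally with the repo on local (non-NFS) disk:
aim server --repo /path/to/aim-repo --host 0.0.0.0 --port 53800

# In each training job, point the SDK at the server instead of the EFS path:
#   run = aim.Run(repo="aim://tracker-host:53800")
```

This trades shared-filesystem access for a network protocol, so EFS semantics no longer matter for the writers.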


