
How to scale aim repo to support simultaneous writes from multiple training runs


❓Question

In order to compare multiple training runs side by side, I tried making them write to the same Aim repo located on a shared EFS. However, when I do this, I see a huge number of error logs like the one below printed in some of those runs:

Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
Exception ignored in: 'aimrocks.lib_rocksdb.DB.write'
Traceback (most recent call last):
  File "src/aimrocks/lib_rocksdb.pyx", line 89, in aimrocks.lib_rocksdb.check_status
aimrocks.errors.RocksIOError: b'IO error: While appending to file: /mnt/model_factory_pipeline_data/experiment_tracking/aim/.aim/seqs/chunks/fa74e34e388248a68339f974/000012.log: Bad file descriptor'
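For context on the error itself: "Bad file descriptor" is errno 9 (EBADF), meaning the OS rejected the write because the file descriptor was no longer valid — here, RocksDB's descriptor for its append-only log file. A minimal, stdlib-only reproduction of that failure mode (not Aim- or RocksDB-specific):

```python
import errno
import os

# Open a descriptor, invalidate it, then try to append — the same shape
# of failure RocksDB hits when its log-file descriptor goes stale.
r, w = os.pipe()
os.close(w)  # descriptor is now closed/invalid
try:
    os.write(w, b"log record")
except OSError as e:
    assert e.errno == errno.EBADF
    print(os.strerror(e.errno))  # "Bad file descriptor" on Linux
os.close(r)
```

On NFS-style filesystems such as EFS, a descriptor can become stale even without the process closing it, which is one way this error can surface under concurrent access.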

My questions are:

  • Does an Aim repo currently support simultaneous writes from multiple training runs?
  • If yes, how scalable is that? In my case I encountered these errors with just 3 simultaneous jobs.
  • If not, is there a plan to support it?
  • Could this be related to the fact that I’m using EFS to store the repo? If so, are there better alternatives?
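The setup in question boils down to several processes appending records under one shared directory. A stdlib-only sketch of that write pattern (not Aim code — Aim writes through RocksDB, but the concurrency shape is the same): on a local POSIX filesystem, `O_APPEND` writes from separate processes interleave safely, whereas NFS-style filesystems such as EFS do not guarantee atomic appends or full POSIX locking semantics, which embedded stores like RocksDB depend on.

```python
import os
import tempfile
from multiprocessing import Process

def writer(path, tag, count):
    """Append `count` records to a shared file, as a stand-in for one training run."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    for i in range(count):
        os.write(fd, f"{tag} step {i}\n".encode())
    os.close(fd)

if __name__ == "__main__":
    # Three "runs" appending concurrently to one file in a shared directory.
    log = os.path.join(tempfile.mkdtemp(), "000012.log")
    procs = [Process(target=writer, args=(log, f"run{r}", 100)) for r in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    with open(log) as f:
        print(len(f.readlines()))  # 300 on a local filesystem: no records lost
```

The same three writers pointed at an EFS mount get no such guarantee, which is consistent with the errors appearing only with multiple simultaneous jobs.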

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
jiyuanq commented, Aug 29, 2022

To share more information:

  1. The two jobs started encountering the issue at around the same time.
  2. While the training logs stopped updating at around minibatch 24,500, the Aim metrics were updated up to 40k minibatches, with a weird pattern (see the attached screenshot), so I guess training actually continued for a while and it’s just that no more logs were printed. [Screenshot: 2022-08-29, 11:13 AM]
  3. The EFS I’m using for Aim is also shared by other use cases, and I found that its network I/O usage can sometimes reach almost 100%, but I’m not sure how that may impact Aim writes.

What I can try on my end is switching to another EFS with more network bandwidth to see if that helps. At the same time, I also hope Aim can provide a robust way of writing experiment logs.

0 reactions
jiyuanq commented, Aug 30, 2022

Update: I switched to a new EFS with no other users and tried 7 runs today. 2 of those runs still hit the same issue… I’ll try the remote tracker as well.

@jiyuanq, were those two runs started from scratch, or are they the same ones that were failing before?

I’m running experiments for hyperparameter tuning, so all 7 runs are similar to the failed ones except for some hyperparameter values (e.g., the activation function).
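The "remote tracker" mentioned above refers to Aim's remote tracking server, which routes all writes through a single server process, so only one process ever opens the RocksDB repo — sidestepping the multi-writer problem on shared storage. A rough sketch of that setup, assuming a host name `tracker-host` (hypothetical) and Aim's default server port; consult the Aim remote tracking docs for the exact flags in your version:

```shell
# On one machine, ideally with the repo on local (non-NFS) disk:
aim server --repo /path/to/aim-repo --host 0.0.0.0 --port 53800

# In each training job, point the SDK at the server instead of the EFS path:
#   run = aim.Run(repo="aim://tracker-host:53800")
```

This trades shared-filesystem access for a network protocol, so EFS semantics no longer matter for the writers.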


