Remote read performance improvement for distributed training over millions of files
Summary
Use the NVIDIA Data Loading Library (DALI, https://developer.nvidia.com/dali) for distributed data loading. DALI provides built-in data loaders and data iterators, and it supports overlapping training with data pre-processing to reduce latency and training time. It can also be used as a benchmark to test data loading speed from Alluxio storage in AI workloads.
Workloads
- Launch AWS EKS cluster
EKS cluster, 1 master and 6 workers, r5.8xlarge (32 vCPU, 256 GB memory)
512 GB gp2 SSD for each worker node
- Launch Alluxio cluster
All workers use SSD-only storage with worker-embedded FUSE.
The root UFS is an S3 bucket holding the original ImageNet data, slightly reorganized using the script https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
1000 folders × 1300 images/folder × ~105 KB per file on average (ranging from 30 KB to 200 KB), about 140 GB of data in total (see the sanity-check sketch after this list)
- Use Arena (https://github.com/kubeflow/arena) to launch the data loading script on four of the six nodes running Alluxio workers
Arena is a command-line interface that lets data scientists easily run and monitor machine learning training jobs and check their results.
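As a quick sanity check of the dataset layout above, the following small Python sketch (not part of the original benchmark; the FUSE path is an assumption taken from the arena command below) walks the dataset through the Alluxio FUSE mount and reports the folder count, file count, and average file size. The arithmetic also checks out: 1000 × 1300 × ~105 KB ≈ 137 GB, consistent with the ~140 GB total.

import os

# Assumed Alluxio FUSE mount point, taken from the arena command further below.
DATA_DIR = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

folders = [d for d in os.listdir(DATA_DIR)
           if os.path.isdir(os.path.join(DATA_DIR, d))]
total_files = 0
total_bytes = 0
for d in folders:
    class_dir = os.path.join(DATA_DIR, d)
    for name in os.listdir(class_dir):
        total_files += 1
        total_bytes += os.path.getsize(os.path.join(class_dir, name))

print(f"{len(folders)} folders, {total_files} files, "
      f"avg {total_bytes / max(total_files, 1) / 1024:.0f} KB/file, "
      f"total {total_bytes / 1024**3:.1f} GiB")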
Data loading script
Our benchmarking code is modified from the NVIDIA DALI example script in the DALI tutorial "ImageNet Training in PyTorch". The original script reads the original ImageNet dataset, performs data loading, preprocessing, and data iteration with DALI, and trains PyTorch ResNet models.
Our modifications include:
- Remove the training logic from the script, keeping only the DALI data loader logic.
- Change DALI to use CPU only instead of GPU or mixed devices.
- Support multi-process data loading on each node.
Our script can:
- Run the DALI data loader from multiple nodes, with multiple data loading threads on each node
- Record the number of images loaded and the data loading time on each node, and calculate images/s/node (see the sketch after this list)
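For reference, here is a minimal sketch of what such a CPU-only, sharded DALI loader looks like. It is not the actual benchmark script (that lives in the repository passed to arena below); the mount path, shard counts, and batch count are illustrative assumptions.

import time
from nvidia.dali import pipeline_def, fn

# Assumed Alluxio FUSE mount point (matches the path passed to the script below).
DATA_DIR = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

@pipeline_def
def cpu_loader(shard_id, num_shards):
    # The file reader shards the dataset so each node/process reads a disjoint slice.
    jpegs, labels = fn.readers.file(
        file_root=DATA_DIR,
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        name="Reader",
    )
    # CPU-only decode and resize; no GPU or "mixed" devices are used.
    images = fn.decoders.image(jpegs, device="cpu")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

def measure(shard_id=0, num_shards=4, batch_size=256, num_threads=4, num_batches=100):
    # device_id=None builds a CPU-only pipeline.
    pipe = cpu_loader(
        batch_size=batch_size,
        num_threads=num_threads,
        device_id=None,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    pipe.build()

    images_seen = 0
    start = time.time()
    for _ in range(num_batches):
        images, labels = pipe.run()  # TensorListCPU outputs
        images_seen += len(images)
    elapsed = time.time() - start
    print(f"shard {shard_id}: {images_seen} images in {elapsed:.1f}s "
          f"({images_seen / elapsed:.0f} img/s)")

if __name__ == "__main__":
    measure()

In the benchmark described here, a loader like this would run in each data loading process, with the per-process counts aggregated into img/s per node.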
Launch arena job
arena --loglevel info submit pytorch \
--name=test-job \
--gpus=0 \
--workers=4 \
--cpu 4 \
--memory 32G \
--selector alluxio-master=false \
--image=nvcr.io/nvidia/pytorch:21.05-py3 \
--data-dir=/alluxio/ \
--sync-mode=git \
--sync-source=https://github.com/LuQQiu/DALI.git \
"python /root/code/DALI/docs/examples/use_cases/pytorch/resnet50/main.py \
--process 4 \
--batch-size 256 \
--print-freq 1000 \
/alluxio/alluxio-mountpoint/alluxio-fuse/dali"
$ arena list test-job ok
NAME STATUS TRAINER AGE NODE
test-job SUCCEEDED PYTORCHJOB 33m N/A
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-job-master-0 0/1 Completed 0 34m
test-job-worker-0 0/1 Completed 0 34m
$ kubectl logs test-job-master-0
$ kubectl logs test-job-worker-0
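Since no log output is reproduced above, the following is a hypothetical post-processing sketch for aggregating per-node results from the pod logs. The log format is an assumption (the script's actual output format is not shown in this issue); it assumes each worker prints a final summary line such as "LOADED <images> IMAGES IN <seconds> SECONDS".

import re
import subprocess

# Hypothetical: pod names taken from the `kubectl get pods` output above; extend to all workers.
PODS = ["test-job-master-0", "test-job-worker-0"]
# Hypothetical summary line printed by each worker at the end of the run.
PATTERN = re.compile(r"LOADED (\d+) IMAGES IN ([\d.]+) SECONDS")

for pod in PODS:
    log = subprocess.run(["kubectl", "logs", pod],
                         capture_output=True, text=True).stdout
    match = PATTERN.search(log)
    if match:
        images, seconds = int(match.group(1)), float(match.group(2))
        print(f"{pod}: {images / seconds:.0f} img/s")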
Top GitHub Comments
Experiment 6
Question 1: Why is remote read the performance bottleneck, and how can remote read performance be improved?
Question 2: What differences between the DALI environment and the StressBench environment could account for the difference?
Hypotheses from observation:
Experiment
Results: Both hypothesized causes are real performance bottlenecks. With the combined fixes, throughput increases from 2k img/s to more than 13k img/s.
alluxio.user.file.readtype.default=NO_CACHE
An async cache request will still be issued when reading data from a remote worker, but because this is single-tier storage with enough space, the async cache manager will not do anything.

Could you provide the detailed Alluxio configuration for these experiments? We do want to reproduce the results in our test bed.