Remote read performance improvement for distributed training over millions of files
Summary
Use the NVIDIA Data Loading Library (DALI, https://developer.nvidia.com/dali) for distributed data loading. DALI provides built-in data loaders and data iterators, and it supports overlapping training with data pre-processing to reduce latency and training time. It can also be used as a benchmark to test data loading speed from Alluxio storage in AI workloads.
Workloads
- Launch AWS EKS cluster
EKS cluster, 1 master and 6 workers, r5.8xlarge (32 vCPU, 256 GB memory)
512 GB gp2 SSD for each worker node
- Launch Alluxio cluster
All workers use SSD-only storage with worker-embedded FUSE.
The root UFS is an S3 bucket holding the original ImageNet data, slightly reorganized using the script https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
1000 folders × 1300 images/folder × ~105 KB per file on average (ranging from 30 KB to 200 KB), about 140 GB of data in total (see the sanity-check sketch after this list)
- Use Arena (https://github.com/kubeflow/arena) to launch the data loading script on four of the six nodes running Alluxio workers
Arena is a command-line interface that lets data scientists easily run and monitor machine learning training jobs and check their results.
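As a quick sanity check of the dataset layout above, the following small Python sketch (not part of the original benchmark; the FUSE path is an assumption taken from the arena command below) walks the dataset through the Alluxio FUSE mount and reports the folder count, file count, and average file size. The arithmetic also checks out: 1000 × 1300 × ~105 KB ≈ 137 GB, consistent with the ~140 GB total.

import os

# Assumed Alluxio FUSE mount point, taken from the arena command further below.
DATA_DIR = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

folders = [d for d in os.listdir(DATA_DIR)
           if os.path.isdir(os.path.join(DATA_DIR, d))]
total_files = 0
total_bytes = 0
for d in folders:
    class_dir = os.path.join(DATA_DIR, d)
    for name in os.listdir(class_dir):
        total_files += 1
        total_bytes += os.path.getsize(os.path.join(class_dir, name))

print(f"{len(folders)} folders, {total_files} files, "
      f"avg {total_bytes / max(total_files, 1) / 1024:.0f} KB/file, "
      f"total {total_bytes / 1024**3:.1f} GiB")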
Data loading script
Our benchmarking code is modified from the NVIDIA DALI example script in the DALI tutorial "ImageNet Training in PyTorch". The original script reads the original ImageNet dataset, performs data loading, preprocessing, and data iteration with DALI, and trains PyTorch ResNet models.
Our modifications include:
- Remove the training logic from the script, keeping only the DALI data loader logic.
- Change DALI to use CPU only instead of GPU or mixed devices.
- Support multi-process data loading on each node.
Our script can:
- Run the DALI data loader from multiple nodes, with multiple data loading threads on each node
- Record the number of images loaded and the data loading time on each node, and calculate images/s/node (see the sketch after this list)
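For reference, here is a minimal sketch of what such a CPU-only, sharded DALI loader looks like. It is not the actual benchmark script (that lives in the repository passed to arena below); the mount path, shard counts, and batch count are illustrative assumptions.

import time
from nvidia.dali import pipeline_def, fn

# Assumed Alluxio FUSE mount point (matches the path passed to the script below).
DATA_DIR = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

@pipeline_def
def cpu_loader(shard_id, num_shards):
    # The file reader shards the dataset so each node/process reads a disjoint slice.
    jpegs, labels = fn.readers.file(
        file_root=DATA_DIR,
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        name="Reader",
    )
    # CPU-only decode and resize; no GPU or "mixed" devices are used.
    images = fn.decoders.image(jpegs, device="cpu")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

def measure(shard_id=0, num_shards=4, batch_size=256, num_threads=4, num_batches=100):
    # device_id=None builds a CPU-only pipeline.
    pipe = cpu_loader(
        batch_size=batch_size,
        num_threads=num_threads,
        device_id=None,
        shard_id=shard_id,
        num_shards=num_shards,
    )
    pipe.build()

    images_seen = 0
    start = time.time()
    for _ in range(num_batches):
        images, labels = pipe.run()  # TensorListCPU outputs
        images_seen += len(images)
    elapsed = time.time() - start
    print(f"shard {shard_id}: {images_seen} images in {elapsed:.1f}s "
          f"({images_seen / elapsed:.0f} img/s)")

if __name__ == "__main__":
    measure()

In the benchmark described here, a loader like this would run in each data loading process, with the per-process counts aggregated into img/s per node.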
Launch arena job
arena --loglevel info submit pytorch \
--name=test-job \
--gpus=0 \
--workers=4 \
--cpu 4 \
--memory 32G \
--selector alluxio-master=false \
--image=nvcr.io/nvidia/pytorch:21.05-py3 \
--data-dir=/alluxio/ \
--sync-mode=git \
--sync-source=https://github.com/LuQQiu/DALI.git \
"python /root/code/DALI/docs/examples/use_cases/pytorch/resnet50/main.py \
--process 4 \
--batch-size 256 \
--print-freq 1000 \
/alluxio/alluxio-mountpoint/alluxio-fuse/dali"
$ arena list test-job ok
NAME STATUS TRAINER AGE NODE
test-job SUCCEEDED PYTORCHJOB 33m N/A
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-job-master-0 0/1 Completed 0 34m
test-job-worker-0 0/1 Completed 0 34m
$ kubectl logs test-job-master-0
$ kubectl logs test-job-worker-0
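Since no log output is reproduced above, the following is a hypothetical post-processing sketch for aggregating per-node results from the pod logs. The log format is an assumption (the script's actual output format is not shown in this issue); it assumes each worker prints a final summary line such as "LOADED <images> IMAGES IN <seconds> SECONDS".

import re
import subprocess

# Hypothetical: pod names taken from the `kubectl get pods` output above; extend to all workers.
PODS = ["test-job-master-0", "test-job-worker-0"]
# Hypothetical summary line printed by each worker at the end of the run.
PATTERN = re.compile(r"LOADED (\d+) IMAGES IN ([\d.]+) SECONDS")

for pod in PODS:
    log = subprocess.run(["kubectl", "logs", pod],
                         capture_output=True, text=True).stdout
    match = PATTERN.search(log)
    if match:
        images, seconds = int(match.group(1)), float(match.group(2))
        print(f"{pod}: {images / seconds:.0f} img/s")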
Top GitHub Comments
Experiment 6
Question 1: Why is remote read the performance bottleneck, and how can remote read performance be improved?
Question 2: What differences between the DALI environment and the StressBench environment could account for the difference?
Hypotheses from observation:
Experiment
Results: Both hypothesized causes are real performance bottlenecks. With the combined fixes, throughput increases from 2k img/s to more than 13k img/s.
alluxio.user.file.readtype.default=NO_CACHE
An async cache request will still be issued when reading data from a remote worker, but because this is single-tier storage with enough space, the async cache manager will not do anything.

Could you provide the detailed Alluxio configuration for these experiments? We do want to reproduce the results in our test bed.