Remote read performance improvement in distributed training of millions of files

Summary: Use the NVIDIA Data Loading Library (DALI, https://developer.nvidia.com/dali) for distributed data loading. DALI provides built-in data loaders and data iterators, and supports overlapping training with data pre-processing to reduce latency and training time. It can be used as a benchmark to test the data loading speed from Alluxio storage in AI workloads.
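
For concreteness, here is a minimal sketch of the kind of CPU-only DALI pipeline the benchmark builds, assuming the DALI 1.x functional API (fn.readers.file, fn.decoders.image); the mount path, shard numbers, and output names are illustrative and not taken from the actual script.

# Minimal sketch of a CPU-only DALI data-loading pipeline (assumes DALI 1.x).
# Paths, shard counts, and output names are illustrative.
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def cpu_pipeline(data_dir, shard_id, num_shards):
    # The file reader shards the dataset across nodes/processes.
    jpegs, labels = fn.readers.file(
        file_root=data_dir,
        shard_id=shard_id,
        num_shards=num_shards,
        random_shuffle=True,
        name="Reader",
    )
    # Decode and resize on the CPU only (no GPU or MIXED devices).
    images = fn.decoders.image(jpegs, device="cpu")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = cpu_pipeline(
    data_dir="/alluxio/alluxio-mountpoint/alluxio-fuse/dali",  # Alluxio FUSE path
    shard_id=0,
    num_shards=4,
    batch_size=256,
    num_threads=4,
    device_id=None,  # None keeps the pipeline off the GPU entirely
)
pipe.build()

it = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")
for _ in it:
    pass  # the benchmark counts images and measures elapsed time instead of training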

Workloads

  • Launch AWS EKS cluster
EKS cluster with 1 master and 6 workers, r5.8xlarge (32 vCPU, 256GB memory)
512GB gp2 SSD for worker nodes
  • Launch Alluxio cluster
All workers are SSD-only, with worker-embedded FUSE.
The root UFS is an S3 bucket holding the original ImageNet data, slightly reorganized with the script https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
1000 folders * 1300 images/folder * ~105KB per file on average (ranging from 30KB to 200KB), about 140GB in total (a quick layout check is sketched after this list)
  • Use Arena, a command-line interface that lets data scientists run and monitor machine learning training jobs and check their results easily
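
As a quick sanity check of that layout, a small Python walk over the FUSE mount can confirm the folder count, file count, and average file size; the mount path below is the one used in the arena command later on this page, and the expected values in the comments come from the description above.

# Rough sanity check of the ImageNet layout under the Alluxio FUSE mount.
# The path is the one used in the arena command below; adjust as needed.
import os

root = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

folders = [d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))]
total_files = 0
total_bytes = 0
for d in folders:
    for entry in os.scandir(os.path.join(root, d)):
        if entry.is_file():
            total_files += 1
            total_bytes += entry.stat().st_size

print(f"class folders : {len(folders)}")                                      # expected ~1000
print(f"image files   : {total_files}")                                       # expected ~1000 * 1300
print(f"avg file size : {total_bytes / max(total_files, 1) / 1024:.1f} KB")   # expected ~105 KB
print(f"total size    : {total_bytes / 1024**3:.1f} GB")                      # expected ~140 GB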

Data loading script: Our benchmarking code is modified from the NVIDIA DALI example script in the DALI tutorial "ImageNet Training in PyTorch". The original script reads the original ImageNet dataset, performs data loading, pre-processing, and iteration with DALI, and trains PyTorch ResNet models.

Our modifications include

  • Remove the training logic from the script and keep only the DALI data loader logic.
  • Change DALI to use the CPU only instead of the GPU or MIXED devices.
  • Support multi-process data loading on each node.

Our script can:

  • Run the DALI data loader from multiple nodes, with multiple data loading threads on each node.
  • Record the number of images loaded and the data loading time for each node, and calculate images/s/node (a rough sketch of this bookkeeping follows below).
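
Here is a rough sketch of the per-node throughput accounting, assuming one loader process per shard; the real script drives a DALI iterator rather than reading raw bytes, and the path and function names here are illustrative.

# Sketch of per-node throughput accounting with multi-process data loading.
# Illustrative only: the real benchmark iterates a DALI pipeline per process;
# here each process simply reads its shard of files to show the bookkeeping.
import glob
import multiprocessing as mp
import time

DATA_DIR = "/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

def load_shard(shard_id, num_shards, queue):
    # Each process reads a disjoint shard of the image files.
    files = sorted(glob.glob(f"{DATA_DIR}/*/*"))[shard_id::num_shards]
    start = time.time()
    count = 0
    for path in files:
        with open(path, "rb") as f:   # open()-read()-close() per image
            f.read()
        count += 1
    queue.put((count, time.time() - start))

if __name__ == "__main__":
    num_procs = 4                      # matches --process 4 in the arena command below
    queue = mp.Queue()
    procs = [mp.Process(target=load_shard, args=(i, num_procs, queue))
             for i in range(num_procs)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in range(num_procs)]
    for p in procs:
        p.join()

    images = sum(c for c, _ in results)
    elapsed = max(t for _, t in results)   # wall-clock time of the slowest process
    print(f"{images} images in {elapsed:.1f}s -> {images / elapsed:.0f} img/s/node")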

Launch arena job

arena --loglevel info submit pytorch \
--name=test-job \
--gpus=0 \
--workers=4 \
--cpu 4 \
--memory 32G \
--selector alluxio-master=false \
--image=nvcr.io/nvidia/pytorch:21.05-py3 \
--data-dir=/alluxio/ \
--sync-mode=git \
--sync-source=https://github.com/LuQQiu/DALI.git \
"python /root/code/DALI/docs/examples/use_cases/pytorch/resnet50/main.py \
--process 4 \
--batch-size 256 \
--print-freq 1000 \
/alluxio/alluxio-mountpoint/alluxio-fuse/dali"

$ arena list test-job
NAME      STATUS     TRAINER     AGE  NODE
test-job  SUCCEEDED  PYTORCHJOB  33m  N/A
$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-job-master-0   0/1     Completed   0          34m
test-job-worker-0   0/1     Completed   0          34m
$ kubectl logs test-job-master-0
$ kubectl logs test-job-worker-0

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

3 reactions
LuQQiu commented, Sep 6, 2021

Experiment 6. Question 1: Why is remote read the performance bottleneck, and how can remote read performance be improved? Question 2: What differences between the DALI environment and the Stressbench environment could explain the gap?

Hypotheses from observation:

  • CPU resources. The DALI env worker only has 1 CPU (configured by the kubectl YAML scripts), while the Stressbench env worker can use up to 16 CPUs.
  • ML read pattern. Stressbench does open()-read()-close(), with only 4 files open() at the same time, while in the ML env more than 1500 files are open at the same time (illustrated below).
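
To make the contrast concrete, here is a toy Python illustration of the two read patterns, under the assumption that each file held open on the FUSE mount keeps a gRPC read stream to a worker alive; the path and chunk size are illustrative.

# Toy illustration of the two access patterns compared above.
import glob

FILES = sorted(glob.glob("/alluxio/alluxio-mountpoint/alluxio-fuse/dali/*/*"))

def sequential_read(paths):
    # Stressbench-style: open()-read()-close(), only a handful of files
    # (and hence short-lived remote read streams) open at any moment.
    for path in paths:
        with open(path, "rb") as f:
            f.read()

def interleaved_read(paths, open_ahead=1500, chunk=64 * 1024):
    # ML/DALI-style: many files are opened up front and read a chunk at a time,
    # so well over a thousand handles (and their remote read streams) stay open
    # at once; this may require a raised `ulimit -n`.
    remaining = [open(p, "rb") for p in paths[:open_ahead]]
    while remaining:
        still_open = []
        for f in remaining:
            if f.read(chunk):       # read one more chunk from each open file
                still_open.append(f)
            else:
                f.close()           # EOF: close the handle and its stream
        remaining = still_open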

Experiment

  • Enlarge CPU resources
  • Close gRPC stream immediately after reading all the data

Results: Both hypothesized factors turned out to be performance bottlenecks. With the combined changes, throughput increased from 2k img/s to more than 13k img/s.

  • Enlarge CPU resources. CPU resources needed for training:

Node CPU | Data Loading Processes | Data Loading CPU | Arena Memory (GB) | Alluxio Worker CPU | Alluxio Worker Memory (GB) | Throughput
32       | 4                      | 4                | 32                | 3                  | 16                         | 4723 img/s
32       | 8                      | 10               | 32                | 4                  | 16                         | 9425 img/s
32       | 12                     | 12               | 32                | 6                  | 16                         | 12449 img/s
32       | 16                     | 20               | 32                | 10                 | 32                         | 13725 img/s
  • Close the gRPC stream immediately after reading all the data belonging to a file: https://github.com/Alluxio/alluxio/issues/14020
  • Small improvement: avoid unneeded async cache RPCs by setting alluxio.user.file.readtype.default=NO_CACHE. Async cache requests are issued when reading data from a remote worker, but because this is single-tier storage with enough space, the async cache manager does not do anything useful.
1 reaction
Binyang2014 commented, Sep 7, 2021

Can you provide the detailed Alluxio configuration for these experiments? We want to reproduce the results in our test bed.
