[doc] profiling NVMe and configuring `aio` param section
Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the `aio` param section.
The following NVMe benchmark measures the end-to-end speed of CPU<->NVMe reads and writes, so make sure to run it on the actual system you intend to use.
For this demonstration we are going to use:
- XPG Gammix S11 Pro 2TB NVMe drive
- Intel® Core™ i7-8700 CPU @ 3.20GHz setup.
1. Preparation
```
cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
```
You may also have to install libaio-dev if the DeepSpeed NVMe driver fails to build. On Ubuntu it's just:

```
apt install libaio-dev
```
Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
Important: make sure you’re not doing any other I/O on the device you’re testing or you will get incorrect results.
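Before running the sweeps, you can sanity-check that DeepSpeed's async I/O extension builds on your machine. A minimal sketch, assuming a recent DeepSpeed release where `AsyncIOBuilder` is exposed under `deepspeed.ops.op_builder` (the `ds_report` command-line tool gives similar information):

```python
# Sanity check: can DeepSpeed's async_io op build on this machine?
# Assumes AsyncIOBuilder lives under deepspeed.ops.op_builder
# (true for recent releases).
from deepspeed.ops.op_builder import AsyncIOBuilder

if AsyncIOBuilder().is_compatible():
    print("async_io op is compatible -- libaio found")
else:
    print("async_io op NOT compatible -- try: apt install libaio-dev")
```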
2. Run Read Benchmark
```
cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1
```
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the csrc/aio/py_test folder to your NVMe drive and run the test there. You can, of course, also use it to test non-NVMe drives (e.g., SATA SSDs).
The tail of the list should show the fastest speeds.
Here is the best result for the read benchmark:

```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```
3. Run Write Benchmark
```
# cd csrc/aio/py_test
mkdir write-test-data
mkdir write-logs
./run_write_sweep.sh 400 write-test-data write-logs
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -1
```
Here is the best result for the write benchmark:

```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```
4. Contribute your data
We need more read/write data for various devices to figure out how to automate the configuration process.
If you’re contributing your data, please post:
- Your NVMe device name/size
- advertised max read/write spec (google: “device name spec”)
- the last 10 lines of results, i.e.:

```
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
```
Important: please make sure not to do any other I/O on the device under benchmark.
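If it helps, here is a small convenience sketch that collects both top-10 lists in one go. It just wraps the two commands above in Python; run it from csrc/aio/py_test after both sweeps have finished:

```python
# Convenience wrapper around the two parse commands above.
# Run from csrc/aio/py_test after both sweeps have finished.
import subprocess

for metric, logdir, sort_key in [("read_speed", "read-logs", 9),
                                 ("write_speed", "write-logs", 10)]:
    cmd = (f"python parse_aio_stats.py --logdir {logdir}/aio_perf_sweep "
           f"--metric {metric} | sort -k{sort_key} -n | tail -10")
    print(f"=== top 10 by {metric} ===")
    subprocess.run(cmd, shell=True, check=True)
```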
5. Derive the `aio` params block

Now we need to figure out how to use the results of the benchmark to configure `aio`.
Here is the final result:

```
"aio": {
    "block_size": 262144,
    "queue_depth": 32,
    "thread_count": 1,
    "single_submit": false,
    "overlap_events": true
}
```
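For context, this block sits at the top level of the DeepSpeed config, next to the ZeRO-Infinity offload settings. A sketch of the surrounding config as a Python dict (`/local_nvme` is a placeholder path; point it at the drive you actually benchmarked):

```python
# Sketch: where the tuned "aio" block fits in a ZeRO-Infinity config.
# "/local_nvme" is a placeholder; use the drive you benchmarked.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}
```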
Most of the values in this config block come from the benchmark's best results for read and write, i.e., whichever configuration gives the highest GB/s throughput (the higher the number, the better).
The schema of each result line is as follows:

- read: `read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size` = GB/sec
- write: same as read, plus the 2nd column is the size of the written data.
The best read config was:

```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```

which corresponds to `single_submit=false`, `overlap_events=true`, `queue_depth=32`, `block_size=262144`.

Use `single_submit=true` if the 2nd column is `single` instead of `block`, and `overlap_events=false` if the 3rd column is `sequential` instead of `overlap`.
The best write config was:

```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```

which corresponds to: `single_submit=false`, `overlap_events=true`, `queue_depth=32`, `block_size=262144`.
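In code form, the mapping from a parsed result line to the config values looks roughly like this (a sketch; `result_to_aio_config` is a hypothetical helper, not part of DeepSpeed, and write lines carry an extra size column in the 2nd position):

```python
# Hypothetical helper: translate one parsed read-result tuple into the
# "aio" config values, following the column schema described above.
def result_to_aio_config(result):
    _op, submit, overlap, _procs, _parallelism, queue_depth, block_size = result
    return {
        "single_submit": submit == "single",
        "overlap_events": overlap == "overlap",
        "queue_depth": queue_depth,
        "block_size": block_size,
        "thread_count": 1,  # per-rank default, see the note below
    }

best_read = ("read", "block", "overlap", 1, 1, 32, 262144)
print(result_to_aio_config(best_read))
# {'single_submit': False, 'overlap_events': True, 'queue_depth': 32, ...}
```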
Unfortunately, users don’t currently have the ability to have separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. In our testing, `block_size=1M` consistently seemed optimal across two clusters, but on this particular setup `block_size=256K` appears to be optimal.
Finally, the last remaining config value, `thread_count=1`, is a reasonable default, since this is a per-rank configuration.
TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmark results don't agree.
Sample stats: for the XPG Gammix S11 Pro 2TB NVMe drive, with published specs of:
- max read speed of up to 3500 MB/s
- max write speed of up to 3000 MB/s
The benchmark records throughput for ~400 different configuration combinations:

- read between 1.0-3.17 GB/s
- write between 1.2-2.59 GB/s

So now we can choose a single configuration that will lead to the highest throughput for both read and write.
I tried my 860 Evo SSD and got ~0.5 GB/s read throughput, so about ~6x slower.
TODO/Questions to @tjruwase:

- [ ] We have a huge range of numbers - e.g., for read, 1 to 3 GB/s - so I suppose this is the effective range depending on the kind of task, and both the low and the high should be considered. But how does this correlate to training? Which of the ~400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and max (see the sketch after this list)?
- [ ] What are the good numbers? So that users will know if their NVMe is fast enough? I'm thinking the numbers from the paper?
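As a starting point for that first TODO, here is a sketch that reduces the sweep output to a min/max summary, assuming `parse_aio_stats.py` prints one `(config tuple) = GB/s` line per configuration, as in the examples above:

```python
# Sketch for the min/max idea above: summarize the ~400 sweep lines.
# Assumes parse_aio_stats.py prints "(config tuple) = GB/s" per line.
import subprocess

out = subprocess.run(
    "python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed",
    shell=True, check=True, capture_output=True, text=True,
).stdout

speeds = [float(line.rsplit("=", 1)[1]) for line in out.splitlines() if "=" in line]
print(f"read: min {min(speeds):.2f} GB/s, max {max(speeds):.2f} GB/s "
      f"across {len(speeds)} configurations")
```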
Top GitHub Comments
I added “4. Contribute your data” instructions to the OP - let’s see if we can get some contributions.
I made a call to community inviting to contribute: https://discuss.huggingface.co/t/deepspeed-zero-infinity-looking-for-nvme-device-benchmarks/5787
Awesome, I will get to work with the modifications. Thanks!