'az storage blob download-batch' extremely slow (40x slower than azcopy)
az feedback auto-generates most of the information requested below, as of CLI version 2.0.62.
Describe the bug
I have been using the AzureCLI pipeline task to download files from a Data Lake with the az tool because the AzureFileCopy task isn't available on Linux (only on Windows).
Because az storage blob download-batch does not accept multiple patterns, I need multiple invocations, which slows the download down further due to https://github.com/Azure/azure-cli/issues/9444.
Here is my full example of using the az tool in my pipeline:
parameters:
  - name: file_patterns
    type: object
    default:
      - rf/*c_*b.nc
      - calibration/left.nc
      - calibration/right.nc
      - phase_two/*/left.nc
      - phase_two/*/right.nc

jobs:
  - job: "download files"
    pool: my-vmss-agents-pool
    steps:
      - ${{ each file_pattern in parameters.file_patterns }}:
        - task: AzureCLI@1
          displayName: Copy ${{ file_pattern }} from Data Lake
          inputs:
            scriptType: bash
            azureSubscription: my-vmssagents-service-connection
            scriptLocation: inlineScript
            inlineScript: |
              az storage blob download-batch \
                --source "data" \
                --account-name "data" \
                --max-connections=6 \
                --destination "$(Build.Repository.LocalPath)" \
                --pattern "${{ file_pattern }}"
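As a possible variation on the pipeline above, all patterns could be handled inside one inline script instead of one task per pattern. The sketch below is only an illustration: the helper names run and download_patterns and the DRY_RUN switch are hypothetical, while the az flags are the ones already used above. It still calls az storage blob download-batch once per pattern, so it does not change the transfer speed of each call, and any time saved is not measured here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: download every pattern with one script invocation.
# $1 = storage account, $2 = container, $3 = destination directory,
# remaining arguments = blob name patterns (quote them to avoid local globbing).
download_patterns() {
  local account="$1" container="$2" dest="$3" pattern
  shift 3
  for pattern in "$@"; do
    run az storage blob download-batch \
      --source "$container" \
      --account-name "$account" \
      --max-connections 6 \
      --destination "$dest" \
      --pattern "$pattern"
  done
}
```

In the inline script this could be called as download_patterns "data" "data" "$(Build.Repository.LocalPath)" 'rf/*c_*b.nc' 'calibration/left.nc', with the remaining patterns appended in the same way.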
The total time to download the 5.7 GB distributed over 8 files is 492 seconds.
With azcopy copy (for which I, unfortunately, need some tricky boilerplate to authenticate), downloading exactly the same data takes 13 seconds!
I have verified that the azcopy tool downloads the exact same files.
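To put the two timings side by side, the effective throughput can be worked out from the numbers reported here. This is a small sketch, assuming decimal units (1 GB = 1000 MB); the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Effective throughput in MB/s. $1 = size in GB, $2 = duration in seconds.
throughput_mb_s() {
  awk -v gb="$1" -v s="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / s }'
}

# Ratio between two durations. $1 = slow duration, $2 = fast duration.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

throughput_mb_s 5.7 492   # az storage blob download-batch
throughput_mb_s 5.7 13    # azcopy copy
speedup 492 13
```

This works out to about 11.6 MB/s for az storage blob download-batch against about 438.5 MB/s for azcopy, a factor of roughly 38.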
For completeness, I used this single pipeline task (azcopy allows multiple patterns):
- task: AzureCLI@2
  displayName: Download using azcopy
  inputs:
    azureSubscription: my-vmssagents-service-connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      export STORE_NAME="data"
      export CONTAINER_NAME="data"
      NOW=`date +"%Y-%m-%dT%H:%M:00Z"` \
        EXPIRY=`date -d "$NOW + 1 day" +"%Y-%m-%dT%H:%M:00Z"` \
        && export SAS_TOKEN=$( az storage container generate-sas \
          --account-name $STORE_NAME \
          --name $CONTAINER_NAME \
          --start $NOW \
          --expiry $EXPIRY \
          --permissions acdlrw \
          --output tsv )
      $(Agent.ToolsDirectory)/azcopy/azcopy copy \
        "https://${STORE_NAME}.blob.core.windows.net/${CONTAINER_NAME}/${{ parameters.folder }}/?${SAS_TOKEN}" \
        "." --recursive --include-pattern "*c_*b.nc;left.nc;right.nc"
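The SAS boilerplate in the task above could probably be tightened a little. The sketch below is an assumption-laden variant, not what the pipeline actually runs: the helper names sas_window and make_readonly_sas are hypothetical, the timestamps are taken in UTC (the original formats local time with a Z suffix), and the token is limited to read and list over HTTPS on the assumption that a download needs nothing more. It relies on GNU date for the -d option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<start> <expiry>" as UTC timestamps,
# with the expiry exactly one day after the start. Requires GNU date.
sas_window() {
  local now fmt
  now="$(date -u +%s)"
  fmt="+%Y-%m-%dT%H:%M:00Z"
  printf '%s %s\n' "$(date -u -d "@$now" "$fmt")" \
                   "$(date -u -d "@$((now + 86400))" "$fmt")"
}

# Hypothetical helper: request a read/list-only SAS token for a container.
# $1 = storage account, $2 = container
make_readonly_sas() {
  local window start expiry
  window="$(sas_window)"
  start="${window% *}"    # text before the space
  expiry="${window#* }"   # text after the space
  az storage container generate-sas \
    --account-name "$1" \
    --name "$2" \
    --start "$start" \
    --expiry "$expiry" \
    --permissions rl \
    --https-only \
    --output tsv
}
```

In the task this would replace the NOW/EXPIRY lines with something like SAS_TOKEN="$(make_readonly_sas "$STORE_NAME" "$CONTAINER_NAME")".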
This is a screenshot of the different tasks of a pipeline that ran, which includes the times that it took to complete each task.
To Reproduce
Create a bunch of files on a storage container and download them.
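A reproduction along these lines could start from locally generated test data. The helper below is a hypothetical sketch (the name make_test_files and the file naming are made up); it only creates files of a chosen size filled with random bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create test files filled with random bytes.
# $1 = directory, $2 = number of files, $3 = size of each file in MiB
make_test_files() {
  local dir="$1" count="$2" size_mib="$3" i=1
  mkdir -p "$dir"
  while [ "$i" -le "$count" ]; do
    head -c "$((size_mib * 1024 * 1024))" /dev/urandom > "$dir/file_$i.nc"
    i=$((i + 1))
  done
}
```

For example, make_test_files ./repro 8 700 would produce roughly the volume described in this report. The files could then be uploaded with az storage blob upload-batch (using --source for the local directory, --destination for the container, and --account-name), and each download command prefixed with time to compare durations.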
Expected behavior
That az storage blob download-batch has performance similar to azcopy.
Environment summary
The Pipelines Agents run on Standard_E8s_v3 Azure VMs with the following cloud-init file:
#cloud-config
package_update: true
packages:
  - gcc
  - git-lfs
  - git
runcmd:
  - export MINICONDA_VERSION=4.8.2
  - export CONDA_VERSION=4.8.2
  - wget --quiet https://repo.continuum.io/miniconda/Miniconda3-py37_${MINICONDA_VERSION}-Linux-x86_64.sh
  - /bin/bash Miniconda3-py37_${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p /opt/conda
  - rm Miniconda3-py37_${MINICONDA_VERSION}-Linux-x86_64.sh
  - echo ". /opt/conda/etc/profile.d/conda.sh" >> /home/AzDevOps/.bashrc
  - /opt/conda/bin/conda config --system --prepend channels conda-forge
  - /opt/conda/bin/conda install python=3.8 netcdf4 numpy pandas xarray tqdm dask qcodes
  - chmod 777 -R /opt/conda
  - curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
When logging into an initialized VM:
a-banijh@banijh-vm00004I:/agent/_work/1/s$ az --version
azure-cli 2.9.0
command-modules-nspkg 2.0.3
core 2.9.0
nspkg 3.0.4
telemetry 1.0.4
Python location '/opt/az/bin/python3'
Extensions directory '/home/a-banijh/.azure/cliextensions'
Python (Linux) 3.6.10 (default, Jul 10 2020, 07:17:28)
[GCC 7.5.0]
Legal docs and information: aka.ms/AzureCliLegal
Your CLI is up-to-date.
Additional context
Other people also report the same issue with az storage file download-batch on StackOverflow (I assume it's using the same functions).
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
Hi @bashnijholt, for this issue we recommend using az storage copy if you want better performance. As we know, azcopy has very good performance, and the CLI also wants to build on azcopy's work due to limited bandwidth. We are continuing to work on the azcopy integration and will do more as part of fixing issue https://github.com/Azure/azure-cli/issues/10741. If you have any other concerns, feel free to let me know.
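For reference, the recommended az storage copy could be used for the download in this issue roughly as follows. This is a sketch under assumptions rather than a verified recipe: the helper names run and copy_with_az_storage_copy and the DRY_RUN switch are hypothetical, authentication is left to whatever the CLI session already provides, and whether --include-pattern is available depends on the installed CLI version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: download matching blobs via az storage copy,
# which delegates the transfer to azcopy.
# $1 = storage account, $2 = container, $3 = destination directory,
# $4 = semicolon-separated file name patterns
copy_with_az_storage_copy() {
  run az storage copy \
    --source "https://$1.blob.core.windows.net/$2" \
    --destination "$3" \
    --recursive \
    --include-pattern "$4"
}
```

With the values from this issue, the call would look like copy_with_az_storage_copy "data" "data" "." "*c_*b.nc;left.nc;right.nc".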
How is this issue solved exactly? Could you please point me to the relevant PRs/commits?