
VERY slow large blob downloads

See original GitHub issue

I am confused about how to optimize BlobClient for downloading large blobs (up to 100 GB).

For example, on a ~480 MB blob the following code takes around 4 minutes to execute:

from azure.storage.blob import BlobClient

full_path_to_file = '{}/{}'.format(staging_path, blob_name)
blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    download_stream = blob.download_blob()
    result = my_blob.write(download_stream.readall())  # buffers the entire blob in memory

In the previous version of the SDK I was able to specify a max_connections parameter that sped up downloads significantly. This appears to have been removed (along with progress callbacks, which is annoying). I have files upwards of 99 GB that will take almost 13 hours to download at this rate, whereas I used to be able to download similar files in under two hours.

How can I optimize the download of large blobs?

Thank you!

Edit: I meant that it took 4 minutes to download a 480 megabyte file. Also, I am getting memory errors when trying to download larger files (~40 GB).
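
A minimal sketch of what seems to address both points in v12, assuming the same connection-string variables as the snippet above: download_blob accepts a max_concurrency argument, and readinto streams directly into the file handle instead of buffering the whole blob in memory the way readall() does (the likely cause of the memory errors).

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    # Parallel ranged download, streamed straight to disk rather than
    # materialized in memory; the concurrency level is an illustrative choice.
    blob.download_blob(max_concurrency=8).readinto(my_blob)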

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 23 (7 by maintainers)

Top GitHub Comments

4 reactions
mockodin commented, Apr 5, 2020

I experienced timeouts on larger downloads as well: >100 GB commonly, and >200 GB would always fail when using .readall() (more on that below). Of note, max_concurrency did NOT resolve this for me. For me it seems that the Auth header timestamp got older than the accepted 25-minute age limit, so the client isn't updating the header automatically. I was able to work around it, in an ugly manner:

  1. Download in 1 GB range-based chunks: download_blob(offset=start, length=end).download_to_stream(MemBlob, max_concurrency=12)
  2. Override the retry settings in BlobServiceClient.from_connection_string(<here>) to fail immediately (the default retries might be the cause of the timeout to begin with)
  3. Validate that the segment received is the size requested
  4. If an exception is thrown or the segment is not the expected size (the last segment will almost always be smaller, of course), then reauth and retry that segment

Rinse and repeat until the download completes; a sketch of the loop follows below. Note that I build a checksum as I download: since I know the checksum of the original file, I have high confidence in file integrity and can validate at the end. Performance-wise, on a 1 Gbps link for a single blob out of cool storage I get ~430 Mbps (53.75 MB/s). The Azure-side cool tier limit is 60 MB/s or thereabouts, so it seems to work pretty well.
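
A minimal sketch of that loop, assuming the v12 Python SDK. The function name and the 1 GB chunk size are illustrative; retry_total=0 is one way to make the client fail fast as in step 2, and readinto is the newer name for the download_to_stream call above.

import hashlib
from io import BytesIO

from azure.storage.blob import BlobClient

CHUNK = 1024 ** 3  # 1 GB ranges, as in step 1

def download_chunked(conn_str, container, name, dest, retries=3):
    def fresh_client():
        # retry_total=0 makes the SDK fail fast (step 2); re-creating the
        # client refreshes the auth header for the reauth in step 4.
        return BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name=name, retry_total=0)

    blob = fresh_client()
    total = blob.get_blob_properties().size
    digest = hashlib.md5()  # running checksum, compared to the source at the end
    with open(dest, "wb") as out:
        offset = 0
        while offset < total:
            length = min(CHUNK, total - offset)
            for _ in range(retries):
                buf = BytesIO()
                try:
                    blob.download_blob(offset=offset, length=length,
                                       max_concurrency=12).readinto(buf)
                except Exception:
                    blob = fresh_client()  # reauth, then retry the segment (step 4)
                    continue
                if buf.tell() == length:  # step 3: segment size check
                    break
                blob = fresh_client()
            else:
                raise IOError("segment at offset {} failed".format(offset))
            data = buf.getvalue()
            digest.update(data)
            out.write(data)
            offset += length
    return digest.hexdigest()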

0 reactions
delahondes commented, Aug 22, 2022

Building on @mockodin's fine remarks, I implemented a file-like object on top of the blob client, and it was very successful (it does not use the reauth trick he mentioned because I did not need it). The download speed improved maybe ten times when using this iterator versus the one included in the SDK. Many thanks to you, @mockodin!

import uuid
from io import BytesIO

from azure.storage.blob import BlobBlock

# Not shown in the original post; assumed value and stand-in so the snippet runs.
CHUNK_SIZE = 32 * 1024 * 1024  # bytes per ranged request

def forbid(method, mode):
    """Return a stand-in that rejects calls not allowed in this mode."""
    def _forbidden(*args, **kwargs):
        raise IOError("{}() is not allowed in mode '{}'".format(method, mode))
    return _forbidden


class ObjectFile:
    """An ObjectFile in object storage that can be opened and closed.
    See Objects.open()"""

    def __init__(self, name, client, mode, size):
        """Initialize the object with a name and a BlobClient.
        mode is 'w' or 'r'; size is the blob size in bytes.
        """
        self.name = name
        self.client = client
        self.block_list = []
        self.mode = mode
        self.__open__ = True
        if mode == 'r':
            self.write = forbid('write', 'r')
        elif mode == 'w':
            # Caveat: iter() looks __iter__ up on the class, so this instance
            # attribute does not actually block iteration; read() is blocked.
            self.__iter__ = forbid('__iter__', 'w')
            self.read = forbid('read', 'w')
        self.pos = 0
        self.size = size

    def write(self, chunk):
        """Write a chunk of data (a part of the data) into the object."""
        block_id = str(uuid.uuid4())
        self.client.stage_block(block_id=block_id, data=chunk)
        self.block_list.append(BlobBlock(block_id=block_id))

    def close(self):
        """Finalize the object."""
        if self.mode == 'w':
            self.client.commit_block_list(self.block_list)
        self.__open__ = False

    def __del__(self):
        if self.__open__:
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.__open__:
            self.close()

    def __iter__(self):
        self.pos = 0
        return self

    def __next__(self):
        if self.pos >= self.size:
            raise StopIteration()
        size = min(CHUNK_SIZE, self.size - self.pos)
        data = BytesIO()
        # Ranged, parallel download of the next chunk.
        self.client.download_blob(offset=self.pos, length=size
            ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()

    def read(self, size=None):
        # Read from the current position, clamped to the end of the blob.
        if self.pos >= self.size:
            return b''
        if size is None or self.pos + size > self.size:
            size = self.size - self.pos
        data = BytesIO()
        self.client.download_blob(offset=self.pos, length=size
            ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()
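
For illustration, a hypothetical way to drive this class for a large read, assuming a v12 BlobClient and the connection-string variables from the original question:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    connection_string, container_name=container_name, blob_name=blob_name)
size = blob.get_blob_properties().size

with open("big.bin", "wb") as out, ObjectFile("big.bin", blob, "r", size) as obj:
    for chunk in obj:  # each iteration fetches one CHUNK_SIZE range in parallel
        out.write(chunk)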
