
VERY slow large blob downloads

See original GitHub issue

I am confused about how to optimize BlobClient for downloading large blobs (up to 100 GB).

For example, on a ~480 MB blob the following code takes around 4 minutes to execute:

from azure.storage.blob import BlobClient

full_path_to_file = '{}/{}'.format(staging_path, blob_name)
blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    download_stream = blob.download_blob()
    result = my_blob.write(download_stream.readall())  # buffers the entire blob in memory

In the previous version of the SDK I was able to specify a max_connections parameter that sped up downloads significantly. This appears to have been removed (along with progress callbacks, which is annoying). I have files upwards of 99 GB that will take almost 13 hours to download at this rate, whereas I used to be able to download similar files in under two hours.

How can I optimize the download of large blobs?

Thank you!

Edit: I meant that it took 4 minutes to download a 480 megabyte file. Also, I am getting memory errors when trying to download larger files (~40 GB).
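
A minimal sketch of what seems to address both points in v12, assuming the same connection-string variables as the snippet above: download_blob accepts a max_concurrency argument, and readinto streams directly into the file handle instead of buffering the whole blob in memory the way readall() does (the likely cause of the memory errors).

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    # Parallel ranged download, streamed straight to disk rather than
    # materialized in memory; the concurrency level is an illustrative choice.
    blob.download_blob(max_concurrency=8).readinto(my_blob)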

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 23 (7 by maintainers)

Top GitHub Comments

4 reactions
mockodin commented, Apr 5, 2020

I experienced timeouts on larger downloads as well: >100 GB commonly, and >200 GB would always fail when using .readall() (more on that below). Of note, max_concurrency did NOT resolve this for me. For me it seems that the Auth header timestamp got older than the accepted 25-minute age limit, so the client isn't updating the header automatically. I was able to work around it, in an ugly manner:

  1. Download in 1 GB range-based chunks: download_blob(offset=start, length=end).download_to_stream(MemBlob, max_concurrency=12)
  2. Override the retry settings in BlobServiceClient.from_connection_string(<here>) to fail immediately (the default retries might be the cause of the timeout to begin with)
  3. Validate that the segment received is the size requested
  4. If an exception is thrown or the segment is not the expected size (the last segment will almost always be smaller, of course), then reauth and retry that segment

Rinse and repeat until the download completes; a sketch of the loop follows below. Note that I build a checksum as I download: since I know the checksum of the original file, I have high confidence in file integrity and can validate at the end. Performance-wise, on a 1 Gbps link for a single blob out of cool storage I get ~430 Mbps (53.75 MB/s). The Azure-side cool tier limit is 60 MB/s or thereabouts, so it seems to work pretty well.
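
A minimal sketch of that loop, assuming the v12 Python SDK. The function name and the 1 GB chunk size are illustrative; retry_total=0 is one way to make the client fail fast as in step 2, and readinto is the newer name for the download_to_stream call above.

import hashlib
from io import BytesIO

from azure.storage.blob import BlobClient

CHUNK = 1024 ** 3  # 1 GB ranges, as in step 1

def download_chunked(conn_str, container, name, dest, retries=3):
    def fresh_client():
        # retry_total=0 makes the SDK fail fast (step 2); re-creating the
        # client refreshes the auth header for the reauth in step 4.
        return BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name=name, retry_total=0)

    blob = fresh_client()
    total = blob.get_blob_properties().size
    digest = hashlib.md5()  # running checksum, compared to the source at the end
    with open(dest, "wb") as out:
        offset = 0
        while offset < total:
            length = min(CHUNK, total - offset)
            for _ in range(retries):
                buf = BytesIO()
                try:
                    blob.download_blob(offset=offset, length=length,
                                       max_concurrency=12).readinto(buf)
                except Exception:
                    blob = fresh_client()  # reauth, then retry the segment (step 4)
                    continue
                if buf.tell() == length:  # step 3: segment size check
                    break
                blob = fresh_client()
            else:
                raise IOError("segment at offset {} failed".format(offset))
            data = buf.getvalue()
            digest.update(data)
            out.write(data)
            offset += length
    return digest.hexdigest()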

0 reactions
delahondes commented, Aug 22, 2022

Building on @mockodin's fine remarks, I implemented a file-like object on top of the blob client, and it was very successful (it does not use the reauth trick he mentioned because I did not need it). The download speed improved maybe ten times when using this iterator versus the one included in the SDK. Many thanks to you, @mockodin!

import uuid
from io import BytesIO

from azure.storage.blob import BlobBlock

# Not shown in the original post; assumed value and stand-in so the snippet runs.
CHUNK_SIZE = 32 * 1024 * 1024  # bytes per ranged request

def forbid(method, mode):
    """Return a stand-in that rejects calls not allowed in this mode."""
    def _forbidden(*args, **kwargs):
        raise IOError("{}() is not allowed in mode '{}'".format(method, mode))
    return _forbidden


class ObjectFile:
    """An ObjectFile in object storage that can be opened and closed.
    See Objects.open()"""

    def __init__(self, name, client, mode, size):
        """Initialize the object with a name and a BlobClient.
        mode is 'w' or 'r'; size is the blob size in bytes.
        """
        self.name = name
        self.client = client
        self.block_list = []
        self.mode = mode
        self.__open__ = True
        if mode == 'r':
            self.write = forbid('write', 'r')
        elif mode == 'w':
            # Caveat: iter() looks __iter__ up on the class, so this instance
            # attribute does not actually block iteration; read() is blocked.
            self.__iter__ = forbid('__iter__', 'w')
            self.read = forbid('read', 'w')
        self.pos = 0
        self.size = size

    def write(self, chunk):
        """Write a chunk of data (a part of the data) into the object."""
        block_id = str(uuid.uuid4())
        self.client.stage_block(block_id=block_id, data=chunk)
        self.block_list.append(BlobBlock(block_id=block_id))

    def close(self):
        """Finalize the object."""
        if self.mode == 'w':
            self.client.commit_block_list(self.block_list)
        self.__open__ = False

    def __del__(self):
        if self.__open__:
            self.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.__open__:
            self.close()

    def __iter__(self):
        self.pos = 0
        return self

    def __next__(self):
        if self.pos >= self.size:
            raise StopIteration()
        size = min(CHUNK_SIZE, self.size - self.pos)
        data = BytesIO()
        # Ranged, parallel download of the next chunk.
        self.client.download_blob(offset=self.pos, length=size
            ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()

    def read(self, size=None):
        # Read from the current position, clamped to the end of the blob.
        if self.pos >= self.size:
            return b''
        if size is None or self.pos + size > self.size:
            size = self.size - self.pos
        data = BytesIO()
        self.client.download_blob(offset=self.pos, length=size
            ).download_to_stream(data, max_concurrency=12)
        self.pos += size
        return data.getvalue()
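
For illustration, a hypothetical way to drive this class for a large read, assuming a v12 BlobClient and the connection-string variables from the original question:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    connection_string, container_name=container_name, blob_name=blob_name)
size = blob.get_blob_properties().size

with open("big.bin", "wb") as out, ObjectFile("big.bin", blob, "r", size) as obj:
    for chunk in obj:  # each iteration fetches one CHUNK_SIZE range in parallel
        out.write(chunk)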
