Performance worsens greatly when downloading blobs with long breaks in between

See original GitHub issue
  • Package Name: azure-storage-blob
  • Package Version: 12.6.0 (but it also happens on 12.13.1)
  • Operating System: linux (but it doesn’t really matter, it happens on my mac too)
  • Python Version: 3.8

Describe the bug
When using the asyncio SDK, if we wait more than 10 seconds between blob downloads the download time becomes 10-100 times slower.

To Reproduce
I wrote this script to demonstrate the issue:

import random
from asyncio import gather, sleep, run
from time import perf_counter
from azure.storage.blob.aio import BlobServiceClient, ContainerClient

all_times = []

async def download(client: ContainerClient, path: str):
    s = perf_counter()
    downloader = await client.download_blob(path)
    await downloader.readall()
    e = (perf_counter() - s) * 1000
    all_times.append(e)
    print(f"download took {e}ms file={path}")

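def get_url_and_key():
    ### censored for security reasons ###
    # Placeholder body so the script parses; return your own (account_url, account_key).
    raise NotImplementedError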

async def do():
    account_url, account_key = get_url_and_key()
    blob_client = BlobServiceClient(account_url, account_key)
    container_client = blob_client.get_container_client('some-container')

    file_pool = []
    async for x in container_client.list_blobs():
        if len(file_pool) >= 50:  # stop once 50 names are collected
            break
        file_pool.append(x.name)

    for i in range(30):
        files = set()
        while len(files) < 3:
            files.add(file_pool[random.randint(0, len(file_pool) - 1)])
        await gather(*[download(container_client, file) for file in files])
        await sleep(20)

    await container_client.close()
    await blob_client.close()

    print(f"average is {sum(all_times) / len(all_times)}ms")

run(do())

What this script essentially does:

  1. find the names of 50 blobs that exist in the container
  2. randomly choose 3 of them to download
  3. download them asynchronously (gather 3 tasks) and print the download times
  4. sleep a certain amount of seconds
  5. repeat (30 times) and print the average download time

If the sleep time is set to 10 seconds or below, I get expectedly low download times:

download took 23.274756036698818ms file=+1b2+N3X7cgkoaajoUSlCA==
download took 31.523440033197403ms file=+/K/vccgIKxIsFW6tk8l3Q==
download took 44.199715834110975ms file=+/dVLONMzj3tMtql+Zv7Jg==
download took 33.404441084712744ms file=+/UeV81yZRT+Eh5jeUKlOA==
.
.
.
download took 12.770875822752714ms file=+16GIyseiqsaO8T+mrXLQg==
download took 9.113383013755083ms file=+1YF6MfP5Xb+sWEqgr36LQ==
download took 7.825685199350119ms file=+/ZJM5EQ40nni2+zKjJA4A==
download took 7.158686872571707ms file=+1b2+N3X7cgkoaajoUSlCA==
average is 20.692046359181404ms

But if I set it to 20 seconds, for example, the download times become insanely high (beyond the first batch):

download took 9.756594430655241ms file=+0huQr/6tn7zw5+p2fT/2A==
download took 20.794587209820747ms file=+1b2+N3X7cgkoaajoUSlCA==
download took 45.60287203639746ms file=+1h1V/YFgN130/9kzcIygQ==
download took 885.5280890129507ms file=++nRJgH2HUF30/9kzcIygQ==
download took 884.8122912459075ms file=+1YF6MfP5Xb+sWEqgr36LQ==
download took 905.2660423330963ms file=+/soiSTB/5Wb0zHn28+O6Q==
.
.
.
download took 930.0287216901779ms file=+1ZAwh3CRGuuBvav8e/34w==
download took 927.8017273172736ms file=+1QWh0iEdgVt/VX2PlPDpw==
download took 940.912198740989ms file=+/FxOL5ySAQaO8T+mrXLQg==
download took 639.4636919721961ms file=++O8gZmF/40FPYL5+VRhAA==
download took 664.1455427743495ms file=+/soiSTB/5Wb0zHn28+O6Q==
download took 669.6559321135283ms file=++zejjUIwmWuBvav8e/34w==
average is 463.2759254503374ms

I’ve run this countless times with different sleep times on my own machine (macOS) and on the production machine (Linux), at different traffic loads and with different storage accounts, and always got the same results, so we can rule out load issues on Azure Blob Storage or file size variance.

It seems to me that after a certain amount of idle time the connection to Azure Blob Storage dies and takes forever to revive. It seems like a bug to me, but if there’s a way to tweak that so the connection stays alive for, say, 3 minutes of inactivity (eliminating the need to revive it), it would solve the issue for now.

We’ve moved to the async SDK to be able to download many files asynchronously and reduce our response time, but it seems like the opposite has happened and it greatly increased it, to the point this SDK is unusable to us. Quick assistance will be greatly appreciated.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
jalauzon-msft commented, Oct 6, 2022

Hi @asafalon1, we have done some fairly extensive investigation into this issue, and we believe this is mainly being caused by TCP Connection re-use in the aiohttp layer.

When a TCP Connection is first opened to the server, it takes additional time to establish that connection. Additionally, if configured, an existing TCP Connection can be kept open (alive) for a period of time during which the connection does not need to be re-established and can be re-used. For aiohttp, the HTTP library behind our async library, the default value for this idle timeout is 15 seconds. So, when we sleep for 20 seconds, the TCP connection that was opened by the first set of requests is closed, and it takes time to reopen this connection on the next set of downloads. Additionally, my guess for why this affects all executing threads is that there would be one connection per thread in this case.
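To make the timing argument concrete, here is a toy model of a single pooled connection with an idle timeout. It is only a sketch of the reasoning above, not aiohttp's actual pooling logic: the class and function names are hypothetical, it ignores how long each download takes, and the 15 second value is taken from this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleConnectionModel:
    """Toy model of one pooled connection with an idle (keep-alive) timeout."""

    keepalive_timeout: float            # seconds an idle connection stays reusable
    last_used: Optional[float] = None   # time of the previous request, None before the first
    reconnects: int = 0                 # how many times a new connection had to be opened

    def request(self, now: float) -> bool:
        """Record a request at time `now`; return True if a new connection was needed."""
        expired = (
            self.last_used is None
            or (now - self.last_used) > self.keepalive_timeout
        )
        if expired:
            self.reconnects += 1
        self.last_used = now
        return expired


def count_reconnects(keepalive_timeout: float, gap: float, batches: int) -> int:
    """Count connection setups for `batches` requests spaced `gap` seconds apart."""
    conn = IdleConnectionModel(keepalive_timeout)
    for i in range(batches):
        conn.request(now=i * gap)
    return conn.reconnects


if __name__ == "__main__":
    for timeout, gap in [(15, 10), (15, 20), (120, 20)]:
        n = count_reconnects(timeout, gap, batches=30)
        print(f"timeout={timeout}s gap={gap}s -> {n} connection setups")
```

In this model, a 10 second gap with a 15 second timeout pays the connection cost only once, while a 20 second gap pays it on every batch. That matches the pattern in the reported timings, where only the first batch is fast.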

What we don’t fully understand yet is why the additional download time is not fixed and seems to scale with the blob size. The time taken to establish a TCP connection should be fixed but it seems there is something additional happening in aiohttp causing this additional time to scale with the response size.

The good news is that this idle timeout is configurable on the aiohttp side, and we have found that increasing it beyond how long you are sleeping seems to fix the issue. Currently the best way to configure this would be to create and configure your own ClientSession object and pass it to the BlobServiceClient. This can be done like this:

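from aiohttp import ClientSession, TCPConnector
from azure.storage.blob.aio import BlobServiceClient

# Keep idle connections open for 120 seconds (longer than the sleep between downloads)
connector = TCPConnector(keepalive_timeout=120)
session = ClientSession(connector=connector)
# account_url and account_key are the same values as in the script above
blob_client = BlobServiceClient(account_url, account_key, session=session)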

The one caveat here is that you are no longer using the default Session that azure-core provides for you, which may mean missing out on some of our default configuration. Currently we don’t add much additional configuration beyond the defaults, so it may not be a problem. You can see where azure-core constructs the ClientSession here, so you may want to consider adding this additional configuration as well. We are currently thinking about ways to improve specifying this configuration.
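For readers wiring this into an asyncio program, the following is a minimal sketch of one way to do it, not the SDK's recommended pattern. It creates the ClientSession inside a coroutine, since aiohttp expects sessions to be created while an event loop is running. The helper `keepalive_for`, its 30 second margin, and the function `download_with_custom_session` are hypothetical names chosen for illustration. The `session=` keyword is the one shown in this comment.

```python
import asyncio


def keepalive_for(idle_gap_seconds: float, margin_seconds: float = 30.0) -> float:
    """Pick a keep-alive timeout longer than the expected idle gap (the margin is an assumption)."""
    return idle_gap_seconds + margin_seconds


async def download_with_custom_session(
    account_url: str,
    credential: str,
    container: str,
    blob_name: str,
    idle_gap_seconds: float,
) -> bytes:
    """Download one blob through a ClientSession whose idle timeout exceeds the idle gap."""
    # Imported lazily so that this module can be loaded without the third-party packages.
    from aiohttp import ClientSession, TCPConnector
    from azure.storage.blob.aio import BlobServiceClient

    connector = TCPConnector(keepalive_timeout=keepalive_for(idle_gap_seconds))
    # Create the session inside the coroutine, while the event loop is running.
    async with ClientSession(connector=connector) as session:
        async with BlobServiceClient(account_url, credential, session=session) as service:
            container_client = service.get_container_client(container)
            downloader = await container_client.download_blob(blob_name)
            return await downloader.readall()


# Example call (placeholder values, needs real credentials):
# data = asyncio.run(
#     download_with_custom_session(account_url, account_key, "some-container", "some-blob", 20)
# )
```

In a long-running service you would more likely create the session and client once at startup and reuse them across downloads, since the keep-alive setting only helps when the same session outlives the idle gaps.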

Hopefully that information helps you. Please let me know if you have any further questions. Thanks.

CC @xiangyan99 and @annatisch

0 reactions
msftbot[bot] commented, Oct 14, 2022

Hi @asafalon1, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.
