
Proposal: AsyncClient API unification


To fully support HTTP/2, httpx will want to support multiplexing, arguably the most important HTTP/2 feature:

For both SPDY and HTTP/2 the killer feature is arbitrary multiplexing on a single well congestion controlled channel. It amazes me how important this is and how well it works. One great metric around that which I enjoy is the fraction of connections created that carry just a single HTTP transaction (and thus make that transaction bear all the overhead). For HTTP/1 74% of our active connections carry just a single transaction - persistent connections just aren’t as helpful as we all want. But in HTTP/2 that number plummets to 25%. That’s a huge win for overhead reduction. Let’s build the web around that. (HTTP/2 is Live in Firefox, Patrick McManus)

Existing parallel() API

Since HTTP/2 is multiplexed, it means that clients such as httpx should only open one connection per origin, even (especially?) for concurrent requests. This is indeed possible with the proposed parallel() API:

client = httpx.AsyncClient()
async with client.parallel() as parallel:
    pending_one = await parallel.get('https://example.com/1')
    pending_two = await parallel.get('https://example.com/2')
    response_one = await pending_one.get_response()
    response_two = await pending_two.get_response()

In that case, the TCP connection to example.com can live in an async task owned by the parallel object, receive orders with await parallel.get() and return responses with the await pending.get_response() calls. That works well, but I believe four things could be improved here:

  1. The APIs for performing serial requests and for performing parallel requests are different, so one must choose carefully, and evolving one’s code from one API to the other requires work.
  2. Since most people don’t read the docs, they’re likely to launch requests in parallel using the client directly, and won’t notice that their code doesn’t take advantage of HTTP/2 multiplexing.
  3. Launching tasks is not the job of the HTTP client: that should be left to each async framework, since they all have a preferred style.
  4. The parallel API increases the API surface of httpx.

Proposed unified async API

Based on those observations, I believe a better API would be to only allow instantiating the client using a context manager, eg. async with httpx.AsyncClient() as client. This then allows different styles. I’m not very familiar with asyncio, but I believe the above example would become:

async with httpx.AsyncClient() as client:
    task_one = asyncio.create_task(client.get('https://example.com/1'))
    task_two = asyncio.create_task(client.get('https://example.com/2'))
    response_one = await task_one
    response_two = await task_two

But you can also use other primitives, such as asyncio.gather:

async with httpx.AsyncClient() as client:
    for response in await asyncio.gather(
            client.get('https://example.com/1'),
            client.get('https://example.com/2')):
        ...

And this fits more easily with other async frameworks, such as trio:

async with httpx.AsyncClient("trio") as client:
    async with trio.open_nursery() as nursery:
        nursery.start_soon(client.get, 'https://example.com/1')
        nursery.start_soon(client.get, 'https://example.com/2')

And it would also be the preferred way to launch single requests:

async with httpx.AsyncClient() as client:
    response = await client.get('https://example.com/1')

What about the sync client?

The same logic applies to the sync client, and I would personally reuse what Python already offers for running tasks in parallel: concurrent.futures. An executor specific to httpx, using asyncio behind the scenes, could also work.

Of course, it makes sense to keep httpx.get(url) for backwards compatibility with requests and to deal with the common case where users only need to make a single sync request.

Conclusion

I believe this new proposed API has the following advantages:

  1. it’s more unified (only one way to do it)
  2. it’s natural to go from serial requests to concurrent requests
  3. it delegates parallel task creation to the async framework
  4. it reduces the API surface

The drawbacks are that the sync/async cases no longer use the same parallel() API and that the simple common async case is slightly more cumbersome to type, but I personally believe that this proposal is a better compromise.

What do you think?

(Disclaimer: this idea is originally from @njsmith, I took the time to turn it into a proposal and probably add errors of my own.)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 3
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

njsmith commented, Aug 21, 2019 (1 reaction)

I dunno, I’m mostly relying on @lukasa here, but he spent some effort pounding it into my head that HTTP/2 definitely needed a task reading from the underlying connection at all times.

I don’t think letting whichever task happens to be interacting with the connection do the driving is going to work in any case, because of cancellation. If your data send operation gets cancelled in the middle, it’s extremely difficult to recover in any reasonable way. If it’s in a background task where cancellation means that the whole async with open_session block is getting closed down, then that’s OK, but if cancelling a get call can cause the underlying connection to get corrupted then that’s no good.

We’ve spent a lot of time trying to figure out these patterns over the last few years, with simpler protocols like TLS and websocket. I originally thought like you are now, but I discovered I was wrong 😃. Trio’s TLS code does use the pattern you describe. That’s a much simpler protocol flow-control wise – basically just two unidirectional streams that barely interact – and I think it’s about at the limit of what you can handle that way; it took a ton of effort to get working and it still has some edge cases around cancellation that I’m not quite sure we’re handling right.
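A toy sketch of the pattern described above (my own illustration in asyncio, not Trio's or httpx's actual code): the session owns a single background task that does all I/O on the connection, and callers only ever wait on a future, so cancelling a caller cannot leave the connection half-written.

```python
import asyncio

# Toy sketch (not httpx internals): the session owns one background task
# that does ALL reads/writes on the "connection".  Callers hand work to
# that task via a queue and wait on a future, so cancelling a caller only
# abandons its future -- the connection state is never left half-written.
class Session:
    def __init__(self):
        self._requests: asyncio.Queue = asyncio.Queue()
        self._worker = None

    async def __aenter__(self):
        self._worker = asyncio.create_task(self._drive_connection())
        return self

    async def __aexit__(self, *exc):
        # Cancellation here tears down the whole session, which is the
        # one place where cancelling the I/O task is safe.
        self._worker.cancel()

    async def _drive_connection(self):
        while True:
            path, fut = await self._requests.get()
            # Stand-in for the real send/recv on the multiplexed connection.
            await asyncio.sleep(0)
            if not fut.cancelled():
                fut.set_result(f"response for {path}")

    async def get(self, path):
        fut = asyncio.get_running_loop().create_future()
        await self._requests.put((path, fut))
        return await fut

async def main():
    async with Session() as session:
        return await session.get("/1")

result = asyncio.run(main())
print(result)
```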

tomchristie commented, Aug 21, 2019 (1 reaction)

To fully support HTTP/2, httpx will want to support multiplexing, arguably the most important HTTP/2 feature.

We do.

If you’re using multiple tasks, or multi threading, with a single client, then any HTTP/2 connections will already be multiplexed. In eg. a web app environment you don’t really see that because the server is handling the concurrency aspect for you, so each individual code path reads as a single sequential request, but you’ll actually end up having multiplexed requests across the same client. And yes, the AsyncClient can be used with any standard concurrency primitives (for whichever backend).

(Proviso: I think we may have some niggly races that aren’t yet resolved in the threaded case, but that’s a “we’re in alpha” buglette rather than an interface issue.)

Also “instantiating the client using a context manager, eg. async with httpx.AsyncClient() as client” isn’t the right level here. Eg. supposing we’re in a web app, then…

async def homepage():
    async with httpx.AsyncClient() as client:
        ... # Load a couple of resources in parallel
    ... # Return a response

Is the wrong thing to do, because you want to make sure you’re using shared connection pooling across all the incoming requests, rather than just within the context of a single endpoint during a single request/response cycle. So what you actually want is…

async def homepage():
    async with TrioNurseryOrAsyncioSupervisorOrClientParallelContextOrWhatever:
        ... # Load a couple of resources in parallel using a globally shared `client` instance.
    ... # Return a response

What this issue actually reduces to is “let’s not introduce the parallel requests API”.

That’s feasible although there’s two primary reasons why we might want the parallel requests API…

  • Providing async concurrency but from within a standard threaded interface. (This is the big one really, we can give users a really lightweight way of taking advantage of concurrent requests but within standard threaded codebases.)
  • We really only want users to be using structured concurrency styles for branching. The parallel requests API meets that constraint, and stops folks using weaker asyncio primitives. (This is the weaker of the two reasons, and could be addressed in other ways, in particular by asyncio adopting an equivalent to “nursery” and starting to educate the ecosystem to prefer structured concurrency styles wherever possible.)

I think there’s also a broader issue here, around context managed vs. non context managed APIs. In particular:

  • Client instances. (This would ensure that connection pools are strictly closed off once a client is out of scope. Practically, most environments will actually just want a single client for the lifetime of the application, so it’s probably pretty useful to provide both context managed and unmanaged styles.)
  • Streaming responses. (Right now we’re following the requests API, which gives either context managed or unmanaged styles. The unmanaged style is a real gotcha there to my eyes, as it’s super easy to create and not close streaming responses. That probably doesn’t end up being a hard bug in practice, but rather just an unseen drag on resources or whatever)

Summary: I think that we should close the issue off, but not start or consider #52 until we’ve fully addressed the prerequisites of making sure that we’ve got support for both an asyncio and a trio backend, and that we’re all happy with whatever ConcurrencyBackend interfaces are necessary in order to adequately support that.
