updater: abstract out the network IO
This might be relevant to the Updater redesign (#1135) and, if accepted, would deprecate #1142 and PR #1171.
We (Joshua, Martin, Teodora and I) have been talking about abstracting some of the client functionality out of the Updater itself. The biggest issue from my perspective is network IO. Teodora already made a PR to let the application download targets, but it seems like there are still issues with TUF handling metadata downloads.
Why is this needed?
- In the real world, applications already use a network stack and will keep using it after integrating TUF: we should not force another one on them
- Even if the application's network stack and TUF's are the same, the fact that they use different sessions and configurations is not great
- Complex applications have legitimate needs to configure many things we don’t want to provide an API for: user agent, proxies, basic authentication, custom request headers. This applies to both metadata and targets
- Complex applications have legitimate needs to control the download process (e.g. progress information, canceling)
- Complex applications have (legitimate?) needs to poke at low-level details like timeouts
Potential solutions
We identified two main solutions to this:
- Make a new event-based non-blocking client API. This would be most flexible but also more complex for TUF maintainers to maintain and application developers to customize
- Keep the current API but add a new Fetcher interface that applications can optionally implement. This is likely fairly easy and non-invasive to implement but remains a blocking API
I’m proposing option 2 but for reference please see the draft of option 1 as well.
Proposal
Add a Fetcher interface that applications can implement. Provide a default implementation of Fetcher. Add a new method to Updater that Fetcher can use to provide the data it fetches.
The Updater processes (`refresh()`, `get_one_valid_targetinfo()` and `download_target()`) will now look like this:
- Whenever a remote file (metadata or target) is needed:
  - set up a temporary file to write results to
  - call `Fetcher.fetch()`
  - the fetcher calls `Updater.provide_fetched_data()` zero or more times to provide chunks of data; the Updater writes these chunks into the file
  - when the fetcher returns without exceptions, the download is finished and written to the file
This is like the go-tuf RemoteStore abstraction, with two differences:
1. Python does not have reasonable stream abstractions like Go's `io.ReadCloser` (one that would actually be implemented by any of the network stacks), so we cannot return something like that: instead our implementation blocks and calls a `provide_fetched_data()` callback on the Updater.
2. Metadata and target fetching are not separated: this way the Fetcher does not need any understanding of TUF or server structure; it’s just a dumb downloader.
```python
# Only new/changed methods mentioned for Updater
class Updater(object):
    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: "Fetcher" = None):
        ...

    # Accepts content of the URL that is currently being fetched.
    # Can be called only from the Fetcher.fetch() that this Updater called.
    def provide_fetched_data(self, data: bytes):
        ...


# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):
    # Fetches the contents of an HTTP/HTTPS URL from a remote server.
    # Calls self.updater.provide_fetched_data() to forward sequential
    # chunks of bytes to the updater. Returns when the download is
    # complete and all bytes have been fed to the updater.
    @abc.abstractmethod
    def fetch(self, url: str, length: int):
        pass

    # Called by Updater init
    def set_updater(self, updater: "Updater"):
        self.updater = updater
```
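To make the callback flow concrete, here is a runnable sketch of the proposed round trip. The simplified `Updater` internals and the `CannedFetcher` are my own illustrations, not the eventual tuf implementation; a real fetcher would do HTTP instead of feeding canned bytes.

```python
import abc
import tempfile


class Updater:
    """Simplified stand-in showing only the pieces relevant to fetching."""

    def __init__(self, fetcher):
        self._fetcher = fetcher
        fetcher.set_updater(self)
        self._current_file = None

    def _download(self, url, length):
        # Set up a temporary file, then let the fetcher push chunks into
        # it via provide_fetched_data().
        with tempfile.TemporaryFile() as temp_file:
            self._current_file = temp_file
            try:
                self._fetcher.fetch(url, length)
            finally:
                self._current_file = None
            temp_file.seek(0)
            return temp_file.read()

    def provide_fetched_data(self, data):
        # Only valid while a fetch() started by this Updater is running.
        assert self._current_file is not None
        self._current_file.write(data)


class Fetcher(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def fetch(self, url, length):
        pass

    def set_updater(self, updater):
        self.updater = updater


class CannedFetcher(Fetcher):
    """Feeds canned bytes in 4-byte chunks; a real one would do HTTP."""

    def __init__(self, content):
        self._content = content

    def fetch(self, url, length):
        for i in range(0, length, 4):
            self.updater.provide_fetched_data(self._content[i:i + 4])


updater = Updater(CannedFetcher(b"metadata-bytes"))
result = updater._download("https://example.com/1.root.json", 14)
```

Note that the fetcher never interprets the bytes; it only moves them, which is exactly the "dumb downloader" property the proposal is after.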
I think this is fairly straightforward to implement even without a client redesign (and will be backwards-compatible). download.py is split into two parts: one part contains the tempfile-handling bits and _check_downloaded_length() and is used by the updater itself; the rest of download.py forms the default Fetcher implementation.
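The length check kept on the updater side could be as small as the sketch below. This is a hedged simplification: tuf's actual _check_downloaded_length() takes more parameters, and the exception name here is illustrative.

```python
class DownloadLengthMismatchError(Exception):
    """Observed download length differs from the trusted required length."""


def check_downloaded_length(observed_length, required_length):
    # The required length comes from signed metadata, so both truncated
    # and oversized downloads must be rejected.
    if observed_length != required_length:
        raise DownloadLengthMismatchError(
            f"expected {required_length} bytes, observed {observed_length}"
        )
```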
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 3
- Comments: 34 (33 by maintainers)
Top GitHub Comments
👋
@trishankatdatadog: Yes, HTTPX has timeouts enabled by default for TCP connect/read/write, as well as connection pool acquiry. They’re all 5 seconds by default, and configurable. So e.g. if the remote server takes > 5s to send a chunk after HTTPX started to `.recv()`, we hard-shut the connection and raise an `httpx.ReadTimeout` exception.

I don’t know if this corresponds to the “slow retrieval” attack scenario here. E.g. it’s still possible for a remote server to send 1-byte chunks every 5s and be just fine as far as HTTPX is concerned.
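A slow-retrieval defense along these lines can be expressed as a rate check wrapped around the chunk iterator. The sketch below is my own illustration, not tuf's or HTTPX's actual code; the constants are arbitrary, and the clock is injectable so the behavior is deterministic.

```python
import time


class SlowRetrievalError(Exception):
    """Average download rate fell below the configured minimum."""


def enforce_min_rate(chunks, min_bytes_per_second, grace_seconds=1.0,
                     clock=time.monotonic):
    """Yield chunks, raising if the average rate drops below the minimum.

    A brief grace period avoids false positives at connection start.
    """
    start = clock()
    received = 0
    for chunk in chunks:
        received += len(chunk)
        elapsed = clock() - start
        if elapsed > grace_seconds and received / elapsed < min_bytes_per_second:
            raise SlowRetrievalError(
                f"{received} bytes in {elapsed:.1f}s is below "
                f"{min_bytes_per_second} B/s"
            )
        yield chunk
```

Because it only wraps an iterator, such a check could live on the updater side and work with any Fetcher, which is one argument for keeping the rate knobs out of the network stack entirely.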
We don’t have “write rate” or “max size” knobs built-in either. We do however provide a customization mechanism. HTTPX is actually separated in two projects: HTTPX itself, which does high-level client smarts, and HTTPCore, which does low-level HTTP networking. The interface between the two is the “Transport API”. HTTPX provides default transports, but it’s possible to switch it out for something else, such as a wrapper transport. Our docs on this are still nascent, but there are many features that can be implemented at this level. In particular anything that wants to control the flow of bytes (upload or download) would fit there very nicely. Example:
There may be some bits to work out, but if this sounds like something that could fit your use case I’d be happy to work through any specifics.
(Getting down to the transport API level may also be totally overkill. HTTPX provides a nice and simple `.stream()` method that returns a streaming response; you can iterate over response content and do chunk-wise operations at that level too. But if you do need that amount of low-level-ness, it’s there.)

On a separate note, I read through this thread and wondered: was it considered to have `.fetch()` return a bytes iterator (most typically a generator)? The main benefit here is lower coupling: it removes any references to the `Updater` at the `Fetcher` level. Fits the “we’ll call you” mindset Trishank mentioned earlier. Here’s an example implementation using HTTPX:
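The original snippet did not survive the page capture; the sketch below reconstructs the iterator-returning idea under my own assumptions. The class names are illustrative, and the HTTPX call appears only in a comment so the example runs without a network.

```python
import abc
from typing import Iterator


class Fetcher(metaclass=abc.ABCMeta):
    """Variant where fetch() returns a bytes iterator instead of calling
    back into the Updater, removing the Fetcher -> Updater reference."""

    @abc.abstractmethod
    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        raise NotImplementedError


# With HTTPX installed, an implementation would be roughly:
#
#     class HttpxFetcher(Fetcher):
#         def fetch(self, url, length):
#             with httpx.stream("GET", url) as response:
#                 yield from response.iter_bytes()

class InMemoryFetcher(Fetcher):
    """Stand-in that needs no network, so this sketch is runnable."""

    def __init__(self, content, chunk_size=4):
        self._content = content
        self._chunk_size = chunk_size

    def fetch(self, url, length):
        for i in range(0, length, self._chunk_size):
            yield self._content[i:i + self._chunk_size]


# The Updater "drives the flow" simply by iterating:
fetcher = InMemoryFetcher(b"hello world")
received = b"".join(fetcher.fetch("https://example.com/targets.json", 11))
```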
The `Updater` can then “drive the flow” of the request, e.g. by iterating over the iterator. Not sure if this would fit the use case, as I obviously have very limited context here 😃 For example, this won’t work if you need file operations like `.seek()`, but thought I’d mention it. HTTPCore makes extensive use of iterators as a way to represent a “drivable” stream of bytes (both for request bodies and response bodies). In an HTTP context we really just need one-way iteration rather than full file-like controls.

Some great discussion here that has resulted in a refined design, thanks all! If I’m following correctly, we’ve agreed on the following to abstract out the network I/O: