updater: abstract out the network IO

This might be relevant to the Updater redesign (#1135) and, if accepted, would deprecate #1142 and the PR #1171.

We (me, Joshua, Martin, Teodora) have been talking about abstracting some of the client functionality out of the Updater itself. The biggest issue from my perspective is network IO. Teodora already made a PR to let the application download targets, but it seems like there are still issues with TUF handling metadata downloads.

Why is this needed?

  • In the real world, applications already use a network stack and will keep using it after integrating TUF: we should not force another one on them
  • Even if the application’s network stack and TUF’s are the same, the fact that they use different sessions and configurations is not great
  • Complex applications have legitimate needs to configure many things we don’t want to provide an API for: user agent, proxies, basic authentication, custom request headers. This applies to both metadata and targets
  • Complex applications have legitimate needs to control the download process (e.g. progress information, canceling)
  • Complex applications have (legitimate?) needs to poke at low level details like timeouts

Potential solutions

We identified two main solutions to this:

  1. Make a new event-based, non-blocking client API. This would be the most flexible option, but also more complex for TUF maintainers to maintain and for application developers to customize
  2. Keep the current API but add a new Fetcher interface that applications can optionally implement. This is likely fairly easy and non-invasive to implement, but remains a blocking API

I’m proposing option 2, but for reference please see the draft of option 1 as well.

Proposal

Add a Fetcher interface that applications can implement. Provide a default implementation of Fetcher. Add a new method to Updater that Fetcher can use to provide the data it fetches.

The Updater operations (refresh(), get_one_valid_targetinfo() and download_target()) will now work like this:

  • Whenever a remote file (metadata or target) is needed:
    • set up a temporary file to write results to
    • call Fetcher.fetch()
      • the fetcher calls Updater.provide_fetched_data() zero or more times to provide chunks of data; the Updater writes these chunks into the file
    • when fetch() returns without exceptions, the download is finished and has been written to the file (see the sketch below)
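
In code, the flow above might look roughly like this inside the Updater. This is a sketch only: the temp-file handling and the _fetcher/_current_file attribute names are illustrative, not part of the proposal; only fetch() and provide_fetched_data() come from the proposed interface.

import tempfile

class Updater:
    # Illustrative sketch of the Updater-internal download flow
    def _download(self, url: str, length: int) -> None:
        with tempfile.TemporaryFile() as temp_file:
            self._current_file = temp_file  # provide_fetched_data() writes here
            try:
                # Blocks until the download is complete; the fetcher calls
                # self.provide_fetched_data() with sequential chunks of bytes
                self._fetcher.fetch(url, length)
            finally:
                self._current_file = None
            # fetch() returned without exceptions: verify and persist the
            # downloaded contents of temp_file
            ...

    def provide_fetched_data(self, data: bytes) -> None:
        # Can only be called from within a Fetcher.fetch() call that this
        # Updater initiated
        self._current_file.write(data)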

This is like the go-tuf RemoteStore abstraction, with two differences:

  1. Python does not have reasonable stream abstractions like io.ReadCloser (which would actually be implemented by any of the network stacks), so we cannot return something like that: instead our implementation blocks and adds a provide_fetched_data() callback into Updater.
  2. Metadata and target fetching are not separated: this way the Fetcher does not need any understanding of TUF or server structure, it’s just a dumb downloader.

import abc

# Only new/changed methods mentioned for Updater
class Updater(object):
    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: "Fetcher" = None):
        ...

    # Accepts content of the url that is being currently fetched.
    # Can be called only from Fetcher.fetch() that this Updater called.
    def provide_fetched_data(self, data: bytes):
        ...

# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):
    # Fetches the contents of an HTTP/HTTPS url from a remote server. Calls
    # self.updater.provide_fetched_data() to forward sequential chunks of
    # bytes to the updater. Returns when the download is complete and all
    # bytes have been fed to the updater.
    @abc.abstractmethod
    def fetch(self, url: str, length: int):
        pass

    # Called by updater init
    def set_updater(self, updater: Updater):
        self.updater = updater
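
For illustration, a hypothetical application-side Fetcher built on requests might look like the sketch below. The class name and session configuration are made up; the proposal does not prescribe any particular HTTP library.

import requests

class RequestsFetcher(Fetcher):
    def __init__(self):
        # The application reuses its existing session, so proxies, auth,
        # custom headers, etc. are configured here rather than inside TUF
        self._session = requests.Session()

    def fetch(self, url: str, length: int):
        with self._session.get(url, stream=True, timeout=5) as response:
            response.raise_for_status()
            for chunk in response.iter_content(chunk_size=4096):
                # Forward each chunk to the Updater that owns this fetcher;
                # length could be used here to cap the download
                self.updater.provide_fetched_data(chunk)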

I think this is fairly straightforward to implement even without a client redesign (and it will be backwards-compatible). download.py is split into two parts: one part contains the tempfile handling bits and _check_downloaded_length() and is used by the updater itself; the rest of download.py forms the default Fetcher implementation.

Top GitHub Comments

florimondmanca commented, Nov 30, 2020

👋

@trishankatdatadog: Yes, HTTPX has timeouts enabled by default for TCP connect/read/write, as well as connection pool acquiry. They’re all 5 seconds by default, and configurable. So e.g. if the remote server takes > 5s to send a chunk after HTTPX started to .recv(), we hard-shut the connection and raise an httpx.ReadTimeout exception.
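
For illustration, those timeouts can be set when constructing the client (the values here are arbitrary):

import httpx

# 5 s default for connect/write/pool acquiry, with a stricter 2 s read timeout
client = httpx.Client(timeout=httpx.Timeout(5.0, read=2.0))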

I don’t know if this corresponds to the “slow retrieval” attack scenario here. E.g. it’s still possible for a remote server to send 1-byte chunks every 5s and be just fine as far as HTTPX is concerned.

We don’t have “write rate” or “max size” knobs built-in either. We do however provide a customization mechanism. HTTPX is actually separated in two projects: HTTPX itself, which does high-level client smarts, and HTTPCore, which does low-level HTTP networking. The interface between the two is the “Transport API”. HTTPX provides default transports, but it’s possible to switch it out for something else, such as a wrapper transport. Our docs on this are still nascent, but there are many features that can be implemented at this level. In particular anything that wants to control the flow of bytes (upload or download) would fit there very nicely. Example:

from typing import Iterator

import httpcore

class TooBig(Exception):
    pass

class MaxSizeTransport(httpcore.SyncHTTPTransport):
    def __init__(self, parent: httpcore.SyncHTTPTransport, max_size: int) -> None:
        self._parent = parent
        self._max_size = max_size

    def _wrap(self, stream: Iterator[bytes]) -> Iterator[bytes]:
        length = 0
        for chunk in stream:
            length += len(chunk)
            if length > self._max_size:
                raise TooBig()
            yield chunk

    def request(self, *args, **kwargs):
        status_code, headers, stream, ext = self._parent.request(*args, **kwargs)
        return status_code, headers, self._wrap(stream), ext


import httpx

transport = httpx.HTTPTransport()  # Default transport.
transport = MaxSizeTransport(transport, max_size=...)  # Add "max size" layer.

with httpx.Client(transport=transport) as client:
    ...

There may be some bits to work out, but if this sounds like something that could fit your use case I’d be happy to work through any specifics.

(Getting down to the transport API level may also be totally overkill. HTTPX provides a nice and simple .stream() method that returns a streaming response — you can iterate over response content and do chunk-wise operations at that level too. But if you do need that amount of low-level-ness, it’s there.)


On a separate note, I read through this thread and wondered — was it considered to have .fetch() return a bytes iterator (most typically a generator)…?

import abc
from typing import Iterator

class Fetcher(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        ...

The main benefit here is lower coupling: it removes any references to the Updater at the Fetcher level, and fits the “we’ll call you” mindset Trishank mentioned earlier.

Here’s an example implementation using HTTPX:

from typing import Iterator

import httpx

class HTTPXFetcher:
    def __init__(self):
        # v Optionally pass a custom `transport` to implement smart features,
        # like write rate or max size controls…
        self._client = httpx.Client()

    def fetch(self, url: str, length: int) -> Iterator[bytes]:
        with self._client.stream("GET", url) as response:
            for chunk in response.iter_bytes():
                # Perhaps do some checks based on `length`.
                yield chunk

The Updater can then “drive the flow” of the request, e.g. by iterating over the iterator…

import tempfile

with tempfile.SpooledTemporaryFile() as fp:
    # Drive the fetcher's flow of bytes...
    for chunk in self.fetcher.fetch(url, length):
        fp.write(chunk)
    # Persist file…

Not sure if this would fit the use case as I obviously have very limited context here 😃 For example this won’t work if you need file operations like .seek() — but thought I’d mention it. HTTPCore makes extensive use of iterators as a way to represent a “drivable” stream of bytes (both for request bodies and response bodies). In an HTTP context we really just need one-way iteration rather than full file-like controls.

joshuagl commented, Nov 30, 2020

Some great discussion here that has resulted in a refined design, thanks all! If I’m following correctly, we’ve agreed on the following to abstract out the network I/O:

import abc
from typing import BinaryIO, Optional

# Only new/changed methods mentioned for Updater
class Updater(object):
    # init now accepts an optional fetcher argument
    def __init__(self, repository_name, repository_mirrors, fetcher: Optional["Fetcher"] = None):
        ...

# New interface for applications to implement
class Fetcher(metaclass=abc.ABCMeta):
    # Fetches the contents of an HTTP/HTTPS url from a remote server and writes
    # them to the provided file-object. Returns when the download is complete
    # and all bytes have been written to the file-object.
    @abc.abstractmethod
    def fetch(self, url: str, length: int, spooled_file: BinaryIO):
        pass
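
As a minimal sketch of this refined interface, a Fetcher could be implemented with just the standard library; the class name, chunk size, and timeout below are illustrative, not agreed API.

import urllib.request
from typing import BinaryIO

class UrllibFetcher(Fetcher):
    def fetch(self, url: str, length: int, spooled_file: BinaryIO):
        with urllib.request.urlopen(url, timeout=5) as response:
            downloaded = 0
            while downloaded < length:
                # Never read (or write) more than the expected length
                chunk = response.read(min(4096, length - downloaded))
                if not chunk:
                    break
                downloaded += len(chunk)
                spooled_file.write(chunk)

Passing the open file-object into fetch() keeps the Fetcher free of both TUF specifics and any back-reference to the Updater.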