Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

A `pip resolve` command to convert to transitive == requirements very fast by scanning wheels for static dependency info (WORKING PROTOTYPE!)

See original GitHub issue

Please let me know if it would be more convenient to provide this issue in another form such as a google doc or something!

What’s the problem this feature will solve?

At Twitter, we are trying to enable the creation of self-bootstrapping “ipex” files, executable zip files of Python code which can resolve 3rdparty requirements when first run. This approach greatly reduces the time to build, upload, and deploy compared to a typical PEX file, which contains all of its dependencies in a single monolithic zip archive created at pex build time. The implementation of “ipex” in pantsbuild/pants#8793 (more background at that link) will invoke pex at runtime, which will itself invoke a pip subprocess (since pex version 2) to resolve these 3rdparty dependencies. #7729 is a separate performance fix to enable this runtime resolve approach.

Because ipex files do not contain their 3rdparty requirements at build time, it’s not necessary to run the entirety of pip download or pip install. Instead, in pantsbuild/pants#8793, pants will take all of the requirements provided by the user (which may include requirements with inequalities, or no version constraints at all), then convert to a list of transitive == requirements. This ensures that the ipex file will resolve the same requirements at build time and run time, even if the index changes in between.

Describe the solution you’d like

A pip resolve command with similar syntax to pip download, which instead writes a list of == requirement strings, each with a single download URL, to stdout, corresponding to the transitive dependencies of the input requirements. These download URLs correspond to every file that would have been downloaded by pip download.

pants would be able to invoke pip resolve as a distinct phase of generating an ipex file. pex would likely not be needed to intermediate this resolve command – we could just execute pip resolve directly as a subprocess from within pants. The pants v2 engine makes process executions individually cacheable, and transparently executable on a remote cluster via the Bazel Remote Execution API, so pants users would then be able to generate these “dehydrated” ipex files at extremely low latency if the pip resolve command can be made performant enough.

Alternative Solutions / Prototype Implementation

As described above, pantsbuild/pants#8793 is able to create ipex files already, by simply using pip download via pex to extract the transitive == requirements. The utility of a separate pip resolve command, if any, would lie in whether it can achieve the same end goal of extracting transitive == requirements, but with significantly greater performance.

In a pip branch I have implemented a prototype pip resolve command which is able to achieve an immediate ~2x speedup vs pip download on the first run, before almost immediately levelling out to 800ms on every run afterwards.

This performance is achieved with two techniques:

Extracting the contents of the METADATA file from a url for a wheel without actually downloading the wheel at all.

_hacky_extract_sub_reqs() (see https://github.com/cosmicexplorer/pip/blob/a60a3977e929cfaed6d64b0c9e3713d7c502e51e/src/pip/_internal/resolution/legacy/resolver.py#L550-L552) will: a. send a HEAD request to get the length of the zip file b. perform several successive GET requests to extract the relative location of the METADATA file c. extract the DEFLATE-compressed METADATA file and INFLATE it d. parse all Requires-Dist lines in METADATA for requirement strings
This is surprisingly reliable, and extremely fast! This makes pip resolve tensorflow==1.14 take 15 seconds, compared to 24 seconds for pip download tensorflow==1.14.
A URL to a non-wheel file is processed the normal way – by downloading the file, then preparing it into a dist.

Caching the result of each self._resolve_one() call in a persistent json file.

RequirementDependencyCache implements this (see https://github.com/cosmicexplorer/pip/blob/a60a3977e929cfaed6d64b0c9e3713d7c502e51e/src/pip/_internal/resolution/legacy/resolver.py#L240).
This is keyed by RequirementConcreteUrl (see https://github.com/cosmicexplorer/pip/blob/a60a3977e929cfaed6d64b0c9e3713d7c502e51e/src/pip/_internal/resolution/legacy/resolver.py#L187), which is a pairing of an == Requirement with a url that it can be downloaded from.
- Therefore the cache file’s information will remain correct over time, as long as the indices only allow publishing a single version of a package exactly once.
This causes pip resolve invocations to stay at ~800-900ms in a “no-op” case when every transitive requirement is in the cache.
Both wheel and non-wheel requirements are cached.
This cache is also populated by pip download and everything else calling Resolver.resolve(self, ...), but only pip resolve will actually consume the cache in the current prototype.
If a user runs pip resolve once, then runs it again with a single input requirement string changed, most of the transitive requirements will remain cached, avoiding the need to make any network requests except to update the changed transitive requirements.

Additional context

This pip resolve command as described above (with the resolve cache) would possibly be able to resolve this long-standing TODO about separating dependency resolution from preparation, without requiring any separate infrastructure changes on PyPI’s part: https://github.com/pypa/pip/blob/f2fbc7da81d3c8a2732d9072219029379ba3bad5/src/pip/_internal/resolution/legacy/resolver.py#L158-L160

I have only discussed the single “ipex” motivating use case here, but I want to make it clear that I am making this issue because I believe a pip resolve command would be generally useful to all pip users. I didn’t implement it in the prototype above, but I believe that after the pip resolve command stabilizes and any inconsistencies between it and pip download are worked out, it would likely be possible to make pip download consume the output of pip resolve directly, which would allow removal of the if self.quickly_parse_sub_requirements conditionals added to resolver.py, as well as (probably) improve pip download performance by waiting to download every wheel file in parallel after resolving URLs for them with pip resolve!

For that reason, I think a pip resolve command which can quickly resolve URLs for requirements before downloading them is likely to be a useful feature for all pip users.

I am extremely open to designing/implementing whatever changes pip contributors might desire in order for this change to go in, and I would also fully understand if this use case is something pip isn’t able to support right now.

Issue Analytics

State:
Created 4 years ago
Reactions:3
Comments:44 (43 by maintainers)

Top GitHub Comments

7reactions

pradyunsgcommented, Jul 15, 2020

Let me know if people aren’t ready for this yet

FWIW, I hadn’t communicated w/ everyone to figure out how we would be picking up the various parts of this task and implementing things.

@cosmicexplorer #8448 is a fairly large PR and I have a very strong bias toward keeping PRs smaller and following a change-a-single-thing per PR/commit style, to make it easier to review. IIUC, there’s 2 functional changes in this PR that we should break off into separate PRs:

partial wheel metadata download support
parallelization of wheel downloads (“hydration” here)

Notably, there’s also a logical/semnatic change in this PR – we’re no longer guaranteeing that the requirement_set returned after Resolver.resolve can be used immediately for installation (technically, we don’t do that today, but that’s because of sdists, not wheels).

I suggest we break that up into smaller chunks, that we tackle one-by-one:

partial wheel downloads for metadata (keeping “download entire file” as part of prepare)
move out “download entire file” out of prepare (and hence, Resolver.resolve)
parallelize download of multiple files

Right now, @McSinyx would be updating #8532.

I think we should probably have a couple of follow up PRs to (a) refactor/move the logic for “download entire file” and then (b) “new feature” implementation to parallelize those downloads (i.e. considering user-facing behavior, output etc). After #8532 is finalized, the main blocker on that front for #53/#7819, would be moving the “download entire file” logic out of the resolver’s scope. For @McSinyx’s GSoC project, the parallelization of the downloads (and the corresponding UI / UX work) would be the next big-fish task for them to work on.

Here’s how I suggest we move forward on the overlap of this issue and @McSinyx’s GSoC project:

@McSinyx updates #8532 to implement 1. above, according to the approach outlined in the aforementioned comment
@cosmicexplorer (or @McSinyx, if @cosmicexplorer isn’t available) refactors the logic for “download entire file” out of the resolver’s scope.
- once this is done, we can move on to other parts of #7819, like the dependency cache or #53 even.
@McSinyx works on figuring out how download parallelization would be presented to the user and modelled in pip’s codebase.

Does that sound like a reasonable plan to going forward @cosmicexplorer @McSinyx @pypa/pip-committers? I don’t want folks stepping on each other’s toes. 😃

5reactions

cosmicexplorercommented, Sep 1, 2020

Yes! And that is exactly the solution that people at Twitter including @kwlzn have proposed that we use to solve this. My interest in the client side approach is that it solves the problem for other people using tensorflow at large corporations who don’t pull from PyPI. We host an Artifactory instance, and I haven’t yet delved into how easy it would be to make the modifications to support the METADATA files as in the warehouse PR.

It seems to me that both of these approaches, when shipped to production, would likely have similar performance characteristics and produce the same result. I expect the PyPI change might end up being faster in the end, but I don’t know if, for example, some file contents get cached by the web server, and until most people are using the METADATA approach, it might end up being faster to pull tensorflow’s METADATA directly from the zip for that reason.

If this becomes outclassed by the working PyPI solution, I believe it still might not be replaceable for people who for whatever reason don’t have control of where they download their wheels from (and therefore can’t get a resolve using the metadata info). I don’t know how many of these people there are.

the comment I have on partial downloads is that I think those essentially cannot be verified – there is no way to know if you are being served the metadata that was originally created. In fact there’s no way to know if the metadata would even match the file hash included in the index file: a distribution mirror could serve whatever metadata it wants and it would get processed… Not sure if there’s a practical attack here but the possibility seems real.

So this is an extremely reasonable concern, and my first thought is that if we’re thinking about adding METADATA files to PyPI that those would probably have checksummed urls too? So the more canonical warehouse approach seems like it would be beneficial for security and that would be a great reason to retire this in favor of that once it gets going.

Separately, however, I’m not entirely sure how, if you have known checksums for wheels, that you could possibly avoid eventually checking those checksums during a pip resolve. The prototype I’ve implemented (which I’ve been meaning to spend more time on recently) will use the metadata information to pull down URLs to download everything from along the way, then download all the wheels in parallel at the end, presumably checking checksums, although I need to verify that.

I am working on another approach (it works too) that modifies pip to write resolve URLs to stdout instead of actually downloading anything, and then downloads them all in one go in parallel, when the application is first started. By not downloading the wheels and checking the checksums in the same pip invocation that gets the URLs, I can definitely see a potential avenue for exploitation. However, we still have to pull down wheels in the end, and pex just uses pip to resolve now, so it should be checking checksums in the same places where pip does.

I’m vaguely familiar with where checksum validation happens in pip but not enough to answer more confidently. I think security should generally be a huge concern when proposing massive changes to pip resolves and I think that it needs a little more research on my part to be able to say more confidently that it’s not going to introduce a huge issue.

EDIT: One last possible twist on this is that along with the zipfile-searching part of this PR, it also adds a cache of dependencies, keyed by the requirement download URL, serialized in a json file, and stored across pip runs. If we wanted to methodically address the checksumming issue, it’s possible we could store checksums from previous downloads there. That code is hairy and needs to be replaced anyway though, and I’m not sure how big that json file would get over time especially if we started adding longer strings to it. It would would be at best a workaround for the problem – I believe the known attack vector of pulling a newly released version of a dependency from PyPI would reliably avoid the json cache, so it’s not a solution here.

Top Results From Across the Web

Dependency Resolution - pip documentation v22.3.1

The process of determining which version of a dependency to install is known as dependency resolution. This behaviour can be disabled by passing...

Dependency Management with Pip, Python's Package Manager

Pip is Python's package manager, providing essential core features for installing and maintaining Python packages and their dependencies.

playsound is relying on another python subprocess. please ...

pypa/pipA `pip resolve` command to convert to transitive == requirements very fast by scanning wheels for static dependency info (WORKING PROTOTYPE!)#7819.

pip-audit - PyPI

pip -audit ... pip-audit is a tool for scanning Python environments for packages with known vulnerabilities. It uses the Python Packaging Advisory Database...

pip install — pip 10.0.0.dev0 documentation

Identify the base requirements. The user supplied arguments are processed here. Resolve dependencies. What will be installed is determined here. Build wheels.