optimize package installation for space and speed by using copy-on-write file clones ("reflinks") and storing wheel cache unpacked

What’s the problem this feature will solve?

Creating a new virtual environment in a modern Python project can be quite slow once you have a lot of dependencies, sometimes on the order of tens of seconds even on very high-end hardware. It also takes up a lot of space: my ~/.virtualenvs/ is almost 3 gigabytes on a relatively new machine, and that isn’t even counting my ~/.local/pipx, which is another 434M.

Describe the solution you’d like

Rather than unpacking and duplicating all the data in wheels, pip could store the cache unpacked, so all the files are already on the filesystem, and then clone them into place on copy-on-write filesystems rather than copying them. There may be other bottlenecks, but this should speed up installation and would also reduce disk usage by an order of magnitude. (My ~/Library/Caches/pip is only 256M, and presumably all those virtualenvs contain multiple full, uncompressed copies of it!)
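
To make "clone rather than copy" concrete, here is a minimal sketch of a reflink attempt with a fallback. It assumes Linux, where the `FICLONE` ioctl asks the filesystem (for example Btrfs or XFS) to share the data blocks between two files; the helper name `clone_or_copy` is hypothetical and is not part of pip or the standard library. On any other platform or filesystem the ioctl is expected to fail and the sketch falls back to an ordinary copy.

```python
import shutil

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows
    fcntl = None

# Linux ioctl request number for FICLONE (from <linux/fs.h>).
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Try to reflink src to dst; fall back to a regular copy.

    Returns "reflink" or "copy" to show which path was taken.
    """
    if fcntl is not None:
        try:
            with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
                # Ask the filesystem to make dst share src's data blocks.
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            shutil.copystat(src, dst)
            return "reflink"
        except OSError:
            # Unsupported filesystem or platform: fall through to a copy.
            pass
    shutil.copy2(src, dst)
    return "copy"
```

A cloned file behaves like an independent copy: writing to either file later only duplicates the blocks that change, which is what keeps environments isolated while saving space.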

Alternative Solutions

You could get a similar reduction by setting up an import hook, using zipimport, or doing some kind of .pth file shenanigans, but I feel like those all have significant drawbacks.
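
For reference, the zipimport alternative can be sketched as follows: Python imports pure-Python modules directly from a zip archive once the archive is on `sys.path`. The helper names and the demo module are hypothetical, for illustration only. One drawback is visible even here: this route does not work for compiled extension modules, which cannot be loaded from inside a zip.

```python
import importlib
import os
import sys
import zipfile


def build_demo_zip(directory, module_name, source):
    """Write a zip archive containing one pure-Python module."""
    zip_path = os.path.join(directory, module_name + "_archive.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr(module_name + ".py", source)
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path and import a module out of it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```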

Additional context

Given that platforms generally use shared memory-maps for shared object files, if it’s done right this could additionally reduce the memory footprint of python interpreters in different virtualenvs with large C extensions loaded.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 30 (20 by maintainers)

Top GitHub Comments

3 reactions
dstufft commented, May 11, 2022

I don’t think you can solve this in the virtual environment abstraction? At least I’m not sure how you’re envisioning that working? The virtual environment abstraction largely is just setting up sys.path; how things get installed onto that sys.path isn’t really its concern, unless you have something else in mind that I’m not thinking of? Solving it there also doesn’t solve it for cases that aren’t inside of a virtual environment.

I think the only reasonable path here is pretty straightforward:

  1. Within the wheel cache, stop caching zipped up wheels, unpack them and cache them unpacked.
  2. Start caching things that we’ve downloaded as wheels within the wheel cache, unpacked as well.
    • This might mean that we want to stop caching downloads in the HTTP cache completely, since they’ll always be cached inside of the wheel cache. Though maybe we would still want HTTP caching for sdists? I dunno.
  3. Adjust wheel installing so that instead of operating on a zipped wheel, it operates on an unzipped wheel, and uses shutil.copytree to copy out of the wheel cache.
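
The three steps above can be sketched roughly as follows. This is not pip's implementation: the function names and the cache layout (one directory per wheel filename) are assumptions for illustration, and a real installer also has to handle `.data` directories, entry-point scripts, and `RECORD` rewriting. It relies on `dirs_exist_ok`, which needs Python 3.8 or later.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_into_cache(wheel_path, cache_root):
    """Unpack a wheel into the cache once; later calls reuse the result."""
    wheel_path = Path(wheel_path)
    cache_root = Path(cache_root)
    cache_root.mkdir(parents=True, exist_ok=True)
    # Hypothetical layout: <cache_root>/<wheel filename without .whl>/
    target = cache_root / wheel_path.stem
    if not target.exists():
        tmp = target.with_name(target.name + ".tmp")
        with zipfile.ZipFile(wheel_path) as zf:
            zf.extractall(tmp)
        # Rename last so a partially unpacked wheel is never used.
        tmp.rename(target)
    return target


def install_from_cache(unpacked_dir, site_packages):
    """Copy an unpacked wheel into site-packages with shutil.copytree."""
    return shutil.copytree(unpacked_dir, site_packages, dirs_exist_ok=True)
```

Because the install step is just `shutil.copytree`, any faster copy strategy that `shutil` gains later would apply here without changes to the installer.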

This has some immediate benefits:

  • Right away installation of a cached wheel gets faster, because instead of copying data out of a zip file, decompressing it, then writing it to disk, we rely on shutil to use the most efficient way to copy the file.
  • We make our caching more consistent: we no longer have some wheels cached inside of the HTTP cache and some wheels cached inside of the wheel cache. They’re all just cached inside of the wheel cache.
    • We maybe still cache sdists in the HTTP cache, but at least that is an entirely different format.

With some immediate downsides:

  • Old wheel cache is no longer useful
  • We’re storing the cache uncompressed, so it will take up more room than storing it compressed.

Then it also has some longer term benefits:

  • If/when reflink support gets added to shutil, we should just automatically start taking advantage of it when possible, and in general we get automatic improvements from shutil (e.g. even without reflink, once it starts using os.copy_file_range that’s an additional speed up).
  • It makes it really easy to add features to allow people to opt into additional performance enhancements that change the semantics of an install. For instance, we could add a flag that would attempt to use hard links or symlinks if possible, which breaks the virtual environment isolation if people are editing installed modules, but which would be another large speed up and space savings. That would be implementable by just passing in a different copy_function to shutil.copytree.
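
As an illustration of the last point, a hard-link variant only needs a different `copy_function`. This is a sketch under assumptions: the helper names are hypothetical, the destination files are assumed not to exist yet, and hard links only work when the cache and the environment are on the same filesystem, so the helper falls back to a regular copy otherwise.

```python
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to a copy if linking fails."""
    try:
        os.link(src, dst)
    except OSError:
        # Different filesystem, or links unsupported: copy instead.
        shutil.copy2(src, dst)
    return dst


def install_with_links(unpacked_dir, site_packages):
    """Populate site-packages with hard links into the unpacked cache."""
    return shutil.copytree(
        unpacked_dir,
        site_packages,
        copy_function=link_or_copy,
        dirs_exist_ok=True,  # Python 3.8+
    )
```

The trade-off described above applies directly: a linked file is the same file as the cached one, so editing an installed module in place would change it for every environment that links to it.
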
1 reaction
RobertRosca commented, Dec 1, 2022

I did work on a proof of concept that tries to solve this issue in a slightly different way. It uses installer to implement a basic wheel installer that installs packages to multi-site-packages/{package_name}/{package_version}. Instead of putting reflinks/symlinks to packages inside the site-packages directory of a venv, it relies on a custom importlib finder which reads a lockfile and inserts the path to the requested version of the package into sys.path before importing.
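
The finder idea can be sketched as follows. This is not the proof of concept's code: the class name, the JSON lockfile format (a mapping from import name to version), and the assumption that the import name equals the package name are all made up for illustration.

```python
import json
import sys
from importlib.abc import MetaPathFinder
from pathlib import Path


class LockfilePathFinder(MetaPathFinder):
    """Add the pinned version's directory to sys.path on first import."""

    def __init__(self, store, lockfile):
        # store layout (assumed): <store>/<package_name>/<package_version>/
        self.store = Path(store)
        # lockfile format (assumed): {"package_name": "version", ...}
        self.pins = json.loads(Path(lockfile).read_text())

    def find_spec(self, fullname, path=None, target=None):
        top_level = fullname.partition(".")[0]
        version = self.pins.get(top_level)
        if version is not None:
            entry = str(self.store / top_level / version)
            if entry not in sys.path:
                sys.path.append(entry)
        # Return None so the standard path-based finder does the import.
        return None
```

To activate it, the finder would be placed ahead of the default ones, for example with `sys.meta_path.insert(0, LockfilePathFinder(store, lockfile))`.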

Made a post on the Python forums here if anybody would like to join the discussion.
