
Pipenv PEP 503 Improvement: Pipenv downloads PyTorch for all versions of Python, grabbing 16GB of data instead of just 1.7GB.

See original GitHub issue

I recently posted the correct way to install PyTorch from a PEP 503 repository with Pipenv:

https://github.com/pypa/pipenv/issues/4961#issuecomment-1045679643

There’s just one annoying issue in Pipenv: It downloads PyTorch for every version of CPython.

So let's say my project is based on pipenv install --python=3.9, and I then run the command to install PyTorch (see the guide above for details): pipenv install --extra-index-url https://download.pytorch.org/whl/cu113/ "torch==1.10.1+cu113".

Well, Pipenv then downloads every build of PyTorch into ~/.cache/pipenv: cp36, cp37, cp38, cp39 and probably a few more. Only then does it finally install the intended wheel (torch-1.10.1+cu113-cp39).

This means the download took 16 GB and 30 minutes instead of 1.7 GB and 4 minutes, wasting a ton of disk space and time on extra copies of the library for Python versions I'll never use.

I confirmed that the extra downloaded data consists of builds for other Python releases: I went into the Pipenv cache and looked inside the hashed archives to check their WHEEL metadata, and it was things like the "Python 3.6" torch build and so on.

I’m using pipenv 2022.1.8.

My guess is that Pipenv's current algorithm simply searches PEP 503 repos for packages whose names start with torch-*, downloads them ALL, and then inspects the embedded wheel metadata in every downloaded archive to figure out which one matches the installed Python version.

Can Pipenv be improved to detect the “cp39” filename hints in PEP 503 repos and only download the version that matches the installed Python version?
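
For illustration, here is a minimal sketch of that filtering idea. This is not Pipenv's actual code (and the function name compatible_wheels is my own); it just uses the packaging library, which already handles wheel filename tags for pip, to keep only the wheels the running interpreter can use:

```python
# Hypothetical sketch of the proposed optimization -- not Pipenv internals.
# Filter a PEP 503 file listing down to wheels the *running* interpreter
# can actually use, so cp36/cp37/cp38 builds are never downloaded at all.
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename


def compatible_wheels(filenames):
    """Yield only the wheel filenames whose tags match this interpreter."""
    supported = set(sys_tags())  # e.g. cp39-cp39-manylinux_2_17_x86_64, ...
    for name in filenames:
        if not name.endswith(".whl"):
            continue  # sdists etc. can't be filtered by filename tag
        _, _, _, tags = parse_wheel_filename(name)
        if tags & supported:  # any tag overlap means the wheel is installable
            yield name


# On CPython 3.9 / Linux x86_64, only the cp39 wheel survives this filter:
listing = [
    "torch-1.10.1+cu113-cp36-cp36m-linux_x86_64.whl",
    "torch-1.10.1+cu113-cp37-cp37m-linux_x86_64.whl",
    "torch-1.10.1+cu113-cp38-cp38-linux_x86_64.whl",
    "torch-1.10.1+cu113-cp39-cp39-linux_x86_64.whl",
]
print(list(compatible_wheels(listing)))
```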

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14

Top GitHub Comments

1 reaction
matteius commented, Feb 19, 2022

@Bananaman Thanks for your feedback. I am still pretty new to this code base, but from what I gather about the dependency resolution, this may require an upstream change somewhere. I think this is a good discussion, though, and it could lead to some improvements.

1 reaction
Bananaman commented, Feb 19, 2022

@matteius

I think I see now that the reason you are using the other package server is that you are looking for a CUDA-specific version of torch that is not on PyPI?

Yeah, my card requires PyTorch built for CUDA Toolkit 11.x, which can only be found at the PyTorch repository.

I believe the issue here is that the private server https://download.pytorch.org/whl/cu113/ isn't returning the package hashes directly, so pipenv downloads everything to generate them. If the packages were on PyPI, my understanding is that the API would return the metadata and nothing would need to be downloaded.

Well, there are two issues here:

  1. It downloads every package (cp36, cp37, cp38, cp39 and seemingly a few others since the total ended up at 16 GB). It only needs to download cp39 (1.7 GB) no matter what command I give it, since my Python interpreter in that Pipenv folder is Python 3.9. The other packages that pipenv downloaded aren’t even compatible with my Python version. So an optimization would be to filter out the other “cp##” versions and not even download/consider them at all. That’s the main issue here.
  2. The second issue is the one you mention: Pipenv doesn't know what hashes the files on the server have, so any future re-installation may need the packages to be downloaded and hashed again to check for changes (see the sketch after this list). That would trigger the huge wait times from the first issue all over again, since everything would re-download (ouch). And every time the project's CUDA / PyTorch version is updated, it would again mean a ~16 GB download of every build of the next Torch version… ouch. The pipenv cache would grow extremely large after just a few versions of Torch.
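
To make issue 2 concrete: a PEP 503 "simple" page is just a list of HTML anchors, and the spec allows each link to carry a #sha256=<digest> fragment. The sketch below is my own illustration (not Pipenv code; the class name SimpleIndexParser is made up) of how a client could collect hashes straight from the index when they are embedded; when they aren't, as appears to be the case here, downloading and hashing locally is the only option:

```python
# Hypothetical sketch: read a PEP 503 simple-index page and collect any
# sha256 digests embedded in the link fragments.  Files without a digest
# would have to be downloaded and hashed locally.
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.request import urlopen


class SimpleIndexParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.files = {}  # filename -> sha256 hex digest, or None if absent

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = urlsplit(href)
        filename = parts.path.rsplit("/", 1)[-1]
        fragment = parts.fragment  # e.g. "sha256=abc123..."
        digest = fragment.split("=", 1)[1] if fragment.startswith("sha256=") else None
        self.files[filename] = digest


# Usage sketch against the index from this issue (requires network access):
parser = SimpleIndexParser()
with urlopen("https://download.pytorch.org/whl/cu113/torch/") as resp:
    parser.feed(resp.read().decode("utf-8", errors="replace"))
missing = [f for f, digest in parser.files.items() if digest is None]
print(f"{len(missing)} of {len(parser.files)} files have no embedded sha256 hash")
```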

The best fix would be: "if running under CPython, look for a matching identifier such as 'cp39' in the package filenames, and only download the files where that identifier is found".

The -cp39- part isn't just convention: wheel filenames follow the format defined in PEP 427, packagename-version(-build)-pythontag-abitag-platformtag.whl, and the tags themselves (cp39 = CPython 3.9) are standardized by PEP 425. So when a filename in a PEP 503 listing carries a -cp36- tag and the project's interpreter is CPython 3.9, we instantly know that file can be skipped.

There's lots of room for improvement in Pipenv's PEP 503 support. Phase 1 could be "skip every -cp##- version that doesn't match ours". Phase 2 would be to skip every packagename-version that wasn't requested (no need to download 1.10.2 if 1.10.2+cu113 was requested). Phase 3 would be to skip every platform (e.g. Linux, macOS, etc.) that your system doesn't have.
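
Continuing the earlier sketch (again just an illustration with the packaging library, not Pipenv internals, and wanted_wheels is a made-up name), phase 2 is a version check against what was requested, and phase 3 falls out of the same tag check that handles the cp## filter, since sys_tags() only lists platforms this machine supports:

```python
# Hypothetical sketch of phases 2 and 3 layered on top of the cp## filter.
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename
from packaging.version import Version


def wanted_wheels(filenames, requested_version):
    requested = Version(requested_version)  # e.g. "1.10.1+cu113"
    supported = set(sys_tags())
    for name in filenames:
        if not name.endswith(".whl"):
            continue
        _, version, _, tags = parse_wheel_filename(name)
        if version != requested:      # phase 2: plain 1.10.1 is skipped
            continue
        if not (tags & supported):    # phases 1 + 3: interpreter and platform
            continue
        yield name


# Example listing (made-up filenames): on a CPython 3.9 Linux x86_64 box,
# only the +cu113 cp39 Linux wheel is kept; the CUDA-less build and the
# macOS wheel are never considered.
example = [
    "torch-1.10.1-cp39-cp39-manylinux1_x86_64.whl",
    "torch-1.10.1+cu113-cp39-cp39-linux_x86_64.whl",
    "torch-1.10.1+cu113-cp39-cp39-macosx_10_9_x86_64.whl",
]
print(list(wanted_wheels(example, "1.10.1+cu113")))
```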

The most important thing would be to skip the other -cp##- versions because that’s a huuuuge amount of data to download.

How feasible is it that Pipenv can be extended to filter out useless downloads? Hopefully the internal code isn’t too rigid.

Read more comments on GitHub
