Pip resolver should prefer cause of conflicts when backtracking
What’s the problem this feature will solve?
This can drastically improve performance on real-world dependency conflicts where pip needs to backtrack. Specifically, it fixes https://github.com/pypa/pip/issues/10201
Describe the solution you’d like
When you have dependencies on packages A, B, and C, and both A and B depend on X, but the latest versions of A and B depend on mutually exclusive versions of X, pip should prefer resolving A and B rather than trying to resolve C. This is for 2 reasons:
- In the real world, if some package Foo version n depends on some package Bar, then Foo version n-1 is likely to depend on Bar as well. Therefore, in the above example it makes sense to focus on A and B, as they need to settle on a version of X that both accept
- It is intuitive to end users that packages which are causing the conflict are the ones pip should try to resolve, not some package which is not part of the current conflict
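The ordering described above can be sketched as a sort key. This is only a minimal illustration, not pip's actual get_preference: the function name conflict_first_key, the backtrack_causes set, and the user_order mapping are all made up for the example.

```python
from typing import Dict, Set, Tuple


def conflict_first_key(
    identifier: str,
    backtrack_causes: Set[str],  # hypothetical: names involved in the current conflict
    user_order: Dict[str, int],  # hypothetical: position in the user's requirements
) -> Tuple[bool, int]:
    """Sort key where lower sorts first: conflict causes, then user order."""
    in_conflict = identifier in backtrack_causes
    return (not in_conflict, user_order.get(identifier, len(user_order)))


requirements = ["C", "A", "B"]
user_order = {name: position for position, name in enumerate(requirements)}
causes = {"A", "B"}  # A and B disagree about X

ordered = sorted(
    requirements,
    key=lambda name: conflict_first_key(name, causes, user_order),
)
print(ordered)  # A and B come before C
```

User order is kept as the tie-breaker, so it still decides the order among packages that are all inside, or all outside, the conflict.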
Alternative Solutions
There are probably clever graph theory / dependency tree techniques that can improve general performance here. This however is a very small change that only slightly alters the behavior of get_preference.
Additional context
I will submit PRs based on the following diff: https://github.com/notatallshaw/pip/compare/21.2.3...notatallshaw:third_attempt_at_prefer_non_conflicts
However, I created this issue to convince the pip maintainers first, as there is only a limited amount of evidence I can give and it is based on real-world reports (i.e. anecdotal reports; I have about 9 reproducible examples from people reporting issues to pip’s GitHub. If you have more, or know where I can find more examples, please let me know). In particular, I do not have any test cases because:
- There are no existing unit tests for get_preference
- As best as I can tell there are no existing functional tests which infer the behavior of get_preference
- As best as I can tell there are no existing performance tests along the lines of “given this dependency tree how many times does pip have to backtrack”
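A performance test of the kind described in the last bullet could look roughly like the sketch below. It is a toy depth-first resolver that counts discarded candidates. It is not resolvelib or pip's resolver, and the INDEX data, the resolve function, and the prefer_causes flag are all hypothetical names invented for the illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy index: name -> {version: {dependency: allowed versions}},
# versions listed newest first. This is not pip's real data model.
Index = Dict[str, Dict[int, Dict[str, Set[int]]]]

INDEX: Index = {
    "A": {2: {"X": {2}}, 1: {"X": {1}}},
    "B": {2: {"X": {1}}, 1: {"X": {1}}},
    "C": {3: {}, 2: {}, 1: {}},
    "D": {3: {}, 2: {}, 1: {}},
    "X": {2: {}, 1: {}},
}


def resolve(
    index: Index, roots: List[str], prefer_causes: bool
) -> Tuple[Optional[Dict[str, int]], int]:
    """Toy depth-first resolver; returns (pins, number of discarded candidates)."""
    discarded = 0
    causes: Set[str] = set()  # names seen in any conflict so far

    def blockers(name: str, version: int, pins: Dict[str, int]) -> Set[str]:
        """Pinned packages that are incompatible with this candidate."""
        found = set()
        for other, other_version in pins.items():
            wanted = index[other][other_version].get(name)
            if wanted is not None and version not in wanted:
                found.add(other)
        for dep, wanted in index[name][version].items():
            if dep in pins and pins[dep] not in wanted:
                found.add(dep)
        return found

    def solve(pins: Dict[str, int], todo: List[str]) -> Optional[Dict[str, int]]:
        nonlocal discarded
        if not todo:
            return pins
        if prefer_causes:
            # Stable sort: known conflict causes first, otherwise the given order.
            name = sorted(todo, key=lambda n: n not in causes)[0]
        else:
            name = todo[0]
        rest = [n for n in todo if n != name]
        for version, deps in index[name].items():
            found = blockers(name, version, pins)
            if found:
                causes.update(found | {name})
                discarded += 1
                continue
            new_todo = rest + [d for d in deps if d not in pins and d not in rest]
            result = solve({**pins, name: version}, new_todo)
            if result is not None:
                return result
            discarded += 1  # this pin led to a dead end further down
        return None

    pins = solve({}, list(roots))
    return pins, discarded


roots = ["A", "C", "D", "B"]
baseline_pins, baseline_discards = resolve(INDEX, roots, prefer_causes=False)
causes_pins, causes_discards = resolve(INDEX, roots, prefer_causes=True)
print(baseline_pins == causes_pins, baseline_discards, causes_discards)
```

On this hand-built index both orderings reach the same pins, and the run that prefers conflict causes discards fewer candidates because it stops cycling through versions of C and D once A, B, and X are known to be in conflict. A single constructed case like this says nothing about real-world trees, where the ordering could equally well hurt.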
So given that let me explain what limitations I think there are to this approach:
Real World Limitations
In my testing, the biggest limitation I found was the case where the pip resolver has already pinned one of the failing causes long before the failure happens; this results in the resolver backtracking for a long period of time.
This can be shown with the requirement apache-airflow[all]==1.10.13, where one of the causes of the failure is moto. However, moto is pinned by the pip resolver very early on and stays pinned for a long time before it gets unpinned, so alternative versions of moto cannot be explored until pip has spent a long time backtracking.
This situation is no worse than with the current resolver, and I actually think this modification will make it orders of magnitude faster than the current resolver (though that might mean billions of years to resolve instead of a heat-death-of-the-universe time frame).
Theoretical limitations
Fundamentally this change is just to get_preference, and therefore to the order in which packages are chosen when backtracking, so it will be possible to construct a dependency tree that resolves more slowly under this change than under the current resolver.
I have thought of a possible real world scenario where this might happen: You require packages A and B. A has a complex dependency tree, and between A version n-1 and A version n that dependency tree gets completely changed. B is completely incompatible with the deeper dependencies of A version n and we must backtrack to A version n-1 to find a solution.
Through luck the current resolver might backtrack on the right path and resolve quickly, whereas in this case focusing on the failures between B and A’s dependencies might cause a long backtrack, because focusing on the failures here is a red herring and you need to backtrack all the way to A version n-1.
Though I have not found any real world examples of this scenario where focusing on the failures is a red herring, I am sure with enough time and Python projects someone will eventually find an example.
Code of Conduct
- I agree to follow the PSF Code of Conduct.
Top GitHub Comments
A further limitation of this change is that it takes away some of the power of user ordering when backtracking. If users perfectly construct the order of their requirements file, this approach will partially disrupt that when backtracking.
That said, the user-ordering feature is largely undocumented, and I suspect you would be hard pressed to 1) find anyone actually using it and, even if you did, 2) find a situation with a complex backtracking problem that the user could solve themselves by changing the order.
Okay here are the 3 pull requests that make up the version of the resolver I have been testing:
Please let me know what you think and if there’s anything more I can do to convince you that this is a good solution.