Ensure 'pip wheel' can create .so artifacts deterministically
What’s the problem this feature will solve?
The Bazel build system has the major selling point of supporting both local and remote caching.
In order for that caching to work though, Bazel targets must be built deterministically so that the same target always has the same content-addressable hash.
Currently pip wheel is non-deterministic, so our Python Bazel targets will cache miss if they depend on something built with pip wheel.
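To make the cache-miss mechanism concrete, here is a minimal Python sketch. The artifact bytes and directory names are made up for illustration and do not come from a real build. Two artifacts that differ only in an embedded build path get different SHA-256 digests, so a content-addressed cache treats them as different inputs.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the hex SHA-256 digest, the kind of key a content-addressed cache uses."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical artifact bytes: identical payload, different embedded build directory.
payload = b"\x7fELF" + b"compiled code placeholder"
build_a = payload + b"/tmp/pip-wheel-aaaa1111/pyyaml\x00"
build_b = payload + b"/tmp/pip-wheel-bbbb2222/pyyaml\x00"

# The digests differ, so a cache keyed on them would miss.
same_key = content_digest(build_a) == content_digest(build_b)
print(same_key)  # False
```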
Describe the solution you’d like
Note: The following is the output of a Bazel execution log. It is a bit unrelated to the pip wheel command but shows the relevant information.
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/LICENSE"
  digest {
    hash: "a2adb9c959b797494a0ef80bdf60e22db2749ee3e0c0908556e3eb548f967c56"
    size_bytes: 1101
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/METADATA"
  digest {
    hash: "df7bc0c7cbd2ce350c5c61ceda3a74bbcb6f82446a7c01f7f8e1034a98df231f"
    size_bytes: 1704
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/RECORD"
  digest {
    hash: "6fe803b74ab4fcab1f23e96060cf062d12779598af7e72692c492c2dd7cad0ed"
    size_bytes: 1701
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/WHEEL"
  digest {
    hash: "cdf2c8f141bc498ae490a88870d655dd174abe3db8c1f57562224b168930c624"
    size_bytes: 104
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/top_level.txt"
  digest {
    hash: "ae98f42153138ac02387fd6f1b709c7fdbf98e9090c00cfa703d48554e597614"
    size_bytes: 11
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/_yaml.cpython-36m-x86_64-linux-gnu.so"
  digest {
    hash: "a7f3774015f839ccee5e2281bbfdf22a42e0e1dafaac33ef4c91db83a07210d9"
    size_bytes: 1133288
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/yaml/__init__.py"
  digest {
    hash: "2af8b6dbcb1df5c63597f215421cad02f2317e291061b181b0f7bbf4f71ac0dd"
    size_bytes: 12012
    hash_function_name: "SHA-256"
  }
}
The log above shows a subset of the build outputs of the PyYAML package. Of these outputs, the RECORD file and the _yaml.cpython-36m-x86_64-linux-gnu.so shared object file have hashes that differ from build to build. I have inspected the RECORD file and found that it contains the hash of the .so file, so it is non-deterministic because of the .so file, and I think only because of that.
So the problem is the .so file.
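The dependency of RECORD on the .so file can be checked by hand. The sketch below assumes the standard wheel RECORD layout (CSV rows of path, `sha256=` followed by the unpadded URL-safe base64 digest, and size). The file name and contents are placeholders, not data from the actual PyYAML build.

```python
import base64
import csv
import hashlib
import io


def record_hash(data: bytes) -> str:
    """Hash a file's bytes the way wheel RECORD entries encode it."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=")
    return "sha256=" + encoded.decode("ascii")


def parse_record(text: str) -> dict:
    """Map each path listed in a RECORD file to its (hash, size) columns."""
    rows = csv.reader(io.StringIO(text))
    return {row[0]: (row[1], row[2]) for row in rows if row}


# Placeholder contents standing in for the compiled extension module.
so_bytes = b"placeholder for the shared object contents"
record_text = (
    f"_yaml.so,{record_hash(so_bytes)},{len(so_bytes)}\n"
    "PyYAML-5.1.dist-info/RECORD,,\n"
)

entries = parse_record(record_text)
# The RECORD entry matches only while the .so bytes are unchanged.
matches = entries["_yaml.so"][0] == record_hash(so_bytes)
changed = entries["_yaml.so"][0] == record_hash(so_bytes + b"/tmp/other")
print(matches, changed)  # True False
```

Because RECORD stores the hash of every file in the wheel, any byte that changes in the .so also changes RECORD, which is consistent with the observation above.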
I ran the strings program on the .so file and found this printable string: /tmp/pip-wheel-_bd8v3f2/pyyaml. That path is the temporary directory pip builds the wheel in.
So while I found other differences between builds of _yaml.cpython-36m-x86_64-linux-gnu.so, this leaked temporary directory path is by itself sufficient to break determinism.
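A check similar to the strings inspection can be scripted. This is a rough sketch that assumes the leaked paths start with /tmp/pip-, as in the string found above. The sample bytes are invented to stand in for a real shared object.

```python
import re
from typing import List

# Matches paths such as /tmp/pip-wheel-_bd8v3f2/pyyaml inside binary data.
# The /tmp/pip- prefix is an assumption based on the string reported above.
TMP_PATH = re.compile(rb"/tmp/pip-[A-Za-z0-9_.\-]+(?:/[A-Za-z0-9_.\-]+)*")


def find_leaked_paths(data: bytes) -> List[str]:
    """Return the distinct pip temporary paths embedded in the given bytes."""
    found = {m.group(0).decode("ascii") for m in TMP_PATH.finditer(data)}
    return sorted(found)


# Invented sample standing in for the contents of a compiled extension.
sample = b"\x00\x01GCC: (GNU)\x00/tmp/pip-wheel-_bd8v3f2/pyyaml\x00ext/_yaml.c\x00"
print(find_leaked_paths(sample))  # ['/tmp/pip-wheel-_bd8v3f2/pyyaml']
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to `find_leaked_paths`. An empty result only rules out this one source of non-determinism.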
Additional context
rules_python issue discussing this problem: https://github.com/bazelbuild/rules_python/issues/154
rules_python repo: https://github.com/bazelbuild/rules_python
Top GitHub Comments
In a simple test, I was able to get consistent builds by exporting CFLAGS=-g0 before building the wheel. This prevents any debug information from being added to the generated libraries, which is where the TempDirectory path was being pulled in. I also have SOURCE_DATE_EPOCH set. I don’t know how universal this is (and, of course, you lose debugging symbols).

It is slightly different, since building in the source tree does not necessarily mean the built artifacts are in the source tree. It is only by tradition that the most popular back-end (setuptools) does this. Having in-tree builds would happen to solve the immediate problem, but IMO the ultimate solution would be to introduce a flag to PEP 517 that tells the back-end where it must generate the artifact, and to add a flag to pip that lets the user provide that information.
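As a sketch of the CFLAGS=-g0 workaround from the first comment above, the helper below prepares the environment and invokes pip wheel from Python. The function names, the default SOURCE_DATE_EPOCH value (1980-01-01, chosen arbitrarily) and the choice of pip options are assumptions for illustration. The comment does not say which value or options were used.

```python
import os
import subprocess
import sys


def reproducible_env(base=None, epoch="315532800"):
    """Copy an environment and apply the settings described in the comment."""
    env = dict(os.environ if base is None else base)
    # Append -g0 so it comes after any existing flags; this drops debug info,
    # which is where the temporary build path was being embedded.
    env["CFLAGS"] = (env.get("CFLAGS", "") + " -g0").strip()
    # Keep a caller-provided timestamp; otherwise use the assumed default.
    env.setdefault("SOURCE_DATE_EPOCH", epoch)
    return env


def build_wheel(requirement, out_dir):
    """Build a wheel from source with the adjusted environment (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "pip", "wheel",
        "--no-binary", ":all:",   # force a source build so the compiler runs
        "--no-deps",
        "--wheel-dir", out_dir,
        requirement,
    ]
    subprocess.run(cmd, check=True, env=reproducible_env())


if __name__ == "__main__":
    print(reproducible_env({"CFLAGS": "-O2"})["CFLAGS"])  # -O2 -g0
```

Calling `build_wheel("PyYAML==5.1", "wheels")` twice and comparing the digests of the resulting .so files would show whether the workaround holds for a given package. As the comment notes, it may not be universal.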