Big endian not enforced in float arrays in de/serialisation
Describe the bug
newbyteorder does not change the order of the bytes, but only the order in which they are interpreted, e.g.
>>> arr = np.arange(3).astype(float)
>>> arr
array([0., 1., 2.])
>>> arr.newbyteorder(">")
array([0.00000e+000, 3.03865e-319, 3.16202e-322])
So this function writes exactly the same bytes in the same order that were in the array before.
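For contrast, here is a minimal sketch (not the toolkit's code) of a conversion that actually swaps the stored bytes, using astype with an explicitly big-endian dtype:

```python
import numpy as np

arr = np.arange(3).astype(float)
big = np.dtype(float).newbyteorder(">")

# astype() converts the data: the stored bytes are swapped so that the
# values survive a big-endian interpretation.
converted = arr.astype(big)
print(converted)  # [0. 1. 2.] -- values preserved

# Reading the converted buffer back with the big-endian dtype recovers
# the original values on any host:
print(np.frombuffer(converted.tobytes(), dtype=big))  # [0. 1. 2.]
```

Unlike newbyteorder, astype here changes both the bytes and the dtype's interpretation together, so the values are unchanged while the on-disk representation becomes platform-independent.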
Deserialisation
newbyteorder is not side-effecting, so calling it and discarding the return value does nothing. See https://github.com/openforcefield/openff-interchange/issues/345 for more.
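The non-mutating behaviour is easy to demonstrate at the dtype level (a small sketch; newbyteorder returns a new object rather than modifying its receiver):

```python
import numpy as np

dt = np.dtype(float)
dt.newbyteorder(">")     # returns a NEW dtype; the result is discarded here
print(dt.byteorder)      # '=' -- dt itself is unchanged (still native)

# The return value has to be captured and used explicitly:
dt_big = dt.newbyteorder(">")
print(dt_big.byteorder)  # '>'
```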
To Reproduce
Output
Computing environment (please complete the following information):
- Operating system
- Output of running conda list
Additional context
The reason the missing endian enforcement has not shown up in tests is that both functions have no effect, so they default to the native endianness of the system. This likely hasn’t been a problem for users transferring files because, by and large, people work on x86, which is little-endian. I think one could test the functions individually by creating arrays with opposite endianness and checking that they are serialized and deserialized properly, e.g. on my Mac:
>>> import numpy as np
>>> from openff.toolkit.utils.utils import serialize_numpy, deserialize_numpy
>>> import sys
>>> sys.byteorder
'little'
>>> dt_bigendian = np.dtype(float).newbyteorder(">")
>>> arr = np.arange(3).astype(dt_bigendian)
>>> arr
array([0., 1., 2.])
>>> np.frombuffer(arr.tobytes())
array([0.00000e+000, 3.03865e-319, 3.16202e-322])
>>> actually_serialize_numpy = lambda x: (x.astype(dt_bigendian).tobytes(), x.shape)
>>> actually_deserialize_numpy = lambda x, y: np.reshape(np.frombuffer(x, dtype=dt_bigendian), y)
>>> np.frombuffer(actually_serialize_numpy(arr)[0])
array([0.00000e+000, 3.03865e-319, 3.16202e-322])
>>> little_arr = np.arange(3).astype(float)
>>> np.frombuffer(serialize_numpy(little_arr)[0], dtype=dt_bigendian)
array([0.00000e+000, 3.03865e-319, 3.16202e-322])
>>> actually_deserialize_numpy(*actually_serialize_numpy(arr))
array([0., 1., 2.])
>>> actually_deserialize_numpy(np.arange(3).astype(float).tobytes(), (3,))
array([0.00000e+000, 3.03865e-319, 3.16202e-322])
So tests could look like:
def test_serialize_numpy():
    original = np.arange(3).astype(float)
    dt_little = np.dtype(float).newbyteorder("<")
    dt_big = np.dtype(float).newbyteorder(">")
    arr = original.astype(dt_little)
    assert_allclose(arr, original)
    deserialized = np.frombuffer(serialize_numpy(arr)[0], dtype=dt_big)
    assert_allclose(deserialized, original)

def test_deserialize_numpy():
    original = np.arange(3).astype(float)
    dt_big = np.dtype(float).newbyteorder(">")
    arr = original.astype(dt_big)
    deserialized = deserialize_numpy(arr.tobytes(), arr.shape)
    assert_allclose(deserialized, original)

def test_serialization_roundtrip():
    original = np.arange(3).astype(float)
    deserialized = deserialize_numpy(*serialize_numpy(original))
    assert_allclose(deserialized, original)
These should fail appropriately on a little-endian system; on a big-endian system, code that silently falls back to native endianness might still pass. I think test_serialize_numpy, by starting from an explicitly little-endian array, should guard against that.
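For reference, here is a self-contained version of those tests, run against stand-in implementations that do enforce big-endian output (the stand-ins are illustrative, not the toolkit's actual serialize_numpy/deserialize_numpy):

```python
import numpy as np
from numpy.testing import assert_allclose

dt_big = np.dtype(float).newbyteorder(">")

# Stand-in implementations that genuinely enforce big-endian bytes.
def serialize_numpy(arr):
    return arr.astype(dt_big).tobytes(), arr.shape

def deserialize_numpy(buf, shape):
    return np.frombuffer(buf, dtype=dt_big).reshape(shape)

def test_serialize_numpy():
    original = np.arange(3).astype(float)
    dt_little = np.dtype(float).newbyteorder("<")
    # Start from explicitly little-endian input so a no-op serializer
    # fails even on a big-endian host.
    arr = original.astype(dt_little)
    assert_allclose(arr, original)
    deserialized = np.frombuffer(serialize_numpy(arr)[0], dtype=dt_big)
    assert_allclose(deserialized, original)

def test_deserialize_numpy():
    original = np.arange(3).astype(float)
    arr = original.astype(dt_big)
    deserialized = deserialize_numpy(arr.tobytes(), arr.shape)
    assert_allclose(deserialized, original)

def test_serialization_roundtrip():
    original = np.arange(3).astype(float)
    deserialized = deserialize_numpy(*serialize_numpy(original))
    assert_allclose(deserialized, original)

test_serialize_numpy()
test_deserialize_numpy()
test_serialization_roundtrip()
print("all tests passed")
```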
Issue Analytics
- State:
- Created 2 years ago
- Comments: 9 (5 by maintainers)
No worries! I am in NYC at the moment, though, so any feedback might be delayed.
Thanks for catching this, @lilyminium! This highlights the danger of my “just frankenstein code from stackoverflow” method for code development. I’ve read a bit into it and I think I get the gist, but since you understand this more fully, is it OK if @mattwthompson tags you for review once he has the PR in an acceptable state? I don’t have time to go as deep right now, so I’d most likely just rubber-stamp the PR if I were primary reviewer.