dtype of tensors should be preserved
After switching to `datasets`, my model just broke. After a weekend of debugging, the issue turned out to be that my model could not handle the double-precision floats that the `Dataset` provided, as it expected single precision (but it didn't give a warning, which seems to be a PyTorch issue).

As a user I did not expect this bug. I have a `map` function that I call on the `Dataset` that looks like this:
```python
from typing import List

# vocab (a token -> id lookup) and stransformer (a sentence-transformer
# model) are defined elsewhere
def preprocess(sentences: List[str]):
    token_ids = [[vocab.to_index(t) for t in s.split()] for s in sentences]
    sembeddings = stransformer.encode(sentences)
    print(sembeddings.dtype)
    return {"input_ids": token_ids, "sembedding": sembeddings}
```
Given a list of sentences (`List[str]`), it converts those into `token_ids` on the one hand (a list of lists of ints; `List[List[int]]`) and into sentence embeddings on the other (a `Tensor` of dtype `torch.float32`). That means that I actually set the column `"sembedding"` to a tensor that I, as a user, expect to be a float32.
It appears, though, that behind the scenes this tensor is converted into a list. I did not find this documented anywhere, but I might have missed it. From a user's perspective this is incredibly important, because it means you cannot do any dtype or tensor casting yourself in a mapping function! Furthermore, this can lead to subtle issues, as it did in my case.
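A minimal sketch of the behaviour (assuming `preprocess` as above; the column name `"text"` and the data are made up for illustration):

```python
from datasets import Dataset

dset = Dataset.from_dict({"text": ["a b", "c d"]})
dset = dset.map(lambda batch: preprocess(batch["text"]), batched=True)

# the float32 array returned by preprocess comes back as plain Python lists
print(type(dset[0]["sembedding"]))  # <class 'list'>
```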
My model expected float32 precision, which I assumed `sembedding` had, because that is what `stransformer.encode` outputs. But behind the scenes this tensor is first cast to a list, and when we then set its format, as below, this column is cast not to float32 but to double-precision float64.
```python
dataset.set_format(type="torch", columns=["input_ids", "sembedding"])
```
This happens because there apparently is an intermediate step of casting to a numpy array, whose dtype inference differs from torch's (see the snippet below). As you can see, this means that the dtype is not preserved: if I got it right, the data go from `torch.float32` -> list -> `float64` (numpy) -> `torch.float64`.
```python
import torch
import numpy as np

l = [-0.03010837361216545, -0.035979013890028, -0.016949838027358055]
torch_tensor = torch.tensor(l)
np_array = np.array(l)
np_to_torch = torch.from_numpy(np_array)

print(torch_tensor.dtype)
# torch.float32
print(np_array.dtype)
# float64
print(np_to_torch.dtype)
# torch.float64
```
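For completeness, specifying the dtype explicitly on the numpy side does keep the precision through the round trip (a small sketch, not part of the original report):

```python
# forcing float32 on the numpy side avoids the float64 default
np_array32 = np.array(l, dtype=np.float32)
print(torch.from_numpy(np_array32).dtype)
# torch.float32
```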
This might lead to unwanted behaviour. I understand that the whole library is probably built around casting from numpy to other frameworks, so this might be difficult to solve. Perhaps `set_format` should include a `dtypes` option where the user can specify the wanted precision for each input column.
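Something like the following, purely hypothetical, usage (this `dtypes` argument does not exist; it only illustrates the proposal):

```python
# hypothetical API, not implemented in datasets
dataset.set_format(
    type="torch",
    columns=["input_ids", "sembedding"],
    dtypes={"input_ids": torch.int64, "sembedding": torch.float32},
)
```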
The alternative is that the user casts manually after loading data from the dataset, but that does not seem user-friendly, makes the dataset less portable, and might use more space in memory, as well as on disk, than is actually needed.
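One way to pin the stored precision today, as far as I can tell, is to pass explicit `Features` to `map` so the column is written as float32 in Arrow (a sketch under that assumption, reusing the made-up `dset` from above):

```python
from datasets import Features, Sequence, Value

features = Features(
    {
        "text": Value("string"),
        "input_ids": Sequence(Value("int64")),
        "sembedding": Sequence(Value("float32")),  # keep single precision on disk
    }
)
dset = dset.map(
    lambda batch: preprocess(batch["text"]), batched=True, features=features
)
```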
Reopening since @bhavitvyamalik started looking into it!
Also I'm posting here a function that could be helpful to support preserving the dtype of tensors. It's used to build a pyarrow array out of a numpy array, working around the fact that `pa.array` can only take a 1D array as input. The only costly operation is the offsets computation: since it doesn't iterate over the numpy array values, this function is pretty fast.
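The function itself was elided here; the following is my reconstruction of the idea, not the original code: flatten the array to 1D, build a `pa.array` from it, and rebuild the nesting with offsets, one level per extra dimension.

```python
import numpy as np
import pyarrow as pa

def numpy_to_pyarrow_listarray(arr: np.ndarray) -> pa.Array:
    # pa.array can only take a 1D array as input, so flatten first;
    # pyarrow preserves the numpy dtype (e.g. float32 stays float32)
    values = pa.array(arr.reshape(-1))
    # rebuild the nesting from the innermost dimension outwards;
    # computing the offsets is the only costly step
    for size in reversed(arr.shape[1:]):
        n_lists = len(values) // size
        offsets = pa.array(np.arange(0, (n_lists + 1) * size, size, dtype=np.int32))
        values = pa.ListArray.from_arrays(offsets, values)
    return values

# e.g. a (2, 3) float32 array becomes a list<float> array of length 2
print(numpy_to_pyarrow_listarray(np.random.rand(2, 3).astype(np.float32)).type)
# list<item: float>
```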
Hi!

It would be awesome to achieve this speed for numpy arrays! For now we have to use `encode_nested_example` to convert numpy arrays to python lists, since pyarrow doesn't support multidimensional numpy arrays (only 1D).

Maybe let's start a new PR from your PR @bhavitvyamalik (idk why we didn't answer your PR at that time, sorry about that). Basically the idea is to allow `TypedSequence` to support numpy arrays as you did, and remove the numpy->python casting in `_cast_to_python_objects`.

This is really important since we are starting to have a focus on other modalities than text as well (audio, images).
Though until then @samgd, there is another feature that may interest you and that may give you the speed you want: in a dataset script you can subclass either a `GeneratorBasedBuilder` (with the `_generate_examples` method) or an `ArrowBasedBuilder` if you want. The `ArrowBasedBuilder` allows you to yield arrow data by implementing the `_generate_tables` method (it's the same as `_generate_examples`, except you must yield arrow tables). Since the data are already in arrow format, it doesn't call `encode_nested_example`. Let me know if that helps.
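A minimal sketch of such a builder, assuming the `numpy_to_pyarrow_listarray` helper above; the class name, column name, and random data are illustrative only:

```python
import numpy as np
import pyarrow as pa
import datasets

class EmbeddingsDataset(datasets.ArrowBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"sembedding": datasets.Sequence(datasets.Value("float32"))}
            )
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_tables(self):
        # yield (key, pa.Table) pairs; the data stay in arrow format, so
        # encode_nested_example is never called and float32 is preserved
        embeddings = np.random.rand(4, 384).astype(np.float32)
        yield 0, pa.table({"sembedding": numpy_to_pyarrow_listarray(embeddings)})
```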