dtype of tensors should be preserved
After switching to `datasets`, my model just broke. After a weekend of debugging, the issue turned out to be that my model could not handle the double-precision floats that the `Dataset` provided, as it expected single precision (but it didn't give a warning, which seems to be a PyTorch issue).

As a user I did not expect this bug. I have a `map` function that I call on the `Dataset` that looks like this:
```python
from typing import List

# vocab (a token -> id lookup) and stransformer (a sentence-transformer
# model) are defined elsewhere
def preprocess(sentences: List[str]):
    token_ids = [[vocab.to_index(t) for t in s.split()] for s in sentences]
    sembeddings = stransformer.encode(sentences)
    print(sembeddings.dtype)
    return {"input_ids": token_ids, "sembedding": sembeddings}
```
Given a list of sentences (`List[str]`), it converts those into `token_ids` on the one hand (a list of lists of ints; `List[List[int]]`) and into sentence embeddings on the other (a `Tensor` of dtype `torch.float32`). That means that I actually set the column `"sembedding"` to a tensor that I, as a user, expect to be a float32.
It appears, though, that behind the scenes this tensor is converted into a list. I did not find this documented anywhere, but I might have missed it. From a user's perspective this is incredibly important, because it means you cannot do any dtype or tensor casting yourself in a mapping function! Furthermore, this can lead to subtle issues, as it did in my case.
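A minimal sketch of the behaviour (assuming `preprocess` as above; the column name `"text"` and the data are made up for illustration):

```python
from datasets import Dataset

dset = Dataset.from_dict({"text": ["a b", "c d"]})
dset = dset.map(lambda batch: preprocess(batch["text"]), batched=True)

# the float32 array returned by preprocess comes back as plain Python lists
print(type(dset[0]["sembedding"]))  # <class 'list'>
```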
My model expected float32 precision, which I assumed `sembedding` had, because that is what `stransformer.encode` outputs. But behind the scenes this tensor is first cast to a list, and when we then set its format, as below, this column is cast not to float32 but to double-precision float64.
```python
dataset.set_format(type="torch", columns=["input_ids", "sembedding"])
```
This happens because there apparently is an intermediate step of casting to a numpy array, whose dtype inference differs from torch's (see the snippet below). As you can see, this means that the dtype is not preserved: if I got it right, the data go from `torch.float32` -> list -> `float64` (numpy) -> `torch.float64`.
```python
import torch
import numpy as np

l = [-0.03010837361216545, -0.035979013890028, -0.016949838027358055]
torch_tensor = torch.tensor(l)
np_array = np.array(l)
np_to_torch = torch.from_numpy(np_array)

print(torch_tensor.dtype)
# torch.float32
print(np_array.dtype)
# float64
print(np_to_torch.dtype)
# torch.float64
```
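For completeness, specifying the dtype explicitly on the numpy side does keep the precision through the round trip (a small sketch, not part of the original report):

```python
# forcing float32 on the numpy side avoids the float64 default
np_array32 = np.array(l, dtype=np.float32)
print(torch.from_numpy(np_array32).dtype)
# torch.float32
```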
This might lead to unwanted behaviour. I understand that the whole library is probably built around casting from numpy to other frameworks, so this might be difficult to solve. Perhaps `set_format` should include a `dtypes` option where the user can specify the wanted precision for each input column.
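Something like the following, purely hypothetical, usage (this `dtypes` argument does not exist; it only illustrates the proposal):

```python
# hypothetical API, not implemented in datasets
dataset.set_format(
    type="torch",
    columns=["input_ids", "sembedding"],
    dtypes={"input_ids": torch.int64, "sembedding": torch.float32},
)
```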
The alternative is that the user casts manually after loading data from the dataset, but that does not seem user-friendly, makes the dataset less portable, and might use more space in memory, as well as on disk, than is actually needed.
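One way to pin the stored precision today, as far as I can tell, is to pass explicit `Features` to `map` so the column is written as float32 in Arrow (a sketch under that assumption, reusing the made-up `dset` from above):

```python
from datasets import Features, Sequence, Value

features = Features(
    {
        "text": Value("string"),
        "input_ids": Sequence(Value("int64")),
        "sembedding": Sequence(Value("float32")),  # keep single precision on disk
    }
)
dset = dset.map(
    lambda batch: preprocess(batch["text"]), batched=True, features=features
)
```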
Reopening since @bhavitvyamalik started looking into it!
Also I'm posting here a function that could be helpful to support preserving the dtype of tensors. It's used to build a pyarrow array out of a numpy array, working around the fact that `pa.array` can only take a 1D array as input. The only costly operation is the offsets computation: since it doesn't iterate over the numpy array values, this function is pretty fast.
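The function itself was elided here; the following is my reconstruction of the idea, not the original code: flatten the array to 1D, build a `pa.array` from it, and rebuild the nesting with offsets, one level per extra dimension.

```python
import numpy as np
import pyarrow as pa

def numpy_to_pyarrow_listarray(arr: np.ndarray) -> pa.Array:
    # pa.array can only take a 1D array as input, so flatten first;
    # pyarrow preserves the numpy dtype (e.g. float32 stays float32)
    values = pa.array(arr.reshape(-1))
    # rebuild the nesting from the innermost dimension outwards;
    # computing the offsets is the only costly step
    for size in reversed(arr.shape[1:]):
        n_lists = len(values) // size
        offsets = pa.array(np.arange(0, (n_lists + 1) * size, size, dtype=np.int32))
        values = pa.ListArray.from_arrays(offsets, values)
    return values

# e.g. a (2, 3) float32 array becomes a list<float> array of length 2
print(numpy_to_pyarrow_listarray(np.random.rand(2, 3).astype(np.float32)).type)
# list<item: float>
```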
Hi!

It would be awesome to achieve this speed for numpy arrays! For now we have to use `encode_nested_example` to convert numpy arrays to python lists, since pyarrow doesn't support multidimensional numpy arrays (only 1D).

Maybe let's start a new PR from your PR @bhavitvyamalik (idk why we didn't answer your PR at that time, sorry about that). Basically the idea is to allow `TypedSequence` to support numpy arrays as you did, and remove the numpy->python casting in `_cast_to_python_objects`.

This is really important since we are starting to have a focus on other modalities than text as well (audio, images).
Though until then @samgd, there is another feature that may interest you and that may give you the speed you want: in a dataset script you can subclass either a `GeneratorBasedBuilder` (with the `_generate_examples` method) or an `ArrowBasedBuilder` if you want. The `ArrowBasedBuilder` allows you to yield arrow data by implementing the `_generate_tables` method (it's the same as `_generate_examples`, except you must yield arrow tables). Since the data are already in arrow format, it doesn't call `encode_nested_example`. Let me know if that helps.
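A minimal sketch of such a builder, assuming the `numpy_to_pyarrow_listarray` helper above; the class name, column name, and random data are illustrative only:

```python
import numpy as np
import pyarrow as pa
import datasets

class EmbeddingsDataset(datasets.ArrowBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"sembedding": datasets.Sequence(datasets.Value("float32"))}
            )
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_tables(self):
        # yield (key, pa.Table) pairs; the data stay in arrow format, so
        # encode_nested_example is never called and float32 is preserved
        embeddings = np.random.rand(4, 384).astype(np.float32)
        yield 0, pa.table({"sembedding": numpy_to_pyarrow_listarray(embeddings)})
```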