Other datatype for LabelMap than float32
🚀 Feature

I noticed that a LabelMap and an IntensityImage are both saved as float32 tensors, which means that the LabelMap uses a lot more memory than needed. This is because of this piece of code in io.py, which casts all images to float32:
```python
def _read_sitk(path: TypePath) -> Tuple[torch.Tensor, np.ndarray]:
    if Path(path).is_dir():  # assume DICOM
        image = _read_dicom(path)
    else:
        image = sitk.ReadImage(str(path))
    data, affine = sitk_to_nib(image, keepdim=True)
    if data.dtype != np.float32:
        data = data.astype(np.float32)
    tensor = torch.from_numpy(data)
    return tensor, affine
```
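In practice this means that even a segmentation stored with an integer dtype on disk is loaded as float32. A minimal demonstration (the path is hypothetical, assuming a uint8 segmentation file):

```python
import torchio as tio

label = tio.LabelMap('subject_seg.nii.gz')  # hypothetical path to a uint8 segmentation
print(label.data.dtype)  # torch.float32, regardless of the dtype stored on disk
```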
Is there a reason for this `.astype(np.float32)`?
This could be made a lot more memory-friendly by removing the cast and storing segmentations in memory as, for example, uint8. I also expect spatial augmentations that require resampling to be a lot faster when they work with uint8 instead of float32.
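To put a number on the memory claim (the volume size is illustrative, not from the issue):

```python
import numpy as np

shape = (256, 256, 256)  # illustrative segmentation volume
voxels = np.prod(shape)
print(voxels * 4 / 2**20)  # float32: 64.0 MiB
print(voxels * 1 / 2**20)  # uint8:   16.0 MiB, i.e. a 4x saving per label volume
```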
Motivation
- Better use of memory
- Faster augmentations which require resampling
Pitch
Do not cast all tensors to float32; allow different dtypes.
Could these two lines be removed? All tests still pass when I comment them out. Maybe only cast bool to np.uint8 because SimpleITK does not support bool?
```python
if data.dtype != np.float32:
    data = data.astype(np.float32)
```
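A minimal sketch of the suggested alternative (an assumption, not a committed change): keep the dtype read from disk and cast only the bool case.

```python
# Hypothetical replacement (sketch): preserve the native dtype and only
# cast bool, which SimpleITK does not support as an image type.
if data.dtype == bool:
    data = data.astype(np.uint8)
tensor = torch.from_numpy(data)
```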
Top GitHub Comments
@romainVala I think the proposal is not really to force a specific type, but to stop forcing everything to be float32. So your partial volume maps (which maybe shouldn't be instantiated as a label map, as they don't contain categorical labels) would still be processed fine.
I just tried timing a resampling transform with uint8 and float32 inputs.
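The exact snippets are not shown here; below is a minimal reconstruction of the idea, in which scipy.ndimage, the array shape, and the transform are all assumptions rather than the original code:

```python
# Hypothetical reconstruction of the timing comparison (not the original snippet).
import timeit

import numpy as np
from scipy.ndimage import affine_transform

labels_u8 = np.random.randint(0, 4, size=(192, 192, 192), dtype=np.uint8)
labels_f32 = labels_u8.astype(np.float32)
matrix = np.diag([1.1, 1.1, 1.1])  # mild uniform zoom, forces resampling

for name, volume in [('uint8', labels_u8), ('float32', labels_f32)]:
    # order=0 (nearest neighbour) is the interpolation used for label maps
    seconds = timeit.timeit(lambda: affine_transform(volume, matrix, order=0), number=3)
    print(f'{name}: {seconds:.2f} s')
```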
So you’re right, it’s faster in uint8. I did this because some transforms required float, so I just converted everything to float. Another reason is that with a consistent data type, everything works smoothly with a data loader.
The tests probably pass because they typically use images that are created in float32 (and obviously because they’re not complete enough).
I agree that saving in float by default is not good. There should be at least a kwarg for the dtype.
So what do you think? I suppose there could be a `Cast` transform that could be used before a data loader (and by transforms that need float), but this would be quite backwards-incompatible. But if it makes the library way faster, it might be a good thing to do.
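For what it's worth, a rough sketch of what such a transform could look like, written as a plain callable over a sample dict rather than the library's actual Transform base class (every name here is hypothetical):

```python
import torch

class Cast:
    """Hypothetical sketch: cast selected tensors in a sample to a target dtype."""

    def __init__(self, dtype=torch.float32, keys=('image',)):
        self.dtype = dtype
        self.keys = keys  # which entries of the sample dict to cast

    def __call__(self, sample):
        for key in self.keys:
            sample[key] = sample[key].to(self.dtype)
        return sample
```

Used as, e.g., `Cast(torch.float32)` right before a data loader or a float-only transform, so labels could stay uint8 in memory until then.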