Bug with transforms.Resize when used with transforms.ConvertImageDtype
🐛 Describe the bug
Recent releases of Torchvision and the documentation that supports it seem to suggest that we can use `io.read_image` + `transforms.ConvertImageDtype` instead of the traditional `PIL.Image.read_fn` + `transforms.ToTensor`. However, I have found that there are two issues:

- `io.read_image` + `transforms.ConvertImageDtype` does not actually return the same tensor values as PIL + `transforms.ToTensor`, even though the two are supposed to provide the same functionality (a minimal comparison sketch follows this list).
- While `io.read_image` + `transforms.ConvertImageDtype` itself is significantly faster than using PIL, combining it with the `transforms.Resize` operation - specifically when upsampling - makes the operation much slower than the PIL alternative.
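A minimal sketch of the two pipelines being compared, assuming an RGB JPEG at a hypothetical path `img.jpg` (the exact printed values will depend on the image and the decoder in use):

```python
import torch
from PIL import Image
from torchvision import io, transforms

path = "img.jpg"  # hypothetical path to any RGB JPEG

# Pipeline A: torchvision-native decoding
a = transforms.ConvertImageDtype(torch.float)(io.read_image(path))

# Pipeline B: traditional PIL decoding
b = transforms.ToTensor()(Image.open(path))

print(a.dtype, b.dtype)         # torch.float32 torch.float32
print(torch.equal(a, b))        # reportedly False, despite equivalent functionality
print((a - b).abs().max())      # magnitude of the mismatch
```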
To add onto point 2, the two sets of functions I mention return the same type of tensor: `torch.float`. However, applying `transforms.Resize` to the tensor generated by `io.read_image` + `transforms.ConvertImageDtype` is much slower than applying the same resize operation to the output of PIL read + `transforms.ToTensor`. I can't really understand why this happens, since both calls to `Resize` are on tensors of type `torch.FloatTensor`. Also, this only occurs when upsampling.
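As a rough illustration of this point (not the original measurement), here is a timing sketch that assumes a small RGB image at a hypothetical path `small.jpg` and upsamples it to 224x224; absolute numbers will vary by machine:

```python
import time
import torch
from PIL import Image
from torchvision import io, transforms

path = "small.jpg"  # hypothetical path to a small (e.g. 32x32) RGB image
resize = transforms.Resize((224, 224))

t_io = transforms.ConvertImageDtype(torch.float)(io.read_image(path))
t_pil = transforms.ToTensor()(Image.open(path))
assert t_io.dtype == t_pil.dtype == torch.float32  # same dtype in both pipelines

def bench(x, n=100):
    """Average wall-clock time of one Resize call over n runs."""
    start = time.perf_counter()
    for _ in range(n):
        resize(x)
    return (time.perf_counter() - start) / n

print("read_image + ConvertImageDtype ->", bench(t_io))
print("PIL read   + ToTensor          ->", bench(t_pil))
```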
Please refer to my post on the PyTorch Forums here for the full analysis.
Versions
Collecting environment information…
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.4 (default, Apr 9 2021, 01:15:05) [GCC 5.4.0 20160609] (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 465.19.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.10.0+cu113
[pip3] torchaudio==0.10.0+cu113
[pip3] torchvision==0.11.1+cu113
[conda] Could not collect
Top GitHub Comments
Some info and benchmarks on this issue:
`read_image`:

Why is “read_image tensor -> Resize (32->224)” much slower than “CL contig tensor -> Resize (32->224)” for 1 thread? This is due to the following reason. The tensor returned by `read_image` has a memory format equivalent to channels last, but it is 3D. In Resize we call `unsqueeze(dim=0)` to make it 4D. Thus the input to `torch.nn.functional.interpolate` is a 4D channels-last tensor, but the output constructed from the input’s suggested format here is channels-first contiguous. As the output is contiguous channels first while the input is channels last, there are two places where the algorithm takes time:

- `auto output = output_.contiguous(channels_last_memory_format);`, https://github.com/pytorch/pytorch/blob/e3bcf64ff84f8e96839e39056b3b90d1bd1f8bbe/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L362
- `output_.copy_(output);`, https://github.com/pytorch/pytorch/blob/e3bcf64ff84f8e96839e39056b3b90d1bd1f8bbe/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L511-L513

Similar benchmark results hold for downsampling 500 -> 224.
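To make the layout difference concrete, here is a small inspection sketch (again assuming a hypothetical `img.jpg`); the strides in the comments are what one would expect for an HxW RGB image, and the final `.contiguous()` call is only a possible workaround, not a confirmed fix:

```python
import torch
from PIL import Image
from torchvision import io, transforms

t_io = io.read_image("img.jpg")                       # uint8, CHW shape, HWC-like strides
t_pil = transforms.ToTensor()(Image.open("img.jpg"))  # float32, CHW, contiguous

print(t_io.shape, t_io.stride())    # e.g. (3, H, W) with strides (1, 3*W, 3)
print(t_pil.shape, t_pil.stride())  # (3, H, W) with strides (H*W, W, 1)

# After the unsqueeze(dim=0) performed inside Resize, the read_image tensor is
# seen as a 4D channels-last tensor, while the ToTensor output stays channels-first:
print(t_io.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))   # True
print(t_pil.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))  # False

# Possible (unconfirmed) workaround: force channels-first contiguous layout
# before Resize so the input and output layouts match.
t_io_contig = transforms.ConvertImageDtype(torch.float)(t_io).contiguous()
```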
If I fix locally the issue with the non-contiguous output from `read_image`, the results are the following:

Here is the code used for the benchmarking: https://gist.github.com/vfdev-5/8c26a109d7718035162a6d5d138b5499
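The benchmark code itself is in the gist above; purely as an illustration of the kind of comparison it makes, the sketch below isolates the memory-format effect with synthetic tensors instead of decoded images (numbers will of course vary by machine):

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision import transforms

resize = transforms.Resize((224, 224))  # the upsampling case, 32 -> 224

# Channels-first contiguous tensor, like the output of PIL + ToTensor.
contig = torch.rand(3, 32, 32)

# Dense HWC data viewed as CHW, emulating the layout returned by read_image.
read_image_like = torch.rand(32, 32, 3).permute(2, 0, 1)

for name, x in [("contiguous (PIL-like)", contig), ("read_image-like", read_image_like)]:
    timer = benchmark.Timer(stmt="resize(x)", globals={"resize": resize, "x": x})
    print(name, timer.timeit(100))  # Timer defaults to 1 thread
```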
To compare with the currently non-contiguous output from `read_image`: