Bug with transforms.Resize when used with transforms.ConvertImageDtype
🐛 Describe the bug
Recent releases of Torchvision and the documentation that supports it seem to suggest that we can use `io.read_image` + `transforms.ConvertImageDtype` instead of the traditional `PIL.Image.read_fn` + `transforms.ToTensor`. However, I have found that there are two issues:

- `io.read_image` + `transforms.ConvertImageDtype` does not actually return the same tensor values as PIL + `transforms.ToTensor`, even though the two are supposed to provide the same functionality (a minimal comparison sketch follows this list).
- While `io.read_image` + `transforms.ConvertImageDtype` itself is significantly faster than using PIL, combining it with the `transforms.Resize` operation - specifically when upsampling - makes the operation much slower than the PIL alternative.
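A minimal sketch of the two pipelines being compared, assuming an RGB JPEG at a hypothetical path `img.jpg` (the exact printed values will depend on the image and the decoder in use):

```python
import torch
from PIL import Image
from torchvision import io, transforms

path = "img.jpg"  # hypothetical path to any RGB JPEG

# Pipeline A: torchvision-native decoding
a = transforms.ConvertImageDtype(torch.float)(io.read_image(path))

# Pipeline B: traditional PIL decoding
b = transforms.ToTensor()(Image.open(path))

print(a.dtype, b.dtype)         # torch.float32 torch.float32
print(torch.equal(a, b))        # reportedly False, despite equivalent functionality
print((a - b).abs().max())      # magnitude of the mismatch
```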
To add onto point 2, the two sets of functions I mention return the same type of tensor: `torch.float`. However, applying `transforms.Resize` to the tensor generated by `io.read_image` + `transforms.ConvertImageDtype` is much slower than applying the same resize operation to the output of PIL read + `transforms.ToTensor`. I can't really understand why this happens, since both calls to `Resize` are on tensors of type `torch.FloatTensor`. Also, this only occurs when upsampling.
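As a rough illustration of this point (not the original measurement), here is a timing sketch that assumes a small RGB image at a hypothetical path `small.jpg` and upsamples it to 224x224; absolute numbers will vary by machine:

```python
import time
import torch
from PIL import Image
from torchvision import io, transforms

path = "small.jpg"  # hypothetical path to a small (e.g. 32x32) RGB image
resize = transforms.Resize((224, 224))

t_io = transforms.ConvertImageDtype(torch.float)(io.read_image(path))
t_pil = transforms.ToTensor()(Image.open(path))
assert t_io.dtype == t_pil.dtype == torch.float32  # same dtype in both pipelines

def bench(x, n=100):
    """Average wall-clock time of one Resize call over n runs."""
    start = time.perf_counter()
    for _ in range(n):
        resize(x)
    return (time.perf_counter() - start) / n

print("read_image + ConvertImageDtype ->", bench(t_io))
print("PIL read   + ToTensor          ->", bench(t_pil))
```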
Please refer to my post on the PyTorch Forums here for the full analysis.
Versions
Collecting environment information…
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.4 (default, Apr 9 2021, 01:15:05) [GCC 5.4.0 20160609] (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 465.19.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.10.0+cu113
[pip3] torchaudio==0.10.0+cu113
[pip3] torchvision==0.11.1+cu113
[conda] Could not collect
Top GitHub Comments
Some info and benchmarks on this issue:
`read_image`:

Why is “read_image tensor -> Resize (32->224)” much slower than “CL contig tensor -> Resize (32->224)” for 1 thread? This is due to the following reason. The tensor returned by `read_image` has a memory format equivalent to channels last, but it is 3D. In Resize we call `unsqueeze(dim=0)` to make it 4D. Thus the input to `torch.nn.functional.interpolate` is a 4D channels-last tensor, but the output constructed from the input’s suggested format here is channels-first contiguous. As the output is contiguous channels first while the input is channels last, there are two places where the algorithm takes time:

- `auto output = output_.contiguous(channels_last_memory_format);`, https://github.com/pytorch/pytorch/blob/e3bcf64ff84f8e96839e39056b3b90d1bd1f8bbe/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L362
- `output_.copy_(output);`, https://github.com/pytorch/pytorch/blob/e3bcf64ff84f8e96839e39056b3b90d1bd1f8bbe/aten/src/ATen/native/cpu/UpSampleKernel.cpp#L511-L513

Similar benchmark results hold for downsampling 500 -> 224.
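To make the layout difference concrete, here is a small inspection sketch (again assuming a hypothetical `img.jpg`); the strides in the comments are what one would expect for an HxW RGB image, and the final `.contiguous()` call is only a possible workaround, not a confirmed fix:

```python
import torch
from PIL import Image
from torchvision import io, transforms

t_io = io.read_image("img.jpg")                       # uint8, CHW shape, HWC-like strides
t_pil = transforms.ToTensor()(Image.open("img.jpg"))  # float32, CHW, contiguous

print(t_io.shape, t_io.stride())    # e.g. (3, H, W) with strides (1, 3*W, 3)
print(t_pil.shape, t_pil.stride())  # (3, H, W) with strides (H*W, W, 1)

# After the unsqueeze(dim=0) performed inside Resize, the read_image tensor is
# seen as a 4D channels-last tensor, while the ToTensor output stays channels-first:
print(t_io.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))   # True
print(t_pil.unsqueeze(0).is_contiguous(memory_format=torch.channels_last))  # False

# Possible (unconfirmed) workaround: force channels-first contiguous layout
# before Resize so the input and output layouts match.
t_io_contig = transforms.ConvertImageDtype(torch.float)(t_io).contiguous()
```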
If I fix locally the issue with the non-contiguous output from `read_image`, the results are the following:

Here is the code used for the benchmarking: https://gist.github.com/vfdev-5/8c26a109d7718035162a6d5d138b5499
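The benchmark code itself is in the gist above; purely as an illustration of the kind of comparison it makes, the sketch below isolates the memory-format effect with synthetic tensors instead of decoded images (numbers will of course vary by machine):

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision import transforms

resize = transforms.Resize((224, 224))  # the upsampling case, 32 -> 224

# Channels-first contiguous tensor, like the output of PIL + ToTensor.
contig = torch.rand(3, 32, 32)

# Dense HWC data viewed as CHW, emulating the layout returned by read_image.
read_image_like = torch.rand(32, 32, 3).permute(2, 0, 1)

for name, x in [("contiguous (PIL-like)", contig), ("read_image-like", read_image_like)]:
    timer = benchmark.Timer(stmt="resize(x)", globals={"resize": resize, "x": x})
    print(name, timer.timeit(100))  # Timer defaults to 1 thread
```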
To compare with the currently non-contiguous output from `read_image`: