Weird behavior of pretrained models with CUDA


Python: 3.6.9, CUDA: 11.2, compressai: 1.1.3, torch: 1.8.1

While replicating the RD curves from the README on the Kodak dataset (downloaded into ./kodak/ with wget http://r0k.us/graphics/kodak/kodak/kodim{0,1,2}{0,1,2,3,4,5,6,7,8,9}.png), I observed weird behavior from the pretrained models with CUDA.

Everything seems to work well without CUDA

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric mse --quality 5

metric = mse

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-5-f8b614e1.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-5-f8b614e1.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      34.52624269077672
    ],
    "ms-ssim": [
      0.9835608204205831
    ],
    "bpp": [
      0.6686842176649305
    ],
    "encoding_time": [
      0.2404747505982717
    ],
    "decoding_time": [
      0.5095066924889883
    ]
  }
}

metric = ms-ssim

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric ms-ssim --quality 5

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      28.992422918554542
    ],
    "ms-ssim": [
      0.9866020356615385
    ],
    "bpp": [
      0.47353786892361116
    ],
    "encoding_time": [
      0.24171670277913412
    ],
    "decoding_time": [
      0.5283569494883219
    ]
  }
}

PSNR and MS-SSIM are both NaN when using CUDA

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric mse --quality 5 --cuda

metric = mse

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-5-f8b614e1.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-5-f8b614e1.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      NaN
    ],
    "ms-ssim": [
      NaN
    ],
    "bpp": [
      0.6686876085069443
    ],
    "encoding_time": [
      0.034142365058263145
    ],
    "decoding_time": [
      0.025616129239400227
    ]
  }
}

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric ms-ssim --quality 5 --cuda

metric = ms-ssim

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      NaN
    ],
    "ms-ssim": [
      NaN
    ],
    "bpp": [
      0.47353786892361116
    ],
    "encoding_time": [
      0.03800355394681295
    ],
    "decoding_time": [
      0.029240707556406658
    ]
  }
}
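The reports above can also be scanned for NaN metrics automatically. What follows is only a minimal sketch: it assumes the captured output looks like the logs shown here (a few progress lines followed by a single JSON object), and the helper name find_nan_metrics is hypothetical, not part of compressai. It relies on Python's json module accepting the NaN literal by default.

```python
import json
import math


def find_nan_metrics(stdout_text):
    """Return the names of metrics whose reported values include NaN."""
    # Skip any log lines (e.g. the download progress) before the JSON object.
    start = stdout_text.index("{")
    report = json.loads(stdout_text[start:])  # json.loads parses NaN by default
    return [
        name
        for name, values in report["results"].items()
        if any(math.isnan(v) for v in values)
    ]


# Abbreviated copy of the CUDA report above (timing entries left out).
sample = """100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [NaN],
    "ms-ssim": [NaN],
    "bpp": [0.6686876085069443]
  }
}
"""

print(find_nan_metrics(sample))  # ['psnr', 'ms-ssim']
```

Running this over every (model, quality, metric) combination would be one way to find the other affected checkpoints without reading each report by hand.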

I didn’t check every combination of model, quality, metric, and CUDA on/off, but bmshj2018-hyperprior with quality=8 (in addition to quality=5) also returned NaN when using CUDA, for both the mse and ms-ssim checkpoints. More checkpoints may be affected by the same issue.

When I checked the model output (i.e., out_dec["x_hat"]), some values in the tensor were NaN when using CUDA, which must be what caused this issue.
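To illustrate why a few NaN values in x_hat are enough to turn the averaged metrics into NaN, here is a small sketch that uses plain Python lists in place of tensors. The psnr and has_nan helpers are hypothetical and are not compressai's implementation; the pixel values are made up for the example and assume a [0, 1] range.

```python
import math


def has_nan(values):
    """True if any element of the sequence is NaN."""
    return any(math.isnan(v) for v in values)


def psnr(original, reconstructed, max_val=1.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_val ** 2 / mse)


x = [0.2, 0.5, 0.8, 0.4]                    # illustrative "image"
x_hat_ok = [0.21, 0.49, 0.8, 0.41]          # clean reconstruction
x_hat_bad = [0.21, float("nan"), 0.8, 0.41]  # one corrupted value

print(has_nan(x_hat_ok), psnr(x, x_hat_ok))    # False, finite PSNR
print(has_nan(x_hat_bad), psnr(x, x_hat_bad))  # True, nan
```

A single NaN poisons the mean squared error, and therefore the PSNR of that image and the dataset average. On real tensors the equivalent check is torch.isnan(x_hat).any().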

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
jbegaint commented, May 5, 2021

thanks for testing!

0 reactions
yoshitomo-matsubara commented, May 5, 2021

Hi @jbegaint, I fetched the master branch and tried the above configs for bmshj2018-hyperprior with and without CUDA. The results with CUDA now look the same as those without CUDA. Thank you for the fix!
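One way to make "look the same" concrete is to compare the CPU and CUDA reports with a small tolerance, since the two devices are not bit-identical (bpp already differed in the sixth decimal place in the original report). The sketch below is only an illustration: the differing_metrics helper and the rel_tol value are assumptions, and the numbers are the pre-fix mse results quoted in the issue, not results from the fixed branch.

```python
import math


def differing_metrics(cpu, cuda, keys=("psnr", "ms-ssim", "bpp"), rel_tol=1e-4):
    """Return the metric names whose CPU and CUDA values disagree."""
    mismatched = []
    for key in keys:
        for a, b in zip(cpu[key], cuda[key]):
            # math.isclose is False whenever either value is NaN.
            if not math.isclose(a, b, rel_tol=rel_tol):
                mismatched.append(key)
                break
    return mismatched


# Values copied from the quality=5 mse reports in the issue (before the fix).
cpu_results = {
    "psnr": [34.52624269077672],
    "ms-ssim": [0.9835608204205831],
    "bpp": [0.6686842176649305],
}
cuda_results = {
    "psnr": [float("nan")],
    "ms-ssim": [float("nan")],
    "bpp": [0.6686876085069443],
}

print(differing_metrics(cpu_results, cuda_results))  # ['psnr', 'ms-ssim']
```

Timing entries are deliberately left out of the comparison because they are expected to differ between devices.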

