Weird behavior of pretrained models with CUDA

Python: 3.6.9
CUDA: 11.2
compressai: 1.1.3
torch: 1.8.1

While reproducing the RD curves from the README on the Kodak dataset (downloaded with wget http://r0k.us/graphics/kodak/kodak/kodim{0,1,2}{0,1,2,3,4,5,6,7,8,9}.png in ./kodak/), I observed weird behavior from the pretrained models when using CUDA.
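For readers without wget or shell brace expansion, the download step can be sketched in Python. Note that the brace pattern above expands to 30 names (kodim00.png through kodim29.png). The sketch below assumes the dataset consists of kodim01.png through kodim24.png, so it requests only those. The helper names `kodak_urls` and `download_kodak` are hypothetical and are not part of compressai.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command in the issue.
BASE_URL = "http://r0k.us/graphics/kodak/kodak"


def kodak_urls(first=1, last=24):
    """Build the image URLs; the 01..24 range is an assumption about the dataset."""
    return [f"{BASE_URL}/kodim{i:02d}.png" for i in range(first, last + 1)]


def download_kodak(dest="./kodak"):
    """Download every image into dest, skipping files that already exist."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in kodak_urls():
        target = out_dir / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))
    return sorted(out_dir.glob("kodim*.png"))
```

Calling `download_kodak("./kodak")` should leave the images in the directory that the eval commands below read from.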

The models seem to work well without CUDA

metric = mse

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric mse --quality 5

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-5-f8b614e1.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-5-f8b614e1.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      34.52624269077672
    ],
    "ms-ssim": [
      0.9835608204205831
    ],
    "bpp": [
      0.6686842176649305
    ],
    "encoding_time": [
      0.2404747505982717
    ],
    "decoding_time": [
      0.5095066924889883
    ]
  }
}

metric = ms-ssim

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric ms-ssim --quality 5

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      28.992422918554542
    ],
    "ms-ssim": [
      0.9866020356615385
    ],
    "bpp": [
      0.47353786892361116
    ],
    "encoding_time": [
      0.24171670277913412
    ],
    "decoding_time": [
      0.5283569494883219
    ]
  }
}

PSNR and MS-SSIM are both NaN when using CUDA

metric = mse

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric mse --quality 5 --cuda

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-5-f8b614e1.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-5-f8b614e1.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      NaN
    ],
    "ms-ssim": [
      NaN
    ],
    "bpp": [
      0.6686876085069443
    ],
    "encoding_time": [
      0.034142365058263145
    ],
    "decoding_time": [
      0.025616129239400227
    ]
  }
}

metric = ms-ssim

python -m compressai.utils.eval_model pretrained ./kodak/ -a bmshj2018-hyperprior --metric ms-ssim --quality 5 --cuda

Downloading: "https://compressai.s3.amazonaws.com/models/v1/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar" to /home/yoshitom/.cache/torch/hub/checkpoints/bmshj2018-hyperprior-ms-ssim-5-c34afc8d.pth.tar
100.0%
{
  "name": "bmshj2018-hyperprior",
  "description": "Inference (ans)",
  "results": {
    "psnr": [
      NaN
    ],
    "ms-ssim": [
      NaN
    ],
    "bpp": [
      0.47353786892361116
    ],
    "encoding_time": [
      0.03800355394681295
    ],
    "decoding_time": [
      0.029240707556406658
    ]
  }
}
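The CPU and CUDA reports above can also be compared mechanically rather than by eye. The following is a minimal sketch, assuming each report is saved as text in the format printed by eval_model. The function name `compare_reports` and the tolerance are hypothetical choices, and the two sample strings are trimmed copies of the mse reports shown above. It relies on Python's `json` module accepting the bare `NaN` literal, which it does by default.

```python
import json
import math

# Trimmed copies of the mse reports from the issue (timing fields omitted).
CPU_REPORT = """{"name": "bmshj2018-hyperprior", "results": {
  "psnr": [34.52624269077672], "ms-ssim": [0.9835608204205831],
  "bpp": [0.6686842176649305]}}"""
CUDA_REPORT = """{"name": "bmshj2018-hyperprior", "results": {
  "psnr": [NaN], "ms-ssim": [NaN], "bpp": [0.6686876085069443]}}"""


def compare_reports(cpu_json, cuda_json, rel_tol=1e-3):
    """Label each quality metric as 'match', 'differ', or 'nan'."""
    cpu = json.loads(cpu_json)["results"]
    cuda = json.loads(cuda_json)["results"]
    verdicts = {}
    for key in ("psnr", "ms-ssim", "bpp"):
        a, b = cpu[key][0], cuda[key][0]
        if math.isnan(a) or math.isnan(b):
            verdicts[key] = "nan"  # NaN never compares equal, so flag it explicitly
        elif math.isclose(a, b, rel_tol=rel_tol):
            verdicts[key] = "match"
        else:
            verdicts[key] = "differ"
    return verdicts


print(compare_reports(CPU_REPORT, CUDA_REPORT))
```

Timing fields are left out on purpose, since encoding and decoding times are expected to differ between devices.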

I didn’t check every combination (model, quality, metric, with/without CUDA), but besides quality=5, bmshj2018-hyperprior with quality=8 also returned NaN when using CUDA, for both the mse and ms-ssim checkpoints. More checkpoints may be affected by the same issue.

When I inspected the model output (i.e., out_dec["x_hat"]), some values in the tensor were NaN when using CUDA, which must be what causes this issue.
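To illustrate why a few NaN values are enough to turn the whole metric into NaN, here is a minimal pure-Python sketch. It is not compressai's implementation: the `mse`, `psnr`, and `count_nan` helpers are hypothetical and work on flat lists of floats in [0, 1] standing in for pixel values.

```python
import math


def mse(reference, reconstruction):
    """Mean squared error over two equal-length sequences of floats."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    total = sum((r - x) ** 2 for r, x in zip(reference, reconstruction))
    return total / len(reference)


def psnr(reference, reconstruction, max_val=1.0):
    """PSNR in dB; infinite for identical inputs, NaN if any input is NaN."""
    err = mse(reference, reconstruction)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_val ** 2 / err)


def count_nan(values):
    """Number of NaN entries, useful for locating the faulty output."""
    return sum(1 for v in values if math.isnan(v))


reference = [0.1, 0.5, 0.9, 0.3]
clean = [0.1, 0.5, 0.8, 0.3]
corrupted = [0.1, float("nan"), 0.8, 0.3]  # a single bad value

print(psnr(reference, clean))      # finite value
print(psnr(reference, corrupted))  # nan: the one NaN poisons the mean
print(count_nan(corrupted))
```

On an actual tensor such as out_dec["x_hat"], the equivalent check would be something like `torch.isnan(x_hat).any()`.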

Top GitHub Comments

jbegaint commented, May 5, 2021

thanks for testing!

yoshitomo-matsubara commented, May 5, 2021

Hi @jbegaint, I fetched the master branch and reran the above configs for bmshj2018-hyperprior with and without CUDA. The results with CUDA look the same as those without CUDA. Thank you for the fix!