
DecodeError: 'utf-8' codec can't decode byte 0x82 in position 1598601: invalid start byte when using PersistentDataset

See original GitHub issue

Hi,

While running a training that uses PersistentDataset, I’m getting a Unicode error when it tries to load the cache files. This happens only after a substantial number of epochs, so the run gets through all the files several times before raising the error. Before every run of the script, I delete the entire contents of the cache directory and make sure it is empty before re-running. The error doesn’t always happen in the same epoch.
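The cleanup step described above (emptying the cache directory before each run) can be sketched with the standard library alone; the cache path below is a made-up example for the demo, not a path from the issue:

```python
import shutil
from pathlib import Path

def reset_cache(cache_dir):
    """Wipe every cached item so a persistent cache is rebuilt from scratch
    on the next run. 'cache_dir' is hypothetical here -- use whatever path
    was passed as the dataset's cache directory."""
    p = Path(cache_dir)
    if p.exists():
        shutil.rmtree(p)  # remove the directory and all cached files in it
    p.mkdir(parents=True, exist_ok=True)  # recreate it empty

reset_cache("/tmp/persistent_cache_demo")
leftover = list(Path("/tmp/persistent_cache_demo").iterdir())
print(len(leftover))  # 0 -- the directory exists and is empty
```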

IMPLEMENTATION DETAILS:

  • PersistentDataset is instantiated with an input dictionary containing: a label file and an image file, a small numpy array, an integer and a string, all under separate keys. The transform is a simple Compose of LoadImageD loading specific keys (‘img’, ‘label’) of NPZ files saved locally.
  • While iterating over this dataset, the cache files are created during the first epoch and then read over many subsequent epochs (~50–100) until the error occurs.

I wasn’t getting this error with older versions of MONAI; I’m now on 0.8.1.

Trace

for i, item in enumerate(self.source):
  File "######/data/spadenai_v2_sliced.py", line 135, in getIteratorFun
    for volumes in dataset:
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 97, in __getitem__
    return self._transform(index)
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 364, in _transform
    pre_random_item = self._cachecheck(self.data[index])
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 330, in _cachecheck
    return torch.load(hashfile)
  File "######/venv/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "######/venv/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1598601: invalid start byte
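The traceback shows torch.load failing while unpickling a cache file: the unpickler hits bytes that are not a valid pickle stream, which typically points at a cache entry that was truncated or overwritten mid-write. A stdlib-only sketch of the same check/load/rebuild shape, with a defensive fallback that drops a corrupt entry and recomputes it (all names here are illustrative, not MONAI’s code):

```python
import hashlib
import pickle
import shutil
from pathlib import Path

def load_or_recompute(cache_dir, key, compute):
    """Load a cached pickle; if the file is corrupt (e.g. truncated by an
    interrupted writer), drop it and recompute. Illustrative only -- the
    hashed filename mirrors the spirit of PersistentDataset's hashfile."""
    hashfile = Path(cache_dir) / hashlib.md5(key.encode()).hexdigest()
    if hashfile.exists():
        try:
            return pickle.loads(hashfile.read_bytes())
        except (pickle.UnpicklingError, UnicodeDecodeError, EOFError):
            hashfile.unlink()  # corrupt cache entry: delete and rebuild below
    value = compute()
    hashfile.parent.mkdir(parents=True, exist_ok=True)
    hashfile.write_bytes(pickle.dumps(value))
    return value

cache = Path("/tmp/cache_demo")
if cache.exists():
    shutil.rmtree(cache)  # start from an empty cache for the demo

calls = []
def make():
    calls.append(1)
    return {"img": "vol_0001.npz", "slice": 3}

first = load_or_recompute(cache, "item-0", make)   # computes and caches
second = load_or_recompute(cache, "item-0", make)  # served from cache

# Simulate the failure mode: truncate the cached file mid-stream.
hashfile = cache / hashlib.md5(b"item-0").hexdigest()
hashfile.write_bytes(hashfile.read_bytes()[:5])
third = load_or_recompute(cache, "item-0", make)   # detects damage, rebuilds
print(len(calls))  # compute ran twice in total
```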

Environment

This occurs in Ubuntu 18.04.6 LTS.

python -c 'import monai; monai.config.print_debug_info()'

Output:

Printing MONAI config…

MONAI version: 0.8.0
Numpy version: 1.19.4
Pytorch version: 1.10.1+cu102
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 714d00dffe6653e21260160666c4c201ab66511b

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.0.2
scikit-image version: 0.16.2
Pillow version: 7.1.2
Tensorboard version: 2.3.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.11.2+cu102
tqdm version: 4.62.3
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: 1.0.1
einops version: 0.3.2
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: 1.22.0

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config…

psutil required for print_system_info

================================
Printing GPU config…

Num GPUs: 1
Has CUDA: True
CUDA version: 10.2
cuDNN enabled: True
cuDNN version: 7605
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Quadro RTX 8000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 7.5

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Nic-Ma commented, May 18, 2022

Hi @virginiafdez ,

As I said in the previous comment, you can change the pickle protocol by setting the MONAI dataset arg: https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/dataset.py#L207 Please give that a try first; if the issue still exists, let’s analyze further.

Thanks.
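The argument Nic-Ma links to controls which pickle protocol is used when the cache files are serialized. As a stdlib-only illustration of what switching protocol means (the dict below is a made-up stand-in shaped like the cached items described in the issue, not MONAI data):

```python
import pickle

# A cache item shaped like the one in the issue: file paths, a small list
# standing in for the numpy array, an integer and a string.
item = {"img": "case01_img.npz", "label": "case01_lbl.npz",
        "meta": [0.5, 1.0], "slice_id": 7, "subject": "case01"}

# An older protocol and the newest one available; both round-trip the same
# data, but the on-disk byte format (and size) differs.
for proto in (2, pickle.HIGHEST_PROTOCOL):
    blob = pickle.dumps(item, protocol=proto)
    assert pickle.loads(blob) == item  # round-trips under either protocol
    print(proto, len(blob))
```

Changing the protocol changes the bytes written to the cache, which is why it can sidestep a deserialization problem tied to one particular format.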

0 reactions
Nic-Ma commented, May 24, 2022

Cool, glad to see that.

Thanks.


