UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1598601: invalid start byte when using PersistentDataset
Hi,
While running a training job that uses PersistentDataset, I get a UnicodeDecodeError when it tries to load the cache files. This happens only after a substantial number of epochs, so it reads all the cached files successfully several times before raising the error. Before every run I delete the entire contents of the cache directory and check that it is empty. The error doesn't always happen in the same epoch.
IMPLEMENTATION DETAILS:
- PersistentDataset is instantiated with an input dictionary containing a label file and an image file, a small numpy array, an integer and a string, each under its own key. The transform is a simple Compose with a single LoadImageD that loads specific keys ('img', 'label') from NPZ files saved locally.
- While looping over this dataset, the cache files are created as expected during the first epoch and are then read back over many epochs (roughly 50 to 100) until the error occurs.
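For readers unfamiliar with this pattern, here is a minimal standard-library sketch of "compute once, then reload from disk", with a defensive fallback when a cache file cannot be read. It is not MONAI's implementation: `load_or_recompute` and its parameters are hypothetical names, and plain `pickle` stands in for `torch.save`/`torch.load`.

```python
import os
import pickle
import tempfile
from pathlib import Path


def load_or_recompute(cache_file, compute, protocol=pickle.DEFAULT_PROTOCOL):
    """Return the cached object, rebuilding the cache file if it cannot be read."""
    cache_file = Path(cache_file)
    if cache_file.is_file():
        try:
            with cache_file.open("rb") as f:
                return pickle.load(f)
        except Exception:
            # Unpickling damaged data can raise many exception types
            # (UnpicklingError, EOFError, UnicodeDecodeError, ...),
            # so drop the file and fall through to recompute it.
            cache_file.unlink(missing_ok=True)

    value = compute()
    # Write to a temporary file in the same directory and rename it into
    # place, so a concurrent reader never sees a half-written cache file.
    fd, tmp_name = tempfile.mkstemp(dir=cache_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f, protocol=protocol)
        os.replace(tmp_name, cache_file)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return value
```

Whether a fallback like this is appropriate depends on the root cause, which is not established here; it only illustrates one way to keep a long run alive when a single cache file turns out to be unreadable.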
I did not get this error with older versions of MONAI. I am now on 0.8.1.
Trace
    for i, item in enumerate(self.source):
  File "######/data/spadenai_v2_sliced.py", line 135, in getIteratorFun
    for volumes in dataset:
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 97, in __getitem__
    return self._transform(index)
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 364, in _transform
    pre_random_item = self._cachecheck(self.data[index])
  File "######/venv/lib/python3.8/site-packages/monai/data/dataset.py", line 330, in _cachecheck
    return torch.load(hashfile)
  File "######/venv/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "######/venv/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1598601: invalid start byte
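The trace goes through `_load(opened_zipfile, ...)`, which suggests the cache files are zip archives. Under that assumption, a quick diagnostic is to scan the cache directory with the standard library and flag files that fail a zip integrity check. This is only a sketch: `find_unreadable_cache_files` is a hypothetical helper, the `.pt` suffix is an assumption about how the cache files are named, and a file that passes this check could still fail during unpickling.

```python
import zipfile
from pathlib import Path


def find_unreadable_cache_files(cache_dir, suffix=".pt"):
    """Return (path, reason) pairs for cache files that fail a zip integrity check."""
    problems = []
    for path in sorted(Path(cache_dir).glob(f"*{suffix}")):
        if not zipfile.is_zipfile(path):
            problems.append((path, "not a zip archive"))
            continue
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() reads every member and returns the name of the
                # first one with a bad CRC or header, or None if all are fine.
                bad_member = archive.testzip()
        except (zipfile.BadZipFile, OSError) as exc:
            problems.append((path, f"cannot read archive: {exc}"))
            continue
        if bad_member is not None:
            problems.append((path, f"CRC mismatch in {bad_member}"))
    return problems
```

Running this right after a failure, before the cache directory is cleared, would show whether the file that triggered the error is damaged on disk or whether the problem lies elsewhere.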
Environment
This occurs on Ubuntu 18.04.6 LTS.
python -c 'import monai; monai.config.print_debug_info()'
Output:
Printing MONAI config...
MONAI version: 0.8.0
Numpy version: 1.19.4
Pytorch version: 1.10.1+cu102
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 714d00dffe6653e21260160666c4c201ab66511b

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.0.2
scikit-image version: 0.16.2
Pillow version: 7.1.2
Tensorboard version: 2.3.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.11.2+cu102
tqdm version: 4.62.3
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: 1.0.1
einops version: 0.3.2
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: 1.22.0

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
psutil required for print_system_info

================================
Printing GPU config...
Num GPUs: 1
Has CUDA: True
CUDA version: 10.2
cuDNN enabled: True
cuDNN version: 7605
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Quadro RTX 8000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 72
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 7.5
Top GitHub Comments
Hi @virginiafdez,
As I said in the previous comment, you can change the pickle protocol by setting the corresponding MONAI dataset argument: https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/dataset.py#L207 Please give it a try first; if the issue persists, let's analyze further.
Thanks.
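To make the protocol suggestion concrete, here is a small standard-library sketch of what "changing the pickle protocol" means: the same object is serialized with each binary protocol and read back. The helper `pickled_sizes` and the sample item are hypothetical. The MONAI call in the trailing comment assumes that `PersistentDataset` exposes a `pickle_protocol` argument, which should be checked against the linked `dataset.py` for the installed version.

```python
import pickle


def pickled_sizes(obj):
    """Round-trip obj through pickle protocols 2 and up; return {protocol: size in bytes}."""
    sizes = {}
    for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1):
        blob = pickle.dumps(obj, protocol=protocol)
        # From protocol 2 on, a pickle starts with the PROTO opcode (0x80)
        # followed by the protocol number.
        assert blob[0] == 0x80 and blob[1] == protocol
        assert pickle.loads(blob) == obj
        sizes[protocol] = len(blob)
    return sizes


# Hypothetical stand-in for one cached item (the real items hold image arrays).
item = {"img": [0.0] * 1000, "label": [0] * 1000, "index": 3, "name": "case_001"}
print(pickled_sizes(item))

# Assumption to verify against the linked dataset.py:
#   PersistentDataset(data=..., transform=..., cache_dir=...,
#                     pickle_protocol=pickle.HIGHEST_PROTOCOL)
```

After changing the protocol, existing cache files would normally need to be regenerated so that every file in the cache directory is written the same way.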
Cool, glad to see that.
Thanks.