Failure to hash (and cache) a `.map(...)` (almost always) - using this method can produce incorrect results
Describe the bug
Sometimes I get messages about not being able to hash a method:
```
Parameter 'function'=<function StupidDataModule._separate_speaker_id_from_dialogue at 0x7f1b27180d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
The function in question looks like this:
```python
@staticmethod
def _separate_speaker_id_from_dialogue(example: arrow_dataset.Example):
    speaker_id, dialogue = tuple(zip(*(example["dialogue"])))
    example["speaker_id"] = speaker_id
    example["dialogue"] = dialogue
    return example
```
This is the first step in my preprocessing pipeline. Sometimes the hashing-failure message does not appear on this first step but shows up on a later one. When it happens, previously cached data is not reused and every step is recomputed from scratch.
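For what it's worth, one way to check whether a transform will hash deterministically is to feed it to the internal `datasets.fingerprint.Hasher` that the caching layer uses. A minimal sketch (the `transform` below is a stand-in mirroring the function above, not the real pipeline code):

```python
from datasets.fingerprint import Hasher


def transform(example):
    # Stand-in for _separate_speaker_id_from_dialogue above.
    speaker_id, dialogue = tuple(zip(*example["dialogue"]))
    example["speaker_id"] = speaker_id
    example["dialogue"] = dialogue
    return example


try:
    digest = Hasher.hash(transform)
    print("fingerprint component:", digest)
except Exception as err:
    # This is the failure mode behind the warning: datasets falls back to a
    # random hash, so every call looks like a brand-new transform.
    print("transform could not be hashed:", err)
```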
Steps to reproduce the bug
```python
import copy

import datasets
from datasets import arrow_dataset


def main():
    dataset = datasets.load_dataset("blended_skill_talk")
    res = dataset.map(method)
    print(res)


def method(example: arrow_dataset.Example):
    example["previous_utterance_copy"] = copy.deepcopy(example["previous_utterance"])
    return example


if __name__ == "__main__":
    main()
```
Run with:

```
python -m reproduce_error
```
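To make the cache behaviour visible, a variant of the script (a sketch: it loads a single split so `.map` returns a `Dataset`, and peeks at the internal `_fingerprint` attribute) maps the same function twice:

```python
import copy

import datasets


def method(example):
    example["previous_utterance_copy"] = copy.deepcopy(example["previous_utterance"])
    return example


ds = datasets.load_dataset("blended_skill_talk", split="train")
first = ds.map(method)
second = ds.map(method)

# When hashing works, the second call is served from the cache and both
# results share a fingerprint; with the random-hash fallback the map is
# recomputed and the fingerprints differ.
print(first._fingerprint == second._fingerprint)
```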
Expected results
Dataset is mapped and cached correctly.
Actual results
The code outputs this at some point:
```
Parameter 'function'=<function method at 0x7faa83d2a160> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
```
Environment info
- `datasets` version: 2.3.1
- Platform: Ubuntu 20.04.3
- Python version: 3.9.12
- PyArrow version: 8.0.0
Top GitHub Comments
Installing `dill<0.3.5` after installing `datasets` with pip results in a dependency conflict with the version required by `multiprocess`. It can be solved by installing both simultaneously on a clean environment: `pip install datasets "dill<0.3.5"`.

This has been fixed in https://github.com/huggingface/datasets/pull/4516; we will do a new release soon to include the fix 😃
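Until the fixed release lands, one possible workaround (a sketch, not an official recommendation: it assumes the transform is deterministic, since you take over responsibility for cache correctness) is to pass an explicit `new_fingerprint` to `Dataset.map`, which sidesteps hashing the function altogether:

```python
import copy

import datasets


def method(example):
    example["previous_utterance_copy"] = copy.deepcopy(example["previous_utterance"])
    return example


ds = datasets.load_dataset("blended_skill_talk", split="train")
# The explicit fingerprint replaces the hash-derived one; bump the suffix
# whenever the transform changes, or stale cache entries will be reused.
res = ds.map(method, new_fingerprint="method-v1")
print(res)
```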