Failure to hash (and cache) a `.map(...)` (almost always) - using this method can produce incorrect results

See original GitHub issue

Describe the bug

Sometimes I get a message about a method that couldn't be hashed:

Parameter 'function'=<function StupidDataModule._separate_speaker_id_from_dialogue at 0x7f1b27180d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.

The function in question looks like this:

@staticmethod
def _separate_speaker_id_from_dialogue(example: arrow_dataset.Example):
    # Unzip the (speaker_id, utterance) pairs into two parallel tuples.
    speaker_id, dialogue = tuple(zip(*(example["dialogue"])))
    example["speaker_id"] = speaker_id
    example["dialogue"] = dialogue
    return example

This is the first step in my preprocessing pipeline, and sometimes the hashing failure does not appear on this first step but only on a later one. When it happens, the cached data cannot be reused and all steps are recomputed from scratch.
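As a quick self-check (this snippet is my addition, not part of the original report), the hashing can be exercised directly; it assumes the Hasher helper lives at datasets.fingerprint.Hasher, which is what map uses for fingerprinting its arguments:

import copy
from datasets.fingerprint import Hasher  # assumed location of the fingerprinting helper

def transform(example):
    example["previous_utterance_copy"] = copy.deepcopy(example["previous_utterance"])
    return example

# If this call raises, or prints a different value on each fresh run,
# Dataset.map cannot compute a stable fingerprint for the transform and
# falls back to a random hash (producing the warning quoted above).
print(Hasher.hash(transform))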

Steps to reproduce the bug

import copy
import datasets
from datasets import arrow_dataset

def main():
    dataset = datasets.load_dataset("blended_skill_talk")
    res = dataset.map(method)
    print(res)

def method(example: arrow_dataset.Example):
    example['previous_utterance_copy'] = copy.deepcopy(example['previous_utterance'])
    return example

if __name__ == '__main__':
    main()

Run with:

python -m reproduce_error

Expected results

Dataset is mapped and cached correctly.

Actual results

The code outputs this warning at some point:

Parameter 'function'=<function method at 0x7faa83d2a160> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
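One way to keep the cache usable while hashing is broken (a workaround I'm adding here, not something proposed in the issue; it assumes Dataset.map's new_fingerprint parameter, which overrides the automatically computed hash) is to fingerprint the step by hand on a single split:

import copy
import datasets

def method(example):
    example["previous_utterance_copy"] = copy.deepcopy(example["previous_utterance"])
    return example

train = datasets.load_dataset("blended_skill_talk", split="train")
# Reusing the same explicit fingerprint on later runs reuses the cached result,
# even if the function itself cannot be hashed. Bump the string whenever the
# transform changes, otherwise a stale cache entry will be returned.
res = train.map(method, new_fingerprint="separate_previous_utterance_v1")
print(res)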

Environment info

  • datasets version: 2.3.1
  • Platform: Ubuntu 20.04.3
  • Python version: 3.9.12
  • PyArrow version: 8.0.0

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

5 reactions
DrMatters commented, Jun 16, 2022

Installing dill<0.3.5 after installing datasets with pip results in a dependency conflict with the version required by multiprocess. It can be solved by installing both at the same time, on a clean environment: pip install datasets "dill<0.3.5"
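To confirm which dill actually ended up in the environment after that pinned install (my addition; it only assumes dill exposes __version__):

# The workaround above relies on dill < 0.3.5 being the version in use,
# since multiprocess declares its own dill requirement.
import dill

print(dill.__version__)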

1 reaction
lhoestq commented, Jun 28, 2022

This has been fixed in https://github.com/huggingface/datasets/pull/4516, we will do a new release soon to include the fix 😃

