question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. ItΒ collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Calls to map are not cached.

See original GitHub issue

Describe the bug

Somehow caching does not work for me anymore. Am I doing something wrong, or is there anything that I missed?

Steps to reproduce the bug


import datasets
datasets.set_caching_enabled(True)
sst = datasets.load_dataset("sst")

def foo(samples, i):
    print("executed", i[:10])
    return samples

# first call
x = sst.map(foo, batched=True, with_indices=True,  num_proc=2)

print('\n'*3, "#" * 30, '\n'*3)

# second call
y = sst.map(foo, batched=True, with_indices=True, num_proc=2)

# print version
import sys
import platform
print(f"""
- Datasets: {datasets.__version__}
- Python: {sys.version}
- Platform: {platform.platform()}
""")

Actual results

This code prints the following output for me:

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
#0:   0%|          | 0/5 [00:00<?, ?ba/s]
#1:   0%|          | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:00<00:00, 59.85ba/s]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:00<00:00, 60.85ba/s]
#0:   0%|          | 0/1 [00:00<?, ?ba/s]
#1:   0%|          | 0/1 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 69.32ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 70.93ba/s]
#0:   0%|          | 0/2 [00:00<?, ?ba/s]
#1:   0%|          | 0/2 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 63.25ba/s]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 57.69ba/s]



 ############################## 



#0:   0%|          | 0/5 [00:00<?, ?ba/s]
#1:   0%|          | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:00<00:00, 58.10ba/s]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 5/5 [00:00<00:00, 57.19ba/s]
#0:   0%|          | 0/1 [00:00<?, ?ba/s]
#1:   0%|          | 0/1 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 60.10ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 53.82ba/s]
#0:   0%|          | 0/2 [00:00<?, ?ba/s]
#1:   0%|          | 0/2 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
#0: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 72.76ba/s]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 71.55ba/s]

- Datasets: 1.6.1
- Python: 3.8.3 (default, May 19 2020, 18:47:26) 
[GCC 7.3.0]
- Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.10

Expected results

Caching should work.

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:6 (5 by maintainers)

github_iconTop GitHub Comments

3reactions
mariosaskocommented, May 5, 2021

Hi,

set keep_in_memory to False when loading a dataset (sst = load_dataset("sst", keep_in_memory=False)) to prevent it from loading in-memory. Currently, in-memory datasets fail to find cached files due to this check (always False for them):

https://github.com/huggingface/datasets/blob/241a0b4a3a868778ee91e767ad406f9da7610df2/src/datasets/arrow_dataset.py#L1718

@albertvillanova It seems like this behavior was overlooked in #2182.

0reactions
albertvillanovacommented, Jun 8, 2021

Please @villmow, feel free to update to Datasets latest version (1.8).

Read more comments on GitHub >

github_iconTop Results From Across the Web

Map Drives not working on cache profile - TechNet - Microsoft
In some instances though we have cases where say User A logs into his/her already cached profile and the map drives will not...
Read more >
Why don't common Map implementations cache the result of ...
Because caching assumes a particular use-case but will actually slow things down in others. It also adds a lot of complications.
Read more >
Caching a dataset with map() when loaded with from_dict()
For my specific use-case, I create the dataset using the .from_dict() method. I then process the dataset using .map() using the main process...
Read more >
What is map caching?β€”ArcGIS Server
If the data you see on the map needs to be live, with no time delay acceptable, caching is not appropriate. However, if...
Read more >
What is a cached map service? - Esri Support
A cached map service is a regular map service that has been enhanced to serve maps very quickly using a cache of static...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found