Calls to map are not cached.
Describe the bug
Somehow caching does not work for me anymore. Am I doing something wrong, or is there anything that I missed?
Steps to reproduce the bug
import datasets
datasets.set_caching_enabled(True)
sst = datasets.load_dataset("sst")
def foo(samples, i):
    print("executed", i[:10])
    return samples
# first call
x = sst.map(foo, batched=True, with_indices=True, num_proc=2)
print('\n'*3, "#" * 30, '\n'*3)
# second call
y = sst.map(foo, batched=True, with_indices=True, num_proc=2)
# print version
import sys
import platform
print(f"""
- Datasets: {datasets.__version__}
- Python: {sys.version}
- Platform: {platform.platform()}
""")
Actual results
This code prints the following output for me:
No config specified, defaulting to: sst/default
Reusing dataset sst (/home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
#0: 0%| | 0/5 [00:00<?, ?ba/s]
#1: 0%| | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|██████████| 5/5 [00:00<00:00, 59.85ba/s]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|██████████| 5/5 [00:00<00:00, 60.85ba/s]
#0: 0%| | 0/1 [00:00<?, ?ba/s]
#1: 0%| | 0/1 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|██████████| 1/1 [00:00<00:00, 69.32ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|██████████| 1/1 [00:00<00:00, 70.93ba/s]
#0: 0%| | 0/2 [00:00<?, ?ba/s]
#1: 0%| | 0/2 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
#0: 100%|██████████| 2/2 [00:00<00:00, 63.25ba/s]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|██████████| 2/2 [00:00<00:00, 57.69ba/s]
##############################
#0: 0%| | 0/5 [00:00<?, ?ba/s]
#1: 0%| | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|██████████| 5/5 [00:00<00:00, 58.10ba/s]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|██████████| 5/5 [00:00<00:00, 57.19ba/s]
#0: 0%| | 0/1 [00:00<?, ?ba/s]
#1: 0%| | 0/1 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|██████████| 1/1 [00:00<00:00, 60.10ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|██████████| 1/1 [00:00<00:00, 53.82ba/s]
#0: 0%| | 0/2 [00:00<?, ?ba/s]
#1: 0%| | 0/2 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
#0: 100%|██████████| 2/2 [00:00<00:00, 72.76ba/s]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|██████████| 2/2 [00:00<00:00, 71.55ba/s]
- Datasets: 1.6.1
- Python: 3.8.3 (default, May 19 2020, 18:47:26)
[GCC 7.3.0]
- Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.10
Expected results
Caching should work.
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (5 by maintainers)
Hi,
set keep_in_memory to False when loading a dataset (sst = load_dataset("sst", keep_in_memory=False)) to prevent it from loading in-memory. Currently, in-memory datasets fail to find cached files due to this check (always False for them): https://github.com/huggingface/datasets/blob/241a0b4a3a868778ee91e767ad406f9da7610df2/src/datasets/arrow_dataset.py#L1718
@albertvillanova It seems like this behavior was overlooked in #2182.
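To illustrate why a guard like the one described above makes in-memory datasets miss the cache every time, here is a small toy model in plain Python. It is only a sketch and not the library's code: ToyDataset, its cache_files attribute as used here, the hashing scheme, and the file naming are all hypothetical stand-ins chosen for the illustration.

```python
import hashlib
import os
import tempfile


class ToyDataset:
    """Toy stand-in for a dataset (hypothetical, NOT datasets.Dataset)."""

    def __init__(self, rows, cache_files):
        self.rows = rows
        # An empty list models an in-memory dataset with no backing files.
        self.cache_files = cache_files

    def map(self, fn, cache_dir):
        """Apply fn to every row; return (result, loaded_from_cache)."""
        # Hypothetical fingerprint built from the function name and the data.
        raw = (fn.__name__ + repr(self.rows)).encode()
        key = hashlib.sha256(raw).hexdigest()[:16]
        cache_path = os.path.join(cache_dir, f"cache-{key}.txt")

        # The guard: the cache is consulted only for disk-backed datasets,
        # so an in-memory dataset never gets past the first condition.
        if self.cache_files and os.path.exists(cache_path):
            with open(cache_path) as f:
                return [int(line) for line in f], True

        result = [fn(row) for row in self.rows]
        # Only disk-backed datasets write a cache file in this toy model.
        if self.cache_files:
            with open(cache_path, "w") as f:
                f.write("\n".join(str(value) for value in result))
        return result, False


def double(x):
    return 2 * x


with tempfile.TemporaryDirectory() as tmp:
    on_disk = ToyDataset([1, 2, 3], cache_files=["data.arrow"])
    in_memory = ToyDataset([1, 2, 3], cache_files=[])
    # Call map twice on each dataset and record whether the cache was used.
    hits_disk = [on_disk.map(double, tmp)[1] for _ in range(2)]
    hits_mem = [in_memory.map(double, tmp)[1] for _ in range(2)]

print("disk-backed:", hits_disk)
print("in-memory:  ", hits_mem)
```

In this sketch the disk-backed dataset reuses its result on the second call, while the in-memory one recomputes both times even though a matching cache file already exists on disk. That mirrors the repeated "executed" lines in the report.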
@villmow, please feel free to update to the latest Datasets version (1.8).