fetch_openml with mnist_784 uses excessive memory
See original GitHub issue

from sklearn.datasets import fetch_openml
fetch_openml(name="mnist_784")
This uses 3 GB of RAM during execution, and 1.5 GB remains allocated afterwards. Each additional run makes the memory usage go up by another 500 MB.
The whole dataset has 70k samples of dimension 784, so it should take about 500 MB in memory. I don't understand why the function uses so much.
This has caused numerous people to run into memory errors in the past.
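A quick back-of-the-envelope check supports the ~500 MB estimate, assuming the data is held as a single dense float64 array (the representation `fetch_openml` returns for numeric data):

```python
# Expected in-memory size of MNIST as a dense float64 array:
# 70,000 samples x 784 features x 8 bytes per value.
n_samples, n_features, itemsize = 70_000, 784, 8
expected_mib = n_samples * n_features * itemsize / 1024**2
print(f"{expected_mib:.0f} MiB")  # ~419 MiB
```

Anything far above that figure points to temporary copies made while parsing or assembling the result, not to the data itself.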
Issue Analytics

- State:
- Created: 2 years ago
- Reactions: 3
- Comments: 16 (15 by maintainers)
Top Results From Across the Web

sklearn.datasets.fetch_openml
OpenML ID of the dataset. The most specific way of retrieving a dataset. If data_id is not given, name (and potential version) are...

Cannot load MNIST Original dataset using fetch_openml in ...
Method fetch_openml() downloads the dataset from mldata.org, which is not stable and often cannot connect. An alternative is to download it manually ...

Optimizing Memory Usage Of Scikit-Learn Models Using ... - Zyte
On my machine, the loaded vectorizer uses about 82MB of memory in this case. If we add bigrams (by using CountVectorizer(ngram_range=(1,2))) then it...

RAM Problem When Using MNIST Dataset - Reddit
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
Somehow in the process, my RAM usage has skyrocketed ...

mnist_784 - OpenML
The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments

The high memory usage only happens for as_frame=True and not as_frame=False, so there is likely some inefficiency in the code creating the dataframe. In the snippet below, as_frame=True uses ~3.5GB and as_frame=False uses ~800MB.

I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. I get a benchmark similar to as_frame=False, but returning a dataframe. The rough implementation is there:
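The kind of overhead being discussed can be illustrated with a scaled-down toy sketch (this is not scikit-learn's actual code path, just an assumption-laden demonstration of why assembling a dataframe from per-column copies transiently costs more memory than the final result):

```python
import tracemalloc

import numpy as np
import pandas as pd

# Toy stand-in for MNIST, scaled down 10x (7,000 x 784 instead of
# 70,000 x 784) so the snippet runs quickly.
X = np.random.rand(7_000, 784)

tracemalloc.start()
# Building the frame from one copied array per column keeps all the
# per-column copies and the assembled DataFrame alive at the same time,
# so the peak exceeds the size of the data itself.
df = pd.DataFrame({f"pixel{i}": X[:, i].copy() for i in range(X.shape[1])})
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"array size:        {X.nbytes / 1024**2:.0f} MiB")
print(f"peak during build: {peak / 1024**2:.0f} MiB")
```

In practice, passing as_frame=False to fetch_openml sidesteps the dataframe-building step entirely, which matches the ~800 MB figure reported above.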