
fetch_openml with mnist_784 uses excessive memory

See original GitHub issue
from sklearn.datasets import fetch_openml
fetch_openml(name="mnist_784")

Uses 3 GB of RAM during execution and 1.5 GB afterwards. Each additional run increases memory usage by another 500 MB.

The whole dataset has 70k samples of dimension 784, so it should take about 500 MB in memory. I don't understand why the function uses so much memory.
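The 500 MB estimate can be checked with quick back-of-the-envelope arithmetic, assuming each value is stored as an 8-byte float64 (which is what a dense numeric dataframe uses):

```python
# 70k samples x 784 features, 8 bytes per float64 value
n_samples, n_features = 70_000, 784
size_mib = n_samples * n_features * 8 / 1024**2
print(f"{size_mib:.0f} MiB")  # 419 MiB, i.e. roughly 0.5 GB
```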

This has caused memory errors for numerous people in the past.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

2 reactions
lesteve commented, Dec 8, 2021

The high memory usage only happens with as_frame=True, not with as_frame=False, so there is likely some inefficiency in the code creating the dataframe. In the snippet below, as_frame=True uses ~3.5 GB and as_frame=False uses ~800 MB.

In [1]: from sklearn.datasets import fetch_openml
   ...: 
   ...: %load_ext memory_profiler
   ...: print('as_frame=True')
   ...: %memit fetch_openml(name="mnist_784", as_frame=True)
   ...: 
   ...: print('as_frame=False')
   ...: %memit fetch_openml(name="mnist_784", as_frame=False)
   ...: 
as_frame=True
peak memory: 3572.36 MiB, increment: 3477.56 MiB
as_frame=False
peak memory: 1396.97 MiB, increment: 783.14 MiB
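Outside IPython, a comparable single-call measurement can be sketched with the standard library's tracemalloc. The toy allocation below is a stand-in for the fetch_openml calls, and measure_peak is a hypothetical helper, not part of scikit-learn or memory_profiler:

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    # Run func once and report the peak Python-level memory it
    # allocated, in bytes (tracemalloc only sees Python allocations,
    # so numbers will differ from %memit's process-level RSS).
    tracemalloc.start()
    result = func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Stand-in for fetch_openml(name="mnist_784", as_frame=True/False):
_, peak = measure_peak(lambda: [float(i) for i in range(1_000_000)])
print(f"peak: {peak / 1024**2:.1f} MiB")
```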
1 reaction
glemaitre commented, Dec 9, 2021

I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. The benchmark is similar to as_frame=False while still returning a dataframe:

In [8]:  %memit arff_reader_via_pandas("mnist_784.arff")
peak memory: 1212.86 MiB, increment: 483.52 MiB

Here is the rough implementation:

# %%
def _strip_quotes(string):
    for quotes_char in ["'", '"']:
        if string.startswith(quotes_char) and string.endswith(quotes_char):
            string = string[1:-1]
    return string


# %%
def _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes):
    import pandas as pd
    dtypes = {}
    for feature_name in df.columns:
        pd_dtype = df[feature_name].dtype
        arff_dtype = arff_dtypes[feature_name]
        if arff_dtype.lower() in ("numeric", "real", "integer"):
            # pandas will properly parse numerical values
            dtypes[feature_name] = pd_dtype
        elif arff_dtype.startswith("{") and arff_dtype.endswith("}"):
            categories = arff_dtype[1:-1].split(",")
            categories = [_strip_quotes(category) for category in categories]
            if pd_dtype.kind == "i":
                categories = [int(category) for category in categories]
            elif pd_dtype.kind == "f":
                categories = [float(category) for category in categories]
            dtypes[feature_name] = pd.CategoricalDtype(categories)
        else:
            dtypes[feature_name] = pd_dtype
    return dtypes


# %%
def _parse_arff_header(filename):
    """Parse the ARFF header: feature names, ARFF dtypes, and the line
    index of the @data tag (data rows start right after it)."""
    line_tag_data = 0
    columns = []
    arff_dtypes = {}
    with open(filename, "r") as f:
        for idx_line, line in enumerate(f):
            if line.lower().startswith("@attribute"):
                # maxsplit=2 keeps nominal specs such as "{0, 1}" in one piece
                _, feature_name, feature_type = line.split(maxsplit=2)
                feature_name = _strip_quotes(feature_name)
                columns.append(feature_name)
                arff_dtypes[feature_name] = feature_type.strip()
            if line.lower().startswith("@data"):
                line_tag_data = idx_line
                break
    return columns, arff_dtypes, line_tag_data


# %%
def arff_reader_via_pandas(filename):
    import pandas as pd

    columns, arff_dtypes, line_tag_data = _parse_arff_header(filename)
    # Skip the header; "?" is the ARFF marker for missing values.
    df = pd.read_csv(
        filename,
        skiprows=line_tag_data + 1,
        header=None,
        na_values=["?"],
    )
    df.columns = columns
    dtypes = _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes)
    df = df.astype(dtypes)

    return df
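As a sanity check, the approach can be exercised end to end on a tiny synthetic ARFF snippet. This is a self-contained sketch with simplified inline copies of the helpers above; the file contents and column names are made up for illustration:

```python
import io

import pandas as pd

arff_text = """@relation toy
@attribute 'width' numeric
@attribute 'class' {a,b}
@data
1.5,a
2.0,b
0.5,?
"""

# Parse the header the same way the reader above does.
columns, arff_dtypes, line_tag_data = [], {}, 0
for idx_line, line in enumerate(arff_text.splitlines()):
    if line.lower().startswith("@attribute"):
        _, name, ftype = line.split(maxsplit=2)
        name = name.strip("'\"")
        columns.append(name)
        arff_dtypes[name] = ftype.strip()
    if line.lower().startswith("@data"):
        line_tag_data = idx_line
        break

# Data rows start right after @data; "?" marks missing values.
df = pd.read_csv(
    io.StringIO(arff_text),
    skiprows=line_tag_data + 1,
    header=None,
    na_values=["?"],
)
df.columns = columns

# A nominal attribute like "{a,b}" becomes a pandas categorical.
df["class"] = df["class"].astype(
    pd.CategoricalDtype(arff_dtypes["class"][1:-1].split(","))
)
print(df.dtypes)  # width: float64, class: category
```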