
fetch_openml with mnist_784 uses excessive memory

See original GitHub issue
from sklearn.datasets import fetch_openml
fetch_openml(name="mnist_784")

Uses 3 GB of RAM during execution and 1.5 GB afterwards. Each additional run increases memory usage by another 500 MB.

The whole dataset has 70k samples of dimension 784, so it should take about 500 MB in memory. I don't understand why the function uses so much memory.
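The 500 MB estimate can be checked with quick back-of-the-envelope arithmetic, assuming each value is stored as an 8-byte float64 (which is what a dense numeric dataframe uses):

```python
# 70k samples x 784 features, 8 bytes per float64 value
n_samples, n_features = 70_000, 784
size_mib = n_samples * n_features * 8 / 1024**2
print(f"{size_mib:.0f} MiB")  # 419 MiB, i.e. roughly 0.5 GB
```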

This has caused memory errors for numerous people in the past.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 16 (15 by maintainers)

Top GitHub Comments

2 reactions
lesteve commented, Dec 8, 2021

The high memory usage only happens with as_frame=True, not with as_frame=False, so there is likely some inefficiency in the code creating the dataframe. In the snippet below, as_frame=True uses ~3.5 GB and as_frame=False uses ~800 MB.

In [1]: from sklearn.datasets import fetch_openml
   ...: 
   ...: %load_ext memory_profiler
   ...: print('as_frame=True')
   ...: %memit fetch_openml(name="mnist_784", as_frame=True)
   ...: 
   ...: print('as_frame=False')
   ...: %memit fetch_openml(name="mnist_784", as_frame=False)
   ...: 
as_frame=True
peak memory: 3572.36 MiB, increment: 3477.56 MiB
as_frame=False
peak memory: 1396.97 MiB, increment: 783.14 MiB
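Outside IPython, a comparable single-call measurement can be sketched with the standard library's tracemalloc. The toy allocation below is a stand-in for the fetch_openml calls, and measure_peak is a hypothetical helper, not part of scikit-learn or memory_profiler:

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    # Run func once and report the peak Python-level memory it
    # allocated, in bytes (tracemalloc only sees Python allocations,
    # so numbers will differ from %memit's process-level RSS).
    tracemalloc.start()
    result = func(*args, **kwargs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Stand-in for fetch_openml(name="mnist_784", as_frame=True/False):
_, peak = measure_peak(lambda: [float(i) for i in range(1_000_000)])
print(f"peak: {peak / 1024**2:.1f} MiB")
```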
1 reaction
glemaitre commented, Dec 9, 2021

I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. The benchmark is similar to as_frame=False while still returning a dataframe:

In [8]:  %memit arff_reader_via_pandas("mnist_784.arff")
peak memory: 1212.86 MiB, increment: 483.52 MiB

Here is the rough implementation:

# %%
def _strip_quotes(string):
    for quotes_char in ["'", '"']:
        if string.startswith(quotes_char) and string.endswith(quotes_char):
            string = string[1:-1]
    return string


# %%
def _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes):
    import pandas as pd
    dtypes = {}
    for feature_name in df.columns:
        pd_dtype = df[feature_name].dtype
        arff_dtype = arff_dtypes[feature_name]
        if arff_dtype.lower() in ("numeric", "real", "integer"):
            # pandas will properly parse numerical values
            dtypes[feature_name] = pd_dtype
        elif arff_dtype.startswith("{") and arff_dtype.endswith("}"):
            categories = arff_dtype[1:-1].split(",")
            categories = [_strip_quotes(category) for category in categories]
            if pd_dtype.kind == "i":
                categories = [int(category) for category in categories]
            elif pd_dtype.kind == "f":
                categories = [float(category) for category in categories]
            dtypes[feature_name] = pd.CategoricalDtype(categories)
        else:
            dtypes[feature_name] = pd_dtype
    return dtypes


# %%
def _parse_arff_header(filename):
    """Parse the ARFF header: feature names, ARFF dtypes, and the line
    index of the @data tag (data rows start right after it)."""
    line_tag_data = 0
    columns = []
    arff_dtypes = {}
    with open(filename, "r") as f:
        for idx_line, line in enumerate(f):
            if line.lower().startswith("@attribute"):
                # maxsplit=2 keeps nominal specs such as "{0, 1}" in one piece
                _, feature_name, feature_type = line.split(maxsplit=2)
                feature_name = _strip_quotes(feature_name)
                columns.append(feature_name)
                arff_dtypes[feature_name] = feature_type.strip()
            if line.lower().startswith("@data"):
                line_tag_data = idx_line
                break
    return columns, arff_dtypes, line_tag_data


# %%
def arff_reader_via_pandas(filename):
    import pandas as pd

    columns, arff_dtypes, line_tag_data = _parse_arff_header(filename)
    # Skip the header; "?" is the ARFF marker for missing values.
    df = pd.read_csv(
        filename,
        skiprows=line_tag_data + 1,
        header=None,
        na_values=["?"],
    )
    df.columns = columns
    dtypes = _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes)
    df = df.astype(dtypes)

    return df
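As a sanity check, the approach can be exercised end to end on a tiny synthetic ARFF snippet. This is a self-contained sketch with simplified inline copies of the helpers above; the file contents and column names are made up for illustration:

```python
import io

import pandas as pd

arff_text = """@relation toy
@attribute 'width' numeric
@attribute 'class' {a,b}
@data
1.5,a
2.0,b
0.5,?
"""

# Parse the header the same way the reader above does.
columns, arff_dtypes, line_tag_data = [], {}, 0
for idx_line, line in enumerate(arff_text.splitlines()):
    if line.lower().startswith("@attribute"):
        _, name, ftype = line.split(maxsplit=2)
        name = name.strip("'\"")
        columns.append(name)
        arff_dtypes[name] = ftype.strip()
    if line.lower().startswith("@data"):
        line_tag_data = idx_line
        break

# Data rows start right after @data; "?" marks missing values.
df = pd.read_csv(
    io.StringIO(arff_text),
    skiprows=line_tag_data + 1,
    header=None,
    na_values=["?"],
)
df.columns = columns

# A nominal attribute like "{a,b}" becomes a pandas categorical.
df["class"] = df["class"].astype(
    pd.CategoricalDtype(arff_dtypes["class"][1:-1].split(","))
)
print(df.dtypes)  # width: float64, class: category
```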