
map_partitions or map_blocks with large objects eats up scheduler memory


Opening this report based on the investigation in https://github.com/dask/dask-ml/issues/842 and https://github.com/dask/dask-ml/pull/843.

When passing large objects as arguments to map_blocks / map_partitions, the scheduler memory can quickly be overwhelmed. Large objects should be wrapped in delayed to add them to the graph and avoid this issue (see here), but the way object sizes are determined misses some large objects. In the cases I’ve encountered, it does not correctly compute the size of scikit-learn estimators.

This is because sys.getsizeof does not traverse object references to compute the “real” size of an object. Doing that is notoriously difficult in Python, so I’m not sure what the best course of action here should be.
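For illustration (a toy class standing in for an estimator, not from the original report), sys.getsizeof only reports the shallow size of an object and sees nothing it references:

import sys
import numpy as np

class Model:
    def __init__(self):
        self.weights = np.ones(10_000_000)  # ~80 MB of float64 data

m = Model()
print(sys.getsizeof(m))  # ~50 bytes: just the instance itself
print(m.weights.nbytes)  # 80000000: the referenced data it never sees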

Minimal Complete Verifiable Example:

Running on a machine with 2 cores and 16 GB of RAM:

from dask_ml.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import dask
from dask.distributed import Client

client = Client()

X, y = make_classification(
    n_samples=50000,
    chunks=1000,
    random_state=42,
)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
_ = rf.fit(X, y)


def dask_predict(part, model):
    return model.predict(part)

# The fitted model is passed directly and ends up serialized into the task graph
preds = X.map_blocks(
    dask_predict,
    model=rf,
    dtype="int",
    drop_axis=1,
)
preds.compute()

Scheduler memory balloons up to this:

[screenshot: scheduler memory usage]

And ends up with this error:

KilledWorker: ("('normal-dask_predict-4b858b85224825aeb2d45678c4c91d39', 27)", <WorkerState 'tcp://127.0.0.1:33369', name: 0, memory: 0, processing: 50>)

If we explicitly wrap the rf object in delayed,

# Wrapping the model in delayed adds it to the graph once under its own key,
# rather than embedding a copy in each task
rf_delayed = dask.delayed(rf)
preds = X.map_blocks(
    dask_predict,
    model=rf_delayed,
    dtype="int",
    drop_axis=1,
    meta=np.array([1]),
)
preds.compute()

the memory use looks like this:

[screenshot: scheduler memory usage with the delayed model]

Environment:

  • Dask version: 2021.5.1
  • Python version: 3.7
  • Operating System: Ubuntu
  • Install method (conda, pip, source): conda


Top GitHub Comments

1 reaction
jrbourbeau commented, Jul 2, 2021

Thanks for the clear writeup @rikturr! Indeed it looks like we’re underestimating the size of the RandomForestClassifier model in your example:

In [1]: from sklearn.ensemble import RandomForestClassifier

In [2]: rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

In [3]: from dask.sizeof import sizeof

In [4]: sizeof(rf)
Out[4]: 48

One way to improve the situation is to register a custom sizeof implementation in dask/sizeof.py for scikit-learn estimators that more accurately captures their memory footprint.
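As a minimal sketch of what such a registration could look like, using dask’s sizeof dispatch (the pickle-based estimate here is just one possible heuristic, not a settled design):

from dask.sizeof import sizeof

@sizeof.register_lazy("sklearn")
def register_sklearn():
    import pickle
    from sklearn.base import BaseEstimator

    @sizeof.register(BaseEstimator)
    def sizeof_estimator(est):
        # Approximate the in-memory footprint by the pickled size;
        # crude, but far closer than the shallow 48 bytes above
        return len(pickle.dumps(est))

register_lazy defers importing scikit-learn until a sizeof call actually encounters an object from that package, which is how dask/sizeof.py already handles optional libraries like pandas.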

One thing that comes to mind is to include information from estimator.get_params():

In [5]: rf.get_params()
Out[5]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [6]: sizeof(rf.get_params())
Out[6]: 2604

though there may be other attributes stored on the estimator that are not captured by get_params.
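For instance (an illustrative sketch), the fitted attributes such as estimators_, which carry the trained trees, are where the real memory lives:

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Hyperparameters are tiny; the fitted trees dominate the footprint
print(len(pickle.dumps(rf.get_params())))  # a few hundred bytes
print(len(pickle.dumps(rf.estimators_)))   # typically megabytes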

cc @thomasjpfan as you might find this interesting

0 reactions
jsignell commented, Jul 19, 2021

I prefer Option 2b. It seems like a good idea to do as well as we can with the sizeof estimation.
