
map_partitions or map_blocks with large objects eats up scheduler memory


Opening this report based on the investigation in https://github.com/dask/dask-ml/issues/842 and https://github.com/dask/dask-ml/pull/843.

When passing large objects as arguments to map_blocks / map_partitions, the scheduler memory can quickly be overwhelmed. Large objects should be wrapped in delayed to add them to the graph and avoid this issue (see here), but the way object sizes are determined misses some large objects. In the cases I’ve encountered, it does not correctly compute the size of scikit-learn estimators.

This is because sys.getsizeof does not traverse object references to compute the “real” size of an object. Doing that is notoriously difficult in Python, so I’m not sure what the best course of action here should be.
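For illustration (a toy class standing in for an estimator, not from the original report), sys.getsizeof only reports the shallow size of an object and sees nothing it references:

import sys
import numpy as np

class Model:
    def __init__(self):
        self.weights = np.ones(10_000_000)  # ~80 MB of float64 data

m = Model()
print(sys.getsizeof(m))  # ~50 bytes: just the instance itself
print(m.weights.nbytes)  # 80000000: the referenced data it never sees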

Minimal Complete Verifiable Example:

Running on a machine with 2 cores and 16 GB of RAM:

from dask_ml.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import dask
from dask.distributed import Client

client = Client()

X, y = make_classification(
    n_samples=50000,
    chunks=1000,
    random_state=42,
)

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
_ = rf.fit(X, y)


def dask_predict(part, model):
    return model.predict(part)

# The fitted model is passed directly and ends up serialized into the task graph
preds = X.map_blocks(
    dask_predict,
    model=rf,
    dtype="int",
    drop_axis=1,
)
preds.compute()

Scheduler memory balloons up to this:

[screenshot: scheduler memory usage]

And ends up with this error:

KilledWorker: ("('normal-dask_predict-4b858b85224825aeb2d45678c4c91d39', 27)", <WorkerState 'tcp://127.0.0.1:33369', name: 0, memory: 0, processing: 50>)

If we explicitly wrap the rf object in delayed,

# Wrapping the model in delayed adds it to the graph once under its own key,
# rather than embedding a copy in each task
rf_delayed = dask.delayed(rf)
preds = X.map_blocks(
    dask_predict,
    model=rf_delayed,
    dtype="int",
    drop_axis=1,
    meta=np.array([1]),
)
preds.compute()

the memory use looks like this:

[screenshot: scheduler memory usage with the delayed model]

Environment:

  • Dask version: 2021.5.1
  • Python version: 3.7
  • Operating System: Ubuntu
  • Install method (conda, pip, source): conda


Top GitHub Comments

1 reaction
jrbourbeau commented, Jul 2, 2021

Thanks for the clear writeup @rikturr! Indeed it looks like we’re underestimating the size of the RandomForestClassifier model in your example:

In [1]: from sklearn.ensemble import RandomForestClassifier

In [2]: rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

In [3]: from dask.sizeof import sizeof

In [4]: sizeof(rf)
Out[4]: 48

One way to improve the situation is to register a custom sizeof implementation in dask/sizeof.py for scikit-learn estimators that more accurately captures their memory footprint.
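As a minimal sketch of what such a registration could look like, using dask’s sizeof dispatch (the pickle-based estimate here is just one possible heuristic, not a settled design):

from dask.sizeof import sizeof

@sizeof.register_lazy("sklearn")
def register_sklearn():
    import pickle
    from sklearn.base import BaseEstimator

    @sizeof.register(BaseEstimator)
    def sizeof_estimator(est):
        # Approximate the in-memory footprint by the pickled size;
        # crude, but far closer than the shallow 48 bytes above
        return len(pickle.dumps(est))

register_lazy defers importing scikit-learn until a sizeof call actually encounters an object from that package, which is how dask/sizeof.py already handles optional libraries like pandas.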

One thing that comes to mind is to include information from estimator.get_params():

In [5]: rf.get_params()
Out[5]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [6]: sizeof(rf.get_params())
Out[6]: 2604

though there may be other attributes stored on the estimator that are not captured by get_params.
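For instance (an illustrative sketch), the fitted attributes such as estimators_, which carry the trained trees, are where the real memory lives:

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Hyperparameters are tiny; the fitted trees dominate the footprint
print(len(pickle.dumps(rf.get_params())))  # a few hundred bytes
print(len(pickle.dumps(rf.estimators_)))   # typically megabytes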

cc @thomasjpfan as you might find this interesting

0 reactions
jsignell commented, Jul 19, 2021

I prefer Option 2b. It seems like a good idea to do as well as we can with the sizeof estimation.
