
Cannot persist dask.dataframes

See original GitHub issue

What happened:

DataFrame collections such as dask DataFrames or dask-cudf DataFrames can no longer be persisted after release 2021.2.0. @wphicks triaged the regression to the merge of this PR: https://github.com/dask/distributed/pull/4406

What you expected to happen:

Persist to work (see reproducer)

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import numpy as np
import pandas as pd

from dask.distributed import Client
from dask.distributed import LocalCluster


def persist_across_workers(client, objects, workers=None):
    if workers is None:
        # Default to all workers. Note: has_what().keys() returns a
        # dict_keys view, not a list.
        workers = client.has_what().keys()
    return client.persist(objects, workers={o: workers for o in objects})


if __name__ == "__main__":

    cluster = LocalCluster()
    client = Client(cluster)

    X = np.ones((10000, 20))

    X_df = pd.DataFrame(X)
    X_dist = dd.from_pandas(X_df, npartitions=2)

    X_f = persist_across_workers(client, X_dist)

Output:

distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 39, in dumps
    small_header, small_payload = dumps_msgpack(msg, **compress_opts)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 184, in dumps_msgpack
    payload = msgpack.dumps(msg, default=msgpack_encode_default, use_bin_type=True)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 292, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 298, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 295, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 289, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'dict_keys' object
distributed.comm.utils - ERROR - can not serialize 'dict_keys' object
Traceback (most recent call last):
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/comm/utils.py", line 32, in _to_frames
    protocol.dumps(
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 39, in dumps
    small_header, small_payload = dumps_msgpack(msg, **compress_opts)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 184, in dumps_msgpack
    payload = msgpack.dumps(msg, default=msgpack_encode_default, use_bin_type=True)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 292, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 298, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 295, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 289, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'dict_keys' object
distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/comm/tcp.py", line 230, in write
    frames = await to_frames(
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/comm/utils.py", line 52, in to_frames
    return _to_frames()
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/comm/utils.py", line 32, in _to_frames
    protocol.dumps(
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 39, in dumps
    small_header, small_payload = dumps_msgpack(msg, **compress_opts)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/distributed/protocol/core.py", line 184, in dumps_msgpack
    payload = msgpack.dumps(msg, default=msgpack_encode_default, use_bin_type=True)
  File "/home/galahad/miniconda3/envs/ns0208/lib/python3.8/site-packages/msgpack/__init__.py", line 35, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 292, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 298, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 295, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 289, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'dict_keys' object
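The `TypeError` at the bottom of each traceback points at the root cause: `client.has_what().keys()` returns a `dict_keys` view, which msgpack cannot serialize, whereas a plain list of address strings can be. A minimal sketch of a workaround, using a hypothetical `normalize_workers` helper (not part of dask) to coerce the argument before it reaches `client.persist`:

```python
def normalize_workers(workers):
    """Coerce a workers argument to a plain list.

    dict views (dict_keys), sets, and generators are not
    msgpack-serializable, but a list of address strings is.
    """
    if workers is None:
        return None
    return list(workers)


# In the reproducer above this would be applied as:
#   workers = normalize_workers(client.has_what().keys())
has_what = {"tcp://127.0.0.1:34567": [], "tcp://127.0.0.1:34568": []}
print(normalize_workers(has_what.keys()))
# → ['tcp://127.0.0.1:34567', 'tcp://127.0.0.1:34568']
```

This only sidesteps the serialization failure on the client side; the underlying change in behavior after the linked PR is what the issue is about.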

Environment:

  • Dask version: 2021.2.0
  • Distributed version: 2021.2.0 from conda and built from master after https://github.com/dask/distributed/pull/4406
  • Python version: 3.7 and 3.8
  • Operating System: Linux / AMD64
  • Install method (conda, pip, source): conda and from source

cc @jakirkham @pentschev @madsbk @wphicks

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

quasiben commented, Feb 8, 2021 (3 reactions)

@trivialfis you might be interested in this. I think xgboost may do something similar?

ian-r-rose commented, Feb 8, 2021 (3 reactions)

Yes, that’s right @jrbourbeau.

@jakirkham I agree that an improved error message would be helpful. At the very least, we could do a better job of ensuring that the shape of the priority/workers/etc. arguments makes sense (i.e., an iterable for workers, a number for priority, and an error for a dict of collections).
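A sketch of what that up-front validation could look like. This is a hypothetical helper written to illustrate the suggestion, not an actual distributed API:

```python
from numbers import Number


def validate_persist_args(workers=None, priority=0):
    """Validate persist arguments along the lines suggested above:
    priority must be a number; workers must be a flat iterable of
    worker address strings (a dict of collections is rejected)."""
    if not isinstance(priority, Number):
        raise TypeError(
            f"priority must be a number, got {type(priority).__name__}"
        )
    if workers is None:
        return None
    if isinstance(workers, (str, bytes)):
        workers = [workers]
    try:
        workers = list(workers)
    except TypeError:
        raise TypeError("workers must be an iterable of worker addresses")
    if any(not isinstance(w, str) for w in workers):
        raise TypeError(
            "workers must be flat address strings, not collections"
        )
    return workers
```

Raising a `TypeError` with a message like these at the `client.persist` boundary would have surfaced the `dict_keys` problem immediately, instead of deep inside the msgpack packer.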


Top Results From Across the Web

  • unable to persist dask dataframe after read_sql_table
    “I am trying to read a database table into a dask dataframe and then persist the dataframe. I have tried a few variations,...”
  • Dask DataFrames Best Practices
    “Persist is important because Dask DataFrame is lazy by default. It is a way of telling the cluster that it should start executing...”
  • Storing Dask DataFrames in Memory with persist - Coiled
    “Many Dask users erroneously assume that Dask DataFrames are persisted in memory by default, which isn't true. Dask runs computations in memory.”
  • Dask and Pandas: There's No Such Thing as Too Much Data
    “Another option is to switch from your pandas Dataframe objects to Dask Dataframes, which is what we'll do here.”
  • How we learned to love Dask and achieved a 40x speedup
    “The problem I need to solve is embarrassingly parallel, ... We have used Dask dataframes previously for one-off computations, ...”
