
dask bag `take` ignores shared dependencies


Normally, take seems to intelligently “propagate” its way down, so that a take(1) doesn’t evaluate the entire task graph before returning. This is super useful for exploring large data sets.

Unfortunately, this doesn’t seem to work for “diamond” / shared dependencies (like those created with zip), where the task graph looks like:

     --------
     | Take |
     --------
        /\
      /    \
    /        \
--------  --------
| Task |  | Task |
--------  --------
    \        /
      \    /
        \/
     --------
     | Base |
     --------

In this diamond dependency, even a trivial take(1) seems to evaluate the entire task graph, even though it should be easy for take(1) to percolate down through the zip.

I saw this on Python 3.5 and dask 0.10.2 (both from conda-forge).
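
A minimal sketch that reproduces the shape of the problem (the file pattern and the mapped operations are hypothetical, not from the original report):

    import dask.bag as db

    # One partition per file, as read_text does by default.
    base = db.read_text("data/*.txt")

    # Two branches off the same base form the diamond above.
    left = base.map(str.strip)
    right = base.map(len)

    # zip rejoins the branches; take(1) now computes whole partitions
    # instead of streaming a single element.
    pairs = db.zip(left, right)
    print(pairs.take(1))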

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Aug 3, 2016

Yup, so here’s what’s going on.

Normally dask.bag partitions are iterators; they stream data if possible. This means that it’s even nicer than normal dask collections like dask.dataframe, because we only need to compute a tiny bit of a partition to get something like take(1). Because of this streaming-within-partitions ability, dask.bag.read_text feels free to read each file as one partition by default.

This is great except when you tee off. When this happens we no longer trust iterators (because each part of the tee would separately consume the same iterator), so we reify everything into a list. Now we’re still good about only computing the right partitions, but we’re bad about being lazy within each partition.
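
To see why shared iterators can’t be trusted, here is a plain-Python illustration (an editorial sketch, not dask code): two consumers of one iterator steal elements from each other, which is exactly what reifying the shared partition into a list prevents.

    it = iter([1, 2, 3])
    a = it
    b = it        # two consumers sharing one iterator
    next(a)       # -> 1
    next(b)       # -> 2, not 1: b silently skipped an element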

Unfortunately, now our partition-per-file choice becomes bad: we’re reading the entire file. You can avoid this choice by explicitly sending in a chunksize=10000000 or something, which will break the file up into 10 MB blocks. Then take(1) will only process 10000000 bytes instead of the entire file.

Does this help?
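
A minimal sketch of that workaround (the path is hypothetical; note that in recent dask releases the read_text parameter for this is named blocksize, while the comment above uses the name from the dask 0.10.x era):

    import dask.bag as db

    # Cut each file into ~10 MB blocks so take(1) only has to
    # reify one block instead of a whole file.
    bag = db.read_text("data/*.txt", blocksize=10000000)
    print(bag.take(1))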

0 reactions
jakirkham commented, Aug 8, 2019

It seems like it should still be possible (maybe even easier today) to optimize this case. Is there any interest in taking this on?
