Foldby not combining in parallel.

Hey guys.

I've been using dask for ad-hoc big data analysis, creating custom reports while processing a few terabytes of JSON data.

I’ve noticed that the foldby operation ends with one huge sequential task, which runs after binop has been applied to each partition. In theory, if a combine function is supplied, couldn’t this step be done in parallel by taking the outputs of the partitions and combining them pairwise?

Does this make sense or did I miss something?

Sample code:

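(
    bag
    .filter(predicate)
    .map(js.loads)  # js: a JSON module alias, e.g. `import json as js` (assumed)
    .map(transform)
    # binop folds items within a partition; combine merges per-partition results
    .foldby(key, binop, initial, combine=combine, combine_initial=initial)
    .compute()
)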

Cheers!
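
The pairwise combine proposed in the question can be sketched in plain Python. This is only an illustration of the idea, not dask's implementation: fold_partition, merge_two, and pairwise_combine are hypothetical helper names, the data is made up, and the combine_initial argument is left out for brevity.

```python
from operator import add


def fold_partition(items, key, binop, initial):
    """Per-partition step: fold each item into the running total for its key."""
    totals = {}
    for item in items:
        k = key(item)
        totals[k] = binop(totals.get(k, initial), item)
    return totals


def merge_two(left, right, combine):
    """Merge two per-partition results, combining values that share a key."""
    merged = dict(left)
    for k, v in right.items():
        merged[k] = combine(merged[k], v) if k in merged else v
    return merged


def pairwise_combine(results, combine):
    """Merge neighbouring results level by level.

    The merges within one level do not depend on each other, so a
    scheduler could run them in parallel.
    """
    if not results:
        return {}
    while len(results) > 1:
        next_level = []
        for i in range(0, len(results), 2):
            pair = results[i:i + 2]
            if len(pair) == 2:
                next_level.append(merge_two(pair[0], pair[1], combine))
            else:
                # Odd one out: carry it to the next level unchanged.
                next_level.append(pair[0])
        results = next_level
    return results[0]


def is_even(x):
    return x % 2 == 0


# Made-up data: three partitions of integers, summed by parity.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [fold_partition(p, is_even, add, 0) for p in partitions]
result = pairwise_combine(partials, add)
```

With n partitions this takes roughly log2(n) levels of merges instead of a single task that receives every partition's output at once.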

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 19 (19 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Sep 14, 2017

There are a few other tree reduction implementations in the codebase. You might want to search for split_every to find them.
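
As a rough illustration of what a split_every-style setting controls in a tree reduction, here is a plain-Python sketch. It is not taken from the dask codebase: tree_reduce is a hypothetical helper, and the input is a made-up list of per-partition partial sums. Each level aggregates groups of at most split_every results, so a smaller value means more levels with smaller tasks.

```python
def tree_reduce(parts, aggregate, split_every=8):
    """Reduce `parts` by aggregating groups of at most `split_every` per level.

    Returns (result, depth), where depth is the number of levels used.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    if not parts:
        raise ValueError("parts must not be empty")
    depth = 0
    while len(parts) > 1:
        # Every group in this level can be aggregated independently.
        parts = [
            aggregate(parts[i:i + split_every])
            for i in range(0, len(parts), split_every)
        ]
        depth += 1
    return parts[0], depth


# Made-up input: 100 per-partition partial sums.
partials = list(range(100))

# One wide task that sees all 100 partial results at once.
total_wide, depth_wide = tree_reduce(partials, sum, split_every=100)

# Groups of four: 100 -> 25 -> 7 -> 2 -> 1 results.
total_tree, depth_tree = tree_reduce(partials, sum, split_every=4)
```

Both calls produce the same total; they differ only in how many results any single task has to aggregate and in how many levels the reduction needs.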

1 reaction
mrocklin commented, Sep 13, 2017

I’m not sure why dictitems is not applied in the second case. Do any of the tests fail with its addition? That might be a useful way to see an unintended change.

My bad, I’ve been using open-source software for a little over a year and this is my first time contributing. Still a n00b.

Different groups operate differently, but most dask devs congregate on GitHub rather than Gitter.
