Foldby not combining in parallel.
Hey guys.
Been using dask for ad-hoc big data analysis, creating custom reports while processing a few terabytes of JSON data.
I’ve noticed that the foldby operation has a huge sequential task that is computed after it has applied the binop to each partition. In theory, if a combine function is supplied, couldn’t this be done in parallel by taking the output of each partition and combining them in a pairwise fashion?
Does this make sense or did I miss something?
Sample code:
(
bag
.filter(predicate)
.map(js.loads)
.map(transform)
.foldby(key, binop, initial, combine, initial)
.compute()
)
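To make the pairwise idea concrete, here is a minimal pure-Python sketch. It is not dask's implementation: the helper names (fold_partition, merge_two, pairwise_combine) are hypothetical, and for brevity it ignores the separate initial value that foldby accepts for combine. Each partition is folded on its own, and the per-partition dicts are then merged two at a time, so every merge within a round is independent and could in principle run in parallel.

```python
def fold_partition(records, key, binop, initial):
    """Fold one partition into {key: accumulated value} (the per-partition step)."""
    acc = {}
    for rec in records:
        k = key(rec)
        acc[k] = binop(acc.get(k, initial), rec)
    return acc


def merge_two(left, right, combine):
    """Merge two per-partition results, applying combine where keys collide."""
    out = dict(left)
    for k, v in right.items():
        out[k] = combine(out[k], v) if k in out else v
    return out


def pairwise_combine(partials, combine):
    """Merge neighbours round by round until one result remains.

    The merges inside a round do not depend on each other; they are run
    sequentially here only to keep the sketch self-contained.
    """
    partials = list(partials)
    if not partials:
        return {}
    while len(partials) > 1:
        next_round = []
        for i in range(0, len(partials), 2):
            pair = partials[i:i + 2]
            if len(pair) == 2:
                next_round.append(merge_two(pair[0], pair[1], combine))
            else:
                next_round.append(pair[0])  # odd one out moves up unchanged
        partials = next_round
    return partials[0]


# Toy data: three "partitions", summed by parity.
partitions = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10]]
is_even = lambda x: x % 2 == 0
partials = [
    fold_partition(p, is_even, lambda total, x: total + x, 0)
    for p in partitions
]
result = pairwise_combine(partials, lambda a, b: a + b)
```

With N partitions this needs about log2(N) rounds of merges instead of one task that walks all N outputs in sequence.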
Cheers!
Issue Analytics
- State:
- Created 6 years ago
- Comments: 19 (19 by maintainers)
Top GitHub Comments
There are a few other tree reduction implementations around in the codebase. You might want to search for split_every in the codebase.
I’m not sure why dictitems is not applied in the second case. Do any of the tests fail with its addition? That might be a useful way to see an unintended change.
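For readers unfamiliar with the split_every idea mentioned above, the following is a rough, generic sketch of a tree reduction whose fan-in is controlled by a parameter of that name. The function tree_reduce is hypothetical and written only for illustration; it is not copied from the dask codebase, and the real implementations may differ in detail.

```python
from functools import reduce
from operator import add


def tree_reduce(items, combine, split_every=2):
    """Reduce items in rounds, combining at most split_every items per group.

    Returns the final value and the number of rounds it took.
    """
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = list(items)
    if not level:
        raise ValueError("nothing to reduce")
    rounds = 0
    while len(level) > 1:
        # Each group in a round is independent of the others.
        level = [
            reduce(combine, level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
        rounds += 1
    return level[0], rounds


# 16 inputs with groups of 4: 16 -> 4 -> 1, so two rounds.
total, rounds = tree_reduce(range(1, 17), add, split_every=4)
```

A larger split_every gives a shallower tree with more work per task; a smaller one gives more, smaller tasks.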
Different groups operate differently, but most dask devs congregate on GitHub rather than Gitter.