dask dataframe from delayeds of dask dataframes
tl;dr: Can we please add (or could someone explain how this is already possible) a way to create dask dataframes from a function that returns a list of delayed(dd.DataFrame)s, without evaluating all of the delayed(dd.DataFrame)s in the list? I’m okay with passing metadata up the chain.
I’ve been using dask to interact with sqlalchemy queries and to distribute larger-than-memory outputs using pd.read_sql with the chunksize argument, like the following:
sub_df = dd.from_delayed([client.scatter(chunk) for chunk in pd.read_sql(compiled_str_qry, engine, chunksize=chunksize)])
This works well, and distributes my pandas dataframes over a k8s cluster without overloading any single worker’s memory.
When I try to request only part of the SQL query, I rely on sqlalchemy’s limit and offset. This doesn’t have ideal performance characteristics, but when working on a CTE I expect it to outperform loading the entire dataset in some cases. I’ve been hoping to use something of a mixed-methods approach to load the entire database as a dask dataframe, but it’s proving difficult, mainly because you can’t build a dask dataframe from a list of delayed(dd.DataFrame)s.
In my use case, it would be desirable to have a set of sqlalchemy queries that partition the resulting CTE into chunks based on limit/offset, and then to stitch the outputs of those queries (each a dask dataframe, to allow for larger-than-memory queries) into a single dataframe representing all of them together. This would allow chunks of SQL to be lazily loaded as the scheduler finds appropriate. It would also allow whole tasks to be dropped without re-requesting all of the data at once. I believe this would be good for resilience: with the alternative of loading the entire SQL query in one go, the scheduler would have to re-run that single task, without being able to discern which part failed.
Please let me know what you think! I’m personally not fluent enough in dask to make this happen, but there are a couple of reasons I think it could be done: since I’m using a limit in SQL, I know an upper bound on the size each dataframe will occupy (in fact, I know the exact size of all but the last dataframe), and there is already the to_delayed and from_delayed interface. If a function operates on a list of delayeds, it stands to reason that it should be possible to splice a set of delayeds into that list in place of one of them on evaluation. A rough sketch of the kind of thing I mean is below.
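For concreteness, here is a minimal sketch of a workaround under the current API: it does not splice delayed dask dataframes together, but approximates the same partitioning with one delayed pandas chunk per limit/offset window, passing the schema explicitly via meta. The helper names, query string handling, and row counts (load_window, sql_to_dask, TOTAL_ROWS, CHUNK_ROWS) are placeholders, not my exact setup.

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical sizes; the real values come from the query/CTE being partitioned.
TOTAL_ROWS = 1_000_000
CHUNK_ROWS = 50_000

@dask.delayed
def load_window(engine_url, base_query, offset, limit):
    """Load one limit/offset window of the query as a pandas DataFrame."""
    from sqlalchemy import create_engine
    engine = create_engine(engine_url)  # created per task so nothing unpicklable is shipped
    qry = f"{base_query} LIMIT {limit} OFFSET {offset}"
    return pd.read_sql(qry, engine)

def sql_to_dask(engine_url, base_query, meta):
    """Stitch the windows into one dask dataframe without evaluating them."""
    parts = [
        load_window(engine_url, base_query, offset, CHUNK_ROWS)
        for offset in range(0, TOTAL_ROWS, CHUNK_ROWS)
    ]
    # meta (an empty pandas DataFrame with the right column dtypes) is the
    # "metadata passed up the chain" mentioned above.
    return dd.from_delayed(parts, meta=meta)
```

Each window is then an independent task that the scheduler can drop or retry on its own, which is the resilience property described above.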
Issue Analytics
- State: Closed
- Created: 2 years ago
- Comments: 5 (3 by maintainers)
Very cool! I’ve understood the link to mean I should be offloading my data to a bucket, and that’s worked very well for the limited trial run I’ve done. Also, I believe the documentation might be out of date here: https://docs.dask.org/en/latest/remote-data-services.html claims that you can use 'gcs.session.credentials', but 'gcs.credentials.credentials' is the only thing that’s worked for me as of right now. Please keep me in the loop as things develop. I’d be eager to make more advanced use of dask’s delayed interface, and I think it’ll likely be the main pipeline people use for data ingress.
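For anyone who lands here, a minimal sketch of what that bucket offloading can look like, building on the sub_df from earlier. The bucket path and credentials file are placeholders, and credentials are passed through storage_options here rather than through the dask config keys discussed above:

```python
import dask.dataframe as dd

# Hypothetical bucket path and service-account file; adjust for the real project.
BUCKET_PATH = "gcs://my-bucket/sql-extract/"
STORAGE_OPTIONS = {"token": "/path/to/service-account.json"}  # forwarded to gcsfs

# Write the larger-than-memory query result out as partitioned parquet ...
sub_df.to_parquet(BUCKET_PATH, storage_options=STORAGE_OPTIONS)

# ... and read it back lazily; each parquet file becomes one partition/task
# that the scheduler can recompute independently.
df = dd.read_parquet(BUCKET_PATH, storage_options=STORAGE_OPTIONS)
```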
Thanks for the pointers!
@NativeVex it seems this issue is resolved by the recommendation of offloading data to a bucket. Thanks for pointing out the documentation issue, but that one seems unrelated to this issue. I will close this issue (if you think it is not resolved, feel free to re-open), but would you mind opening a separate issue explaining what you experienced with 'gcs.session.credentials'?