
dask dataframe from delayeds of dask dataframes

See original GitHub issue

tl;dr: Can we please add support for (or could someone explain how it is already possible to) create dask dataframes from a function that returns a list of delayed(dd.DataFrame) objects, without evaluating all of the delayed(dd.DataFrame)s in the list? I’m okay with passing metadata up the chain.

I’ve been using dask to interact with sqlalchemy queries and to distribute larger-than-memory outputs using the pd.read_sql function with its chunksize argument, like the following:

sub_df = dd.from_delayed([
    client.scatter(x)
    for x in pd.read_sql(compiled_str_qry, engine, chunksize=chunksize)
])

This works well, and distributes my pandas dataframes over a k8s cluster without overloading any single worker’s memory.
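For reference, here is a slightly expanded, self-contained sketch of that pattern. The connection string, query, and chunk size are hypothetical placeholders for the real setup, and the bare Client() call stands in for connecting to the k8s cluster.

```python
import pandas as pd
import sqlalchemy as sa
import dask.dataframe as dd
from dask.distributed import Client

DB_URL = "postgresql://user:pass@host/db"   # placeholder connection string
QUERY = "SELECT * FROM my_table"            # placeholder query
CHUNKSIZE = 50_000                          # placeholder chunk size

client = Client()                           # connects to the cluster (k8s in the original setup)
engine = sa.create_engine(DB_URL)

# pd.read_sql with chunksize yields pandas DataFrames one chunk at a time.
# client.scatter moves each chunk onto a worker and returns a Future, and
# dd.from_delayed accepts an iterable of Delayed objects or Futures,
# stitching them into a single dask DataFrame (one partition per chunk).
sub_df = dd.from_delayed([
    client.scatter(chunk)
    for chunk in pd.read_sql(QUERY, engine, chunksize=CHUNKSIZE)
])
```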

When I try to request only part of the SQL query, I rely on sqlalchemy’s limit and offset. This doesn’t have ideal performance characteristics, but when working on a CTE I expect it to outperform loading the entire dataset in some cases. I’ve been hoping to use something of a mixed-methods approach to load the entire database as a dask dataframe, but it’s proving difficult, mainly because you can’t get a dask dataframe from a list of delayed(dd.DataFrame)s.
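As an illustration of the limit/offset slicing mentioned above, a minimal SQLAlchemy (1.4+ style) sketch might look like the following. The table, column names, and connection string are placeholders, and compiled_str_qry mirrors the name used in the earlier snippet.

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@host/db")  # placeholder URL

metadata = sa.MetaData()
events = sa.Table("events", metadata, autoload_with=engine)   # placeholder table

# Some non-trivial query expressed as a CTE.
cte = sa.select(events).where(events.c.value > 0).cte("filtered")

# One chunk of the CTE: rows [offset, offset + limit).
chunk_qry = sa.select(cte).limit(50_000).offset(100_000)
compiled_str_qry = str(
    chunk_qry.compile(engine, compile_kwargs={"literal_binds": True})
)
```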

In my use case, it would be desirable to have a set of sqlalchemy queries that partition the resulting CTE into chunks based on limit/offset, and then to stitch the outputs of those queries (each a dask dataframe, to allow for larger-than-memory queries) into one representative dataframe that contains all of them. This would allow chunks of SQL to be lazily loaded as the scheduler finds appropriate. It would also allow dropping individual tasks without having to re-request all the data at once. I believe this would be good for resilience; with the alternative of loading the entire SQL query in one go, the scheduler would need to re-run that single task, without being able to discern which part failed.
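One hedged workaround sketch for this kind of stitching (not an existing dask API, just an illustration under the stated assumptions): make each limit/offset chunk a delayed pandas DataFrame rather than a dask one, so the pieces can already be combined with dd.from_delayed. The query, row counts, connection string, and the load_chunk helper are all hypothetical; total_rows is assumed to be known or cheaply countable, as noted above.

```python
import pandas as pd
import sqlalchemy as sa
import dask
import dask.dataframe as dd

DB_URL = "postgresql://user:pass@host/db"       # placeholder
QUERY = "SELECT * FROM filtered_cte"            # placeholder CTE / query
TOTAL_ROWS, CHUNK_ROWS = 1_000_000, 50_000      # placeholder sizes

@dask.delayed
def load_chunk(offset: int, limit: int) -> pd.DataFrame:
    # Each task opens its own connection, so it can run on any worker.
    engine = sa.create_engine(DB_URL)
    return pd.read_sql(f"{QUERY} LIMIT {limit} OFFSET {offset}", engine)

# An empty frame with the right columns/dtypes, used as metadata so dask
# does not have to evaluate a chunk up front.
meta = pd.read_sql(f"{QUERY} LIMIT 0", sa.create_engine(DB_URL))

parts = [load_chunk(off, CHUNK_ROWS) for off in range(0, TOTAL_ROWS, CHUNK_ROWS)]
ddf = dd.from_delayed(parts, meta=meta)   # one lazy dask DataFrame over all chunks
```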

Please let me know what you think! I’m personally not fluent enough in dask to make this happen, but there are a couple of reasons I think it could be done: since I’m using a limit in SQL, I know an upper bound on the size each dataframe will occupy (I know the exact size of all but the last dataframe), and there is already the to_delayed and from_delayed interface. If a function operates on a list of delayeds, it stands to reason that it should be possible to splice a set of delayeds into that list in place of one of them at evaluation time.
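For context, the to_delayed/from_delayed round trip already allows splicing partitions together when the sub-frames exist eagerly; what this issue asks for is the same behaviour when each sub-frame is itself wrapped in a Delayed. A minimal sketch of the eager case:

```python
import pandas as pd
import dask.dataframe as dd

sub1 = dd.from_pandas(pd.DataFrame({"x": range(6)}), npartitions=2)
sub2 = dd.from_pandas(pd.DataFrame({"x": range(6, 12)}), npartitions=2)

# .to_delayed() returns one Delayed per partition, each evaluating to a
# pandas DataFrame, so the partition lists of several dask DataFrames can
# be spliced into a single flat list...
parts = list(sub1.to_delayed()) + list(sub2.to_delayed())

# ...and from_delayed stitches that flat list back into one dask DataFrame
# without computing anything yet.
combined = dd.from_delayed(parts, meta=sub1._meta)
```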

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
NativeVex commented, Aug 21, 2021

Very cool! I’ve understood the link to mean I should be offloading my data to a bucket, and that has worked very well for the limited trial run I’ve done. Also, I believe the documentation might be out of date here: https://docs.dask.org/en/latest/remote-data-services.html claims that you can use 'gcs.session.credentials', but gcs.credentials.credentials is the only thing that has worked for me so far.
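A hedged sketch of the bucket-offloading approach, assuming gcsfs and a parquet engine such as pyarrow are installed; the bucket path and credentials token are placeholders, and in recent dask/gcsfs versions credentials are typically passed via storage_options rather than the config keys quoted above.

```python
import pandas as pd
import dask.dataframe as dd

# Stand-in for the dask DataFrame produced from the SQL query.
ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# gcsfs accepts a path to a service-account JSON file, a credentials dict,
# "cloud", or "google_default" as the token value.
opts = {"token": "service-account.json"}  # placeholder credentials

# Write the partitions to GCS, then read them back lazily from the bucket.
ddf.to_parquet("gcs://my-bucket/my-table/", storage_options=opts)
ddf2 = dd.read_parquet("gcs://my-bucket/my-table/", storage_options=opts)
```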

Please keep me in the loop as things develop; I’d be eager to make more advanced use of dask’s delayed interface, and I think it’ll likely be the main pipeline people use for data ingress.

Thanks for the pointers!

0 reactions
ncclementi commented, Oct 14, 2021

@NativeVex it seems this issue is resolved by the recommendation of offloading data to a bucket. Thanks for pointing out the documentation problem, but that one seems unrelated to this issue. I will close this issue (if you think it is not resolved, feel free to re-open), but would you mind opening a separate issue describing what you experienced with 'gcs.session.credentials'?


Top Results From Across the Web

dask.dataframe.from_delayed
A dask.delayed.Delayed, a distributed.Future, or an iterable of either of these objects, e.g. returned by client.submit. These comprise the individual ...

DataFrame.to_delayed
Convert into a list of dask.delayed objects, one per partition. Parameters: optimize_graph: bool, optional. If True [default], the ...

Dask Delayed
The Dask delayed function decorates your functions so that they operate lazily. Rather than executing your function immediately, it will defer execution, ...

Working with Collections
dask.delayed to lazily read these files into Pandas DataFrames, use dd.from_delayed to wrap these pieces up into a single Dask DataFrame, use ...

Create and Store Dask DataFrames
Dask delayed is particularly useful when simple map operations aren't sufficient to capture the complexity of your data layout. It lets you construct ...