dask dataframe from delayeds of dask dataframes
tl;dr: Can we please add (or could someone explain how this is already possible) a way to create dask dataframes from a function that returns a list of delayed(dd.DataFrame)s, without evaluating all of the delayed(dd.DataFrame)s in the list? I’m okay with passing metadata up the chain.
I’ve been using dask to interact with sqlalchemy queries and to distribute larger-than-memory outputs using pd.read_sql with the chunksize argument, like the following:
sub_df = dd.from_delayed([client.scatter(chunk) for chunk in pd.read_sql(compiled_str_qry, engine, chunksize=chunksize)])
This works well, and distributes my pandas dataframes over a k8s cluster without overloading any single worker’s memory.
When I try to request only part of the SQL query, I rely on sqlalchemy’s limit and offset. This doesn’t have ideal performance characteristics, but when working on a CTE I expect it to outperform loading the entire dataset in some cases. I’ve been hoping to use something of a mixed-methods approach to load the entire database as a dask dataframe, but it’s proving difficult, mainly because you can’t build a dask dataframe from a list of delayed(dd.DataFrame)s.
In my use case, it would be desirable to have a set of sqlalchemy queries that partition the resulting CTE into chunks based on limit/offset, and then to stitch the outputs of those queries (each a dask dataframe, to allow for larger-than-memory queries) into a single dataframe representing all of them together. This would allow chunks of SQL to be lazily loaded as the scheduler finds appropriate. It would also allow whole tasks to be dropped without re-requesting all of the data at once. I believe this would be good for resilience: with the alternative of loading the entire SQL query in one go, the scheduler would have to re-run that single task, without being able to discern which part failed.
Please let me know what you think! I’m personally not fluent enough in dask to make this happen, but there are a couple of reasons I think it could be done: since I’m using a limit in SQL, I know an upper bound on the size each dataframe will occupy (in fact, I know the exact size of all but the last dataframe), and there is already the to_delayed and from_delayed interface. If a function operates on a list of delayeds, it stands to reason that it should be possible to splice a set of delayeds into that list in place of one of them on evaluation. A rough sketch of the kind of thing I mean is below.
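For concreteness, here is a minimal sketch of a workaround under the current API: it does not splice delayed dask dataframes together, but approximates the same partitioning with one delayed pandas chunk per limit/offset window, passing the schema explicitly via meta. The helper names, query string handling, and row counts (load_window, sql_to_dask, TOTAL_ROWS, CHUNK_ROWS) are placeholders, not my exact setup.

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical sizes; the real values come from the query/CTE being partitioned.
TOTAL_ROWS = 1_000_000
CHUNK_ROWS = 50_000

@dask.delayed
def load_window(engine_url, base_query, offset, limit):
    """Load one limit/offset window of the query as a pandas DataFrame."""
    from sqlalchemy import create_engine
    engine = create_engine(engine_url)  # created per task so nothing unpicklable is shipped
    qry = f"{base_query} LIMIT {limit} OFFSET {offset}"
    return pd.read_sql(qry, engine)

def sql_to_dask(engine_url, base_query, meta):
    """Stitch the windows into one dask dataframe without evaluating them."""
    parts = [
        load_window(engine_url, base_query, offset, CHUNK_ROWS)
        for offset in range(0, TOTAL_ROWS, CHUNK_ROWS)
    ]
    # meta (an empty pandas DataFrame with the right column dtypes) is the
    # "metadata passed up the chain" mentioned above.
    return dd.from_delayed(parts, meta=meta)
```

Each window is then an independent task that the scheduler can drop or retry on its own, which is the resilience property described above.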
Issue Analytics
- State: Closed
- Created: 2 years ago
- Comments: 5 (3 by maintainers)
Very cool! I’ve understood the link to mean I should be offloading my data to a bucket, and that’s worked very well for the limited trial run I’ve done. Also, I believe the documentation might be out of date here: https://docs.dask.org/en/latest/remote-data-services.html claims that you can use 'gcs.session.credentials', but 'gcs.credentials.credentials' is the only thing that’s worked for me as of right now. Please keep me in the loop as things develop. I’d be eager to make more advanced use of dask’s delayed interface, and I think it’ll likely be the main pipeline people use for data ingress.
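For anyone who lands here, a minimal sketch of what that bucket offloading can look like, building on the sub_df from earlier. The bucket path and credentials file are placeholders, and credentials are passed through storage_options here rather than through the dask config keys discussed above:

```python
import dask.dataframe as dd

# Hypothetical bucket path and service-account file; adjust for the real project.
BUCKET_PATH = "gcs://my-bucket/sql-extract/"
STORAGE_OPTIONS = {"token": "/path/to/service-account.json"}  # forwarded to gcsfs

# Write the larger-than-memory query result out as partitioned parquet ...
sub_df.to_parquet(BUCKET_PATH, storage_options=STORAGE_OPTIONS)

# ... and read it back lazily; each parquet file becomes one partition/task
# that the scheduler can recompute independently.
df = dd.read_parquet(BUCKET_PATH, storage_options=STORAGE_OPTIONS)
```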
Thanks for the pointers!
@NativeVex it seems this issue is resolved by the recommendation of offloading data to a bucket. Thanks for pointing out the documentation issue, but that one seems unrelated to this issue. I will close this issue (if you think it is not resolved, feel free to re-open), but would you mind opening a separate issue explaining what you experienced with 'gcs.session.credentials'?