Question: Optimization and Dask dataframe to_parquet
Hello, I’m trying to both use Dask in my workflow and experiment with different ways to do the same task, so I can learn Dask and build an intuition for which API (array, dataframe, bag, delayed) to use. Feedback on the thought process would therefore be greatly appreciated. I hope I can learn principles from a small example and extrapolate them to bigger workflows. I also have questions about how a Dask DataFrame is written to Parquet.
Below is a subset of my computational workflow:
Some background: the final Dask DataFrame can have up to 100k columns and 1 million rows. The parser for my input data returns `test_arr` as Dask delayed objects, so I can’t change that. I need each `test_arr` to map to a column because in the next step of the pipeline I only need a subset of columns at any given moment (but all the columns should be batched together as one dataset entity). I’m only showing 2 conditions and 2 choices in `np.select`, but I have up to 10 pairs with very different logic.
Is there a better way to do what I did? I resorted to using `delayed` all the way for my use case because (these opinions come from my limited experience, please correct me if I’m wrong):
- Dask DataFrame: hard to initialize efficiently if you are not reading it from a file (`csv`, `parquet`, etc.). There is no easy way to initialize a dataframe by columns, only by assigning them one at a time or via `dd.concat`.
- Dask Array: similar to DataFrame. Since there is no `dask.array.select` to mimic the numpy API, I would have to use numpy and then convert to a Dask array, and doing so would create a huge array in memory, which defeats the purpose. I couldn’t convert my `test_arr` to a Dask array using `from_delayed` because I need to know the number of rows for the `shape` argument of `da.from_delayed`, which means computing the array into memory (see the sketch just after this list). Even then, the lack of a convenience function like `da.select` would make my code lengthier than needed because of the combination of multiple conditions and choices.
- Dask Bag: although a Dask bag can be used in much more flexible ways, it suffers in speed. I have tried mapping simple functions onto a Dask bag and found it much, much slower; even when I have a sequence with length in the millions, I batch it with `delayed` instead of using bags. In my mind, bags are the last resort when nothing else works. Also, I’m doing mostly numpy/pandas calls, so I would lose out on performance with `bag`. Finally, to convert a `bag` into a `dataframe`, each element in the bag should be a row, whereas I have each entry as a column.
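To make the `da.from_delayed` point concrete, here is a minimal sketch; the parser function and the sizes are hypothetical placeholders, not my real code:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def parse_column(name):
    # Hypothetical stand-in for the real parser: returns one 1-D array per column
    return np.random.rand(1_000)

test_arr = parse_column("col_0")

# da.from_delayed requires shape (and dtype) up front, which is exactly the
# sticking point when the number of rows is unknown until the parser has run.
col = da.from_delayed(test_arr, shape=(1_000,), dtype=float)
```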
From the above, the logical way is to use `delayed`: it is flexible and fast, and it takes advantage of already-existing numpy functions.
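For concreteness, a minimal sketch of this kind of delayed-per-column pipeline (the parser, the column names, and the condition/choice pairs below are illustrative placeholders rather than my actual code):

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def parse_column(name):
    # Placeholder for the parser that yields one 1-D test_arr per column
    return np.random.rand(1_000)

@dask.delayed
def transform(arr):
    # np.select with condition/choice pairs; only two shown, up to ten in practice
    conditions = [arr < 0.3, arr >= 0.7]
    choices = [arr * 10, arr * 100]
    return np.select(conditions, choices, default=0.0)

col_names = [f"col_{i}" for i in range(5)]  # 5 columns for illustration
delayed_cols = {name: transform(parse_column(name)) for name in col_names}

# Assemble one delayed pandas DataFrame from the delayed columns,
# then hand it to dask.dataframe
delayed_df = dask.delayed(pd.DataFrame)(delayed_cols)
meta = pd.DataFrame({name: pd.Series(dtype="float64") for name in col_names})
ddf = dd.from_delayed([delayed_df], meta=meta)
```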
Parquet: with a table of this size (100k columns and 1 million rows), I’m afraid I can’t pull more than one column of the data into memory on the local cluster at any point, right up to the moment of writing to Parquet (I’m talking to IT at my work about relaxing restrictions on our cluster to allow `dask-worker` jobs). When writing to Parquet, does Dask do it one column at a time? How can I specify that each column is an independent unit?
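For concreteness, a rough sketch of the write/read pattern I’m aiming for (the path, engine choice, and column names are placeholders):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Tiny stand-in dataframe (the real one has ~100k columns)
pdf = pd.DataFrame({f"col_{i}": np.random.rand(100) for i in range(5)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Write the full table once
ddf.to_parquet("output/dataset.parquet", engine="pyarrow", write_index=True)

# Later pipeline steps only need a subset of the columns at any given moment
subset = dd.read_parquet(
    "output/dataset.parquet",
    engine="pyarrow",
    columns=["col_0", "col_3"],
)
```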
If there are resources on any of my questions above, a link instead of an explanation is more than enough; I might have missed something while reading the documentation. Thanks in advance!
Issue Analytics
- Created: 4 years ago
- Comments: 12 (6 by maintainers)
I’ll certainly acknowledge that providing minimal examples is difficult 😃 It seems like my initial skim was off base, so I’ll hope that you can get something put together.
As an aside, code snippets are typically preferred to screenshots (makes it easier to try out & think about things locally).
I think something like `dd.from_dask_array(dask_array, index=df.index)` would work.
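A minimal, self-contained version of that suggestion might look like the following (the dataframe and array are made up, and the array’s row chunks are sized to match the dataframe’s partitions):

```python
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# A dask dataframe whose index we want to reuse (2 partitions of 4 rows each)
pdf = pd.DataFrame({"a": np.arange(8)}, index=np.arange(8) * 10)
df = dd.from_pandas(pdf, npartitions=2)

# A dask array whose row chunks line up with the dataframe's partition sizes
dask_array = da.ones((8, 2), chunks=(4, 2))

# Build a new dataframe from the array that shares df's index
new_ddf = dd.from_dask_array(dask_array, columns=["x", "y"], index=df.index)
print(new_ddf.head())
```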