
Question: Optimization and Dask dataframe to_parquet

See original GitHub issue

Hello, I’m trying both to use Dask in my workflow and to experiment with different ways of doing the same task, so that I can learn Dask and build an intuition for which API (array, dataframe, bag, delayed) to use. Some feedback on my thought process would therefore be greatly appreciated; I hope I can learn principles from a small example and extrapolate them to bigger workflows. I also have questions about how a Dask DataFrame is written to Parquet.

Below is a subset of my computational workflow: [screenshot of the workflow code]

Some background: the final Dask DataFrame can have up to 100k columns and 1 million rows. The parser for my input data returns test_arr as Dask delayed objects, so I can’t change that. I need each test_arr to map to a column, because in the next step of the pipeline I only need a subset of the columns at any given moment (but all the columns should be batched together as one dataset entity).

I’m only showing 2 conditions and 2 choices in np.select, but I have up to 10 pairs with very different logic.
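
To make this concrete, here is a minimal sketch of the per-column computation I have in mind (the function name, thresholds, and choices are made-up stand-ins for my real logic; each test_arr is one of the delayed arrays returned by the parser):

```python
import numpy as np
import dask

# Made-up stand-in for my real selection logic; in reality there are up to
# ~10 condition/choice pairs. Each `test_arr` is one delayed 1-D array.
@dask.delayed
def select_column(test_arr):
    conditions = [test_arr < 0.5, test_arr >= 0.5]
    choices = [0, 1]
    return np.select(conditions, choices)

# Quick check with a plain (non-delayed) array:
demo = select_column(np.array([0.1, 0.7, 0.4]))
print(demo.compute())  # -> [0 1 0]

# With the parser output it would be one delayed column per test_arr:
# columns = [select_column(arr) for arr in delayed_test_arrs]
```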

Is there a better way to do what I did? I resorted to using delayed all the way through for my use case, for the reasons below (these opinions come from my limited experience, so please correct me if I’m wrong):

  1. Dask DataFrame: hard to initialize efficiently if you are not reading it from a file (CSV, Parquet, etc.). There is no easy way to initialize a DataFrame column by column, only by assigning columns one at a time or via dd.concat.

  2. Dask Array: similar to DataFrame. Since there is no dask.array.select to mimic the NumPy API, I would have to use NumPy and then convert to a Dask array, and doing so would create a huge array in memory, which defeats the purpose. I couldn’t convert my test_arr to a Dask array using from_delayed because I need to know the number of rows for the shape argument of da.from_delayed, which means computing the array into memory (see the sketch after this list). Even then, the lack of a convenience function like da.select would make my code lengthier than needed because of the combination of multiple conditions and choices.

  3. Dask Bag: although a bag can be used in much more flexible ways, it suffers in speed. I have tried mapping simple functions onto a Dask bag and it was much, much slower; even when I have a sequence with length in the millions, I would batch it with delayed instead of using bags. In my mind, bags are the last resort when nothing else works. Also, I’m mostly making NumPy/pandas calls, so I would lose performance with bag, and to convert a bag into a DataFrame each element of the bag should be a row, whereas each of my entries is a column.
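
For reference, this is the kind of assembly I would do if the row count were known up front, which is exactly the sticking point mentioned in point 2 (the function, names, and dtype below are invented for illustration):

```python
import numpy as np
import dask.array as da
import dask.dataframe as dd

def columns_to_ddf(delayed_columns, nrows, names, dtype=np.int64):
    """Stack per-column delayed arrays into one wide Dask DataFrame.

    Only viable when `nrows` is known without computing anything.
    """
    arrays = [
        da.from_delayed(col, shape=(nrows,), dtype=dtype)
        for col in delayed_columns
    ]
    wide = da.stack(arrays, axis=1)   # dask array of shape (nrows, ncols)
    # Merge the per-column chunks so the array maps cleanly onto
    # row-wise DataFrame partitions.
    wide = wide.rechunk({1: -1})
    return dd.from_dask_array(wide, columns=list(names))

# e.g. with the delayed columns from the earlier sketch:
# ddf = columns_to_ddf(columns, nrows=1_000_000,
#                      names=[f"col_{i}" for i in range(len(columns))])
```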

Given the above, the logical choice was delayed: it is flexible, fast, and takes advantage of existing NumPy functions.

Parquet: with a table of this size (100k columns and 1 million rows), I’m afraid I can’t pull more than one column of the data into memory on the local cluster at any point before writing to Parquet (I’m talking to IT at work about lowering the restrictions on our cluster to allow dask-worker jobs). When writing to Parquet, does Dask do it one column at a time? How can I specify that each column is an independent unit?
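
To make the question concrete, this is the write-then-selective-read pattern I’m aiming for (paths and column names are invented, and the tiny DataFrame is only a stand-in for the real 100k-column table); my question is about what happens on the write side:

```python
import pandas as pd
import dask.dataframe as dd

# Tiny stand-in for the wide table.
ddf = dd.from_pandas(
    pd.DataFrame({"col_00001": [0, 1, 1], "col_00002": [1, 0, 1]}),
    npartitions=1,
)

# Write the whole table once.
ddf.to_parquet("genotype_table.parquet", engine="pyarrow", write_index=False)

# Later pipeline steps should read only the columns they need.
subset = dd.read_parquet(
    "genotype_table.parquet",
    engine="pyarrow",
    columns=["col_00002"],
)
```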

If there are some resources on any of my questions above, a link instead of explanation is more than enough. I might have missed something while reading the documentation. Thanks in advance!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented on May 14, 2019

I’ll certainly acknowledge that providing minimal examples is difficult 😃 It seems like my initial skim was off base, so I hope you can get something put together.

As an aside, code snippets are typically preferred to screenshots (makes it easier to try out & think about things locally).

On Tue, May 14, 2019 at 3:57 PM hoangthienan95 notifications@github.com wrote:

The data parser gives genotype, which is a list of delayed dictionaries. The array I care about is in the key “probs”. [screenshot: https://user-images.githubusercontent.com/25307953/57731732-59198880-7669-11e9-9f72-711dac66e6ec.png]

[screenshot: https://user-images.githubusercontent.com/25307953/57731675-37b89c80-7669-11e9-8f30-a8d817912af3.png]
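
Since the screenshots don’t carry over, the structure being described is roughly this (fake_parse is an invented stand-in for the real parser):

```python
import numpy as np
import dask

# Invented stand-in for the real parser: each call returns a delayed dict,
# and the array of interest sits under the "probs" key.
@dask.delayed
def fake_parse(i):
    return {"probs": np.random.rand(4), "meta": i}

genotype = [fake_parse(i) for i in range(3)]   # list of delayed dicts
prob_arrays = [g["probs"] for g in genotype]   # item access stays delayed
```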


0 reactions
TomAugspurger commented on Jun 17, 2019

I think something like dd.from_dask_array(dask_array, index=df.index) would work.
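
A minimal sketch of that suggestion, with placeholder data (the column, values, and sqrt step are only for illustration):

```python
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# Placeholder DataFrame; the point is attaching df's index to an array
# that was derived from it.
df = dd.from_pandas(
    pd.DataFrame({"x": [1.0, 4.0, 9.0]}, index=["a", "b", "c"]),
    npartitions=1,
)

arr = da.sqrt(df["x"].to_dask_array(lengths=True))  # a dask.array result
series = dd.from_dask_array(arr, index=df.index)    # Dask Series, same index as df
```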

On Mon, Jun 17, 2019 at 3:42 PM hoangthienan95 notifications@github.com wrote:

Thank you @TomAugspurger https://github.com/TomAugspurger, that worked. However, there are other parts of my code where I have to add columns while keeping the index, and this approach cannot be applied there. In general, if you create an array by applying a NumPy function to a Dask DataFrame df, how do you convert it to a Dask Series that retains the index of df so it can be appended? (Assuming the function is also implemented in dask.array.)


Read more comments on GitHub >

Top Results From Across the Web

dask.dataframe.to_parquet - Dask documentation
Parquet library to use. Defaults to 'auto', which uses pyarrow if it is installed, and falls back to fastparquet otherwise. compression: string or ...
Read more >
Dask DataFrame.to_parquet fails on read - repartition
read_parquet is a single concept to dask. It can tune and optimize however it needs within this task. That includes "peeking" at the...
Read more >
Python and Parquet Performance - Data Syndrome
This post outlines how to use all common Python libraries to read and write Parquet format while taking advantage of columnar storage, ...
Read more >
Creating Disk Partitioned Lakes with Dask using partition_on
Let's start by outputting a Dask DataFrame with to_parquet using disk ... Disk partitioning is a powerful performance optimization that can ...
Read more >
dask/dask - Gitter
I need to index a large dask dataframe (about 300 partitions of 1GB - parquet ... @odovad that sounds like a great question...
Read more >
