Question: Optimization and Dask dataframe to_parquet
Hello, I’m trying to both use Dask in my workflow and experiment with different ways to do the same task, so I can learn Dask and build an intuition for which API (array, dataframe, bag, delayed) to use. Feedback on the thought process would therefore be greatly appreciated. I hope I can learn principles from a small example and extrapolate them to bigger workflows. I also have questions about how a Dask DataFrame is written to Parquet.
Below is a subset of my computational workflow:
Some background: the final Dask DataFrame can have up to 100k columns and 1 million rows. The parser for my input data returns `test_arr` as Dask delayed objects, so I can’t change that. I need each `test_arr` to map to a column because in the next step of the pipeline I only need a subset of columns at any given moment (but all the columns should be batched together as one dataset entity). I’m only showing 2 conditions and 2 choices in `np.select`, but I have up to 10 pairs with very different logic.
Is there a better way to do what I did? I resorted to using `delayed` all the way for my use case because (these opinions come from my limited experience, please correct me if I’m wrong):
- Dask DataFrame: hard to initialize efficiently if you are not reading it from a file (`csv`, `parquet`, etc.). There is no easy way to initialize a dataframe by columns, only by assigning them one at a time or via `dd.concat`.
- Dask Array: similar to DataFrame. Since there is no `dask.array.select` to mimic the numpy API, I would have to use numpy and then convert to a Dask array, and doing so would create a huge array in memory, which defeats the purpose. I couldn’t convert my `test_arr` to a Dask array using `from_delayed` because I need to know the number of rows for the `shape` argument of `da.from_delayed`, which means computing the array into memory (see the sketch just after this list). Even then, the lack of a convenience function like `da.select` would make my code lengthier than needed because of the combination of multiple conditions and choices.
- Dask Bag: although a Dask bag can be used in much more flexible ways, it suffers in speed. I have tried mapping simple functions onto a Dask bag and found it much, much slower; even when I have a sequence with length in the millions, I batch it with `delayed` instead of using bags. In my mind, bags are the last resort when nothing else works. Also, I’m doing mostly numpy/pandas calls, so I would lose out on performance with `bag`. Finally, to convert a `bag` into a `dataframe`, each element in the bag should be a row, whereas I have each entry as a column.
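To make the `da.from_delayed` point concrete, here is a minimal sketch; the parser function and the sizes are hypothetical placeholders, not my real code:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def parse_column(name):
    # Hypothetical stand-in for the real parser: returns one 1-D array per column
    return np.random.rand(1_000)

test_arr = parse_column("col_0")

# da.from_delayed requires shape (and dtype) up front, which is exactly the
# sticking point when the number of rows is unknown until the parser has run.
col = da.from_delayed(test_arr, shape=(1_000,), dtype=float)
```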
From the above, the logical way is to use `delayed`: it is flexible and fast, and it takes advantage of already-existing numpy functions.
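For concreteness, a minimal sketch of this kind of delayed-per-column pipeline (the parser, the column names, and the condition/choice pairs below are illustrative placeholders rather than my actual code):

```python
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def parse_column(name):
    # Placeholder for the parser that yields one 1-D test_arr per column
    return np.random.rand(1_000)

@dask.delayed
def transform(arr):
    # np.select with condition/choice pairs; only two shown, up to ten in practice
    conditions = [arr < 0.3, arr >= 0.7]
    choices = [arr * 10, arr * 100]
    return np.select(conditions, choices, default=0.0)

col_names = [f"col_{i}" for i in range(5)]  # 5 columns for illustration
delayed_cols = {name: transform(parse_column(name)) for name in col_names}

# Assemble one delayed pandas DataFrame from the delayed columns,
# then hand it to dask.dataframe
delayed_df = dask.delayed(pd.DataFrame)(delayed_cols)
meta = pd.DataFrame({name: pd.Series(dtype="float64") for name in col_names})
ddf = dd.from_delayed([delayed_df], meta=meta)
```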
Parquet: with a table of this size (100k columns and 1 million rows), I’m afraid I can’t pull more than one column of the data into memory on the local cluster at any point, right up to the moment of writing to Parquet (I’m talking to IT at my work about relaxing restrictions on our cluster to allow `dask-worker` jobs). When writing to Parquet, does Dask do it one column at a time? How can I specify that each column is an independent unit?
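For concreteness, a rough sketch of the write/read pattern I’m aiming for (the path, engine choice, and column names are placeholders):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Tiny stand-in dataframe (the real one has ~100k columns)
pdf = pd.DataFrame({f"col_{i}": np.random.rand(100) for i in range(5)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Write the full table once
ddf.to_parquet("output/dataset.parquet", engine="pyarrow", write_index=True)

# Later pipeline steps only need a subset of the columns at any given moment
subset = dd.read_parquet(
    "output/dataset.parquet",
    engine="pyarrow",
    columns=["col_0", "col_3"],
)
```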
If there are resources on any of my questions above, a link instead of an explanation is more than enough; I might have missed something while reading the documentation. Thanks in advance!
Issue Analytics
- Created: 4 years ago
- Comments: 12 (6 by maintainers)
I’ll certainly acknowledge that providing minimal examples is difficult 😃 It seems like my initial skim was off base, so I’ll hope that you can get something put together.
As an aside, code snippets are typically preferred to screenshots (makes it easier to try out & think about things locally).
I think something like `dd.from_dask_array(dask_array, index=df.index)` would work.
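A minimal, self-contained version of that suggestion might look like the following (the dataframe and array are made up, and the array’s row chunks are sized to match the dataframe’s partitions):

```python
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# A dask dataframe whose index we want to reuse (2 partitions of 4 rows each)
pdf = pd.DataFrame({"a": np.arange(8)}, index=np.arange(8) * 10)
df = dd.from_pandas(pdf, npartitions=2)

# A dask array whose row chunks line up with the dataframe's partition sizes
dask_array = da.ones((8, 2), chunks=(4, 2))

# Build a new dataframe from the array that shares df's index
new_ddf = dd.from_dask_array(dask_array, columns=["x", "y"], index=df.index)
print(new_ddf.head())
```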