Wide DataFrames
Recently a few people have asked about dataframes that have many thousands of columns.
Generally, the current design of dask.dataframe doesn’t work well in this case; we partition on the index, not on different columns.

We could consider partitioning on columns, though; this would be a completely separate design, but it might be useful. I suspect that this approach would be simpler to implement for the cases where it is relevant. By assuming that any column fits into memory, most algorithms become fairly simple.
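The column-partitioned idea can be illustrated outside of dask with a minimal sketch: hold each group of columns as its own in-memory pandas frame sharing a common index, and apply per-column operations partition by partition. The `ColumnPartitionedFrame` class and its methods below are hypothetical, invented for this illustration; they are not part of dask or pandas.

```python
import pandas as pd

class ColumnPartitionedFrame:
    """Toy container: a wide table stored as several column partitions
    that share one index (hypothetical, for illustration only)."""

    def __init__(self, df, partition_size=2):
        # Split the columns into groups of `partition_size` columns each
        cols = list(df.columns)
        self.partitions = [
            df[cols[i:i + partition_size]]
            for i in range(0, len(cols), partition_size)
        ]

    def map_columns(self, func):
        # Apply `func` to each partition independently; because every
        # partition holds whole columns, no shuffling across the index
        # is ever needed.
        out = ColumnPartitionedFrame.__new__(ColumnPartitionedFrame)
        out.partitions = [func(p) for p in self.partitions]
        return out

    def to_pandas(self):
        # Reassemble the wide frame by concatenating along columns.
        return pd.concat(self.partitions, axis=1)

# A small wide frame: 5 columns, each holding 0, 1, 2
df = pd.DataFrame({f"c{i}": range(3) for i in range(5)})
wide = ColumnPartitionedFrame(df, partition_size=2)
doubled = wide.map_columns(lambda p: p * 2)
print(doubled.to_pandas().iloc[1].tolist())  # → [2, 2, 2, 2, 2]
```

Each partition fits in memory on its own, which is exactly the simplifying assumption the issue mentions: element-wise and per-column algorithms then reduce to a plain map over partitions.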
Issue Analytics
- State:
- Created 8 years ago
- Comments: 21 (18 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
A lot of people have wide dataframes, regardless of tidiness. This might not be a data engineering choice but instead perhaps a data storage choice. I can imagine storing all of my timeseries data for each stock separately and not wanting to reshuffle it all just to make my data tidy.
I would also like to be able to partition a dataframe by columns. In my case, I am doing natural language processing on a large number of documents. Each document has some metadata about it and also the full text (or some representation of it in tokens or parts of speech). On disk, there is one csv file for the metadata and one file for each full text.

It would be helpful if I could store all of this in one dask dataframe that loaded the text only when that column was needed. For example, a common task is first filtering on some metadata and then getting the text for those documents. Currently, it isn’t possible to do this without reading in all of the documents for the rows we are scanning.