
Recently a few people have asked about dataframes that have many thousands of columns.

Generally, the current design of dask.dataframe doesn't work well in this case; we partition on the index, not on different columns.

We could consider partitioning on columns, though; this would be a completely separate design, but it might be useful. I suspect that this approach would be simpler to implement for the cases where it is relevant: if we assume that any single column fits into memory, most algorithms become fairly simple.
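For context, a minimal sketch of how dask.dataframe's existing index-based partitioning behaves; the frame shape and partition count here are illustrative. The point is that every partition carries all columns, which is why very wide frames see no benefit from the current design:

```python
import pandas as pd
import dask.dataframe as dd

# An illustrative frame: modest row count, several columns.
pdf = pd.DataFrame({f"col{i}": range(1_000) for i in range(5)})

# from_pandas splits along the index into row-wise partitions.
ddf = dd.from_pandas(pdf, npartitions=4)

print(ddf.npartitions)   # 4 chunks of rows
print(len(ddf.columns))  # 5 -- every partition still holds all columns
```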

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 21 (18 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, May 18, 2016

A lot of people have wide dataframes, regardless of tidiness. This might not be a data engineering choice but instead perhaps a data storage choice. I can imagine storing all of my timeseries data for each stock separately and not wanting to reshuffle it all just to make my data tidy.

0 reactions
saulshanabrook commented, Nov 11, 2016

I would also like to be able to partition a dataframe by columns. In my case, I am doing natural language processing on a large number of documents. Each document has some metadata about it as well as its full text (or some representation of it in tokens or parts of speech). On disk, there is one CSV file for the metadata and one file for each full text.

It would be helpful if I could store all of this in one dask dataframe that only loads the text when that column is needed. For example, a common task is first filtering on some metadata and then getting the text for those documents. Currently, it isn't possible to do this without reading in all of the documents for the rows we are scanning.
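A hedged workaround sketch of the filter-then-load pattern described above, using dask.delayed rather than dask.dataframe itself. The file layout (a "meta.csv" file, a "texts/" directory keyed by "doc_id") and the "year" filter are assumptions for illustration:

```python
import pandas as pd
import dask

@dask.delayed
def load_text(doc_id):
    # Hypothetical layout: one text file per document, keyed by doc_id.
    with open(f"texts/{doc_id}.txt") as f:
        return f.read()

# The metadata CSV is small enough to read eagerly.
meta = pd.read_csv("meta.csv")

# Filter on metadata first, then load text only for the surviving rows.
hits = meta[meta["year"] >= 2015]
texts = dask.compute(*(load_text(d) for d in hits["doc_id"]))
```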


Top Results From Across the Web

  • Wide vs. long data - Anvil Works
    'Wide form' data is also sometimes called 'un-stacked'. Libraries that work best with wide data: Matplotlib, Plotly, Bokeh, PyGal, Pandas. Pandas DataFrames…
  • pandas.wide_to_long — pandas 1.5.2 documentation
    A DataFrame that contains each stub name as a variable, with new index (i, j). See also: melt, which unpivots a DataFrame from wide…
  • Reshaping a Pandas Dataframe: Long-to-Wide and Vice Versa
    To reshape the dataframe from long to wide in Pandas, we can use the pd.pivot() method: pd.pivot(df, index=, columns=, values=)… (a runnable sketch follows this list)
  • Displaying Wide pandas Dataframes Over Several, Narrower …
    In an HTML output, the wide dataframe is rendered via a scroll box; in a PDF output, the content is just lost over…
  • Wide and long data formats - Data Carpentry
    Pandas provides methods for converting data from wide to long format and … can be problems with the index when we concatenate two…
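For reference, a small self-contained sketch of the long-to-wide and wide-to-long reshaping that the results above describe; the column names and values are made up for illustration:

```python
import pandas as pd

long_df = pd.DataFrame({
    "date":  ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
    "stock": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price": [100.0, 50.0, 101.0, 51.0],
})

# Long -> wide: one column per stock.
wide_df = long_df.pivot(index="date", columns="stock", values="price")

# Wide -> long again: melt is the inverse of pivot.
back = wide_df.reset_index().melt(
    id_vars="date", var_name="stock", value_name="price"
)
```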
