
Recently a few people have asked about dataframes that have many thousands of columns.

Generally, the current design of dask.dataframe doesn't work well in this case; we partition on the index, not on different columns.

We could consider partitioning on columns, though; this would be a completely separate design, but it might be useful. I suspect that this approach would be simpler to implement for the cases where it is relevant: if we assume that any single column fits into memory, most algorithms become fairly simple.
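For context, a minimal sketch of how dask.dataframe's existing index-based partitioning behaves; the frame shape and partition count here are illustrative. The point is that every partition carries all columns, which is why very wide frames see no benefit from the current design:

```python
import pandas as pd
import dask.dataframe as dd

# An illustrative frame: modest row count, several columns.
pdf = pd.DataFrame({f"col{i}": range(1_000) for i in range(5)})

# from_pandas splits along the index into row-wise partitions.
ddf = dd.from_pandas(pdf, npartitions=4)

print(ddf.npartitions)   # 4 chunks of rows
print(len(ddf.columns))  # 5 -- every partition still holds all columns
```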

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 21 (18 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, May 18, 2016

A lot of people have wide dataframes, regardless of tidiness. This might not be a data engineering choice but instead perhaps a data storage choice. I can imagine storing all of my timeseries data for each stock separately and not wanting to reshuffle it all just to make my data tidy.

0 reactions
saulshanabrook commented, Nov 11, 2016

I would also like to be able to partition a dataframe by columns. In my case, I am doing natural language processing on a large number of documents. Each document has some metadata about it as well as its full text (or some representation of it in tokens or parts of speech). On disk, there is one CSV file for the metadata and one file for each full text.

It would be helpful if I could store all of this in one dask dataframe that only loads the text when that column is needed. For example, a common task is first filtering on some metadata and then getting the text for those documents. Currently, it isn't possible to do this without reading in all of the documents for the rows we are scanning.
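A hedged workaround sketch of the filter-then-load pattern described above, using dask.delayed rather than dask.dataframe itself. The file layout (a "meta.csv" file, a "texts/" directory keyed by "doc_id") and the "year" filter are assumptions for illustration:

```python
import pandas as pd
import dask

@dask.delayed
def load_text(doc_id):
    # Hypothetical layout: one text file per document, keyed by doc_id.
    with open(f"texts/{doc_id}.txt") as f:
        return f.read()

# The metadata CSV is small enough to read eagerly.
meta = pd.read_csv("meta.csv")

# Filter on metadata first, then load text only for the surviving rows.
hits = meta[meta["year"] >= 2015]
texts = dask.compute(*(load_text(d) for d in hits["doc_id"]))
```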


Top Results From Across the Web

  • Wide vs. long data - Anvil Works
    'Wide form' data is also sometimes called 'un-stacked'. Libraries that work best with wide data: Matplotlib, Plotly, Bokeh, PyGal, Pandas. Pandas DataFrames…
  • pandas.wide_to_long — pandas 1.5.2 documentation
    A DataFrame that contains each stub name as a variable, with new index (i, j). See also: melt, which unpivots a DataFrame from wide…
  • Reshaping a Pandas Dataframe: Long-to-Wide and Vice Versa
    To reshape the dataframe from long to wide in Pandas, we can use the pd.pivot() method: pd.pivot(df, index=, columns=, values=)… (a runnable sketch follows this list)
  • Displaying Wide pandas Dataframes Over Several, Narrower …
    In an HTML output, the wide dataframe is rendered via a scroll box; in a PDF output, the content is just lost over…
  • Wide and long data formats - Data Carpentry
    Pandas provides methods for converting data from wide to long format and … can be problems with the index when we concatenate two…
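For reference, a small self-contained sketch of the long-to-wide and wide-to-long reshaping that the results above describe; the column names and values are made up for illustration:

```python
import pandas as pd

long_df = pd.DataFrame({
    "date":  ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"],
    "stock": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "price": [100.0, 50.0, 101.0, 51.0],
})

# Long -> wide: one column per stock.
wide_df = long_df.pivot(index="date", columns="stock", values="price")

# Wide -> long again: melt is the inverse of pivot.
back = wide_df.reset_index().melt(
    id_vars="date", var_name="stock", value_name="price"
)
```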
