Can we avoid having a cell with a list?

As we know, storing a list in a Pandas cell is really not recommended. TokenSeries and VectorSeries, two of the core ideas of (current) Texthero, both do exactly that. Can this be avoided?

Need to discuss:

  • Alternatives using sub-columns (it’s still MultiIndex). We need to understand how complex and flexible this solution is. In 99% of cases, the standard Pandas/Texthero user does not really know how to work with MultiIndex …
  • Can we just use RepresentationSeries? Probably not, as we cannot merge it into a DataFrame with a single index. Are there alternatives to data alignment with reindex (too complicated)?
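
To make the two layouts under discussion concrete, here is a minimal sketch in plain pandas. The toy data and the names `df_list`, `df_multi`, `pca1` and `pca2` are made up for illustration and are not Texthero's API.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three documents, each with a 2-dimensional vector.
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Layout A: one list per cell (the VectorSeries-style layout).
df_list = pd.DataFrame({"text": ["a", "b", "c"]})
df_list["pca"] = pd.Series(vectors.tolist(), index=df_list.index)

# Layout B: one scalar per cell, with sub-columns grouped under "pca".
# This needs MultiIndex columns on the whole DataFrame.
text = pd.DataFrame(
    [["a"], ["b"], ["c"]],
    columns=pd.MultiIndex.from_tuples([("text", "")]),
)
sub = pd.DataFrame(
    vectors,
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
df_multi = pd.concat([text, sub], axis=1)

# Selecting the top-level label returns the whole matrix as a DataFrame.
pca_block = df_multi["pca"]
```

In layout B, `df_multi["pca"]` gives back a regular two-column DataFrame, but every other column now also needs a two-level label, which is the MultiIndex complexity the first bullet worries about.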

@mk2510 @henrifroese

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 18 (5 by maintainers)

Top GitHub Comments

henrifroese commented, Aug 18, 2020 (1 reaction)

After some more discussions, we have come to the following conclusions:

Conclusions:

  • Keep VectorSeries and TokenSeries as-is. Reason: inserting into existing DataFrames is too expensive for users.
  • Change the RepresentationSeries output of tfidf, count, and term_frequency to a new sparse VectorDF that we will henceforth call DocumentTermDF. Reason: it looks nicer and is just as performant.
  • All Dimensionality Reduction Functions and Clustering Functions will support both DocumentTermDF and VectorSeries input.
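
As a rough illustration of what a sparse document-term DataFrame can look like, here is a sketch using only pandas' built-in sparse dtype. It is not Texthero's actual DocumentTermDF implementation. The toy corpus and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical tokenized corpus (what a TokenSeries would hold, one list per document).
docs = [["red", "apple"], ["green", "apple", "apple"], ["red", "car"]]

# Build a dense count matrix: one row per document, one column per term.
vocab = sorted({tok for doc in docs for tok in doc})
counts = np.zeros((len(docs), len(vocab)), dtype="int64")
for row, doc in enumerate(docs):
    for tok in doc:
        counts[row, vocab.index(tok)] += 1

# Convert to sparse columns so that zero counts are not stored explicitly.
document_term_df = pd.DataFrame(counts, columns=vocab).astype(
    pd.SparseDtype("int64", 0)
)

# Fraction of entries that are actually stored (non-zero).
density = document_term_df.sparse.density
```

Each cell holds a single number rather than a list, and on a realistic vocabulary most cells are zero, which is where the sparse dtype saves memory.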
henrifroese commented, Aug 16, 2020 (1 reaction)

(Not so positive) update:

We have sadly now noticed the following, so what we’re doing overall in this issue (not just with the sparse stuff) is maybe not a viable solution after all 😕:

So our main issue is that we want to

  • store a matrix in a DataFrame that looks nice, so not one whole row (vector) per cell but rather one entry per cell (which we can achieve through the approach above with “Subcolumns”)
  • and allow users to place this in their DataFrame with df["pca"] = ....

The problem we’re now facing with our implementation:

When inserting a big matrix, so a DF with maybe 1000 subcolumns, pandas starts acting weird due to its block manager. See HERE for a great introduction to the topic. We’re basically looking for a way to performantly add many many columns to a DF.

Two things happen:

  • Pandas tries to consolidate columns of the same dtype into “blocks”, which requires copying data around. If we now insert 5000 new columns, all the data has to be copied instead of just referenced.

  • Weirdly, when doing

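    >>> import numpy as np
    >>> import pandas as pd
    >>> x = np.random.normal(size=(10000, 5000))
    >>> df_x = pd.DataFrame(x)
    >>> y = np.random.normal(size=(10000, 5000))
    >>> # Label the new columns 5000..9999 so they do not clash with 0..4999
    >>> df_y = pd.DataFrame(y, columns=np.arange(5000, 10000))
    >>> df_x[df_y.columns] = df_y
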

    internally, when looking at the blocks, pandas has one block for the first 5k columns and then one separate block for each of the next 5k columns, so 5k more blocks (we can see this by looking at df_x._data).

So our actual issue seems to be the block manager, which is not designed for a use case with thousands of columns and forces pandas to copy data around.
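
For comparison, one commonly suggested way to add many columns in a single step is `pd.concat(..., axis=1)` rather than column assignment. The sketch below uses smaller shapes than the example above so it runs quickly. It is not something this thread settled on: it returns a new DataFrame instead of modifying `df_x` in place, so it does not give users the `df["pca"] = ...` workflow described earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Smaller stand-ins for the 10000 x 5000 matrices used above.
df_x = pd.DataFrame(rng.normal(size=(100, 50)))
df_y = pd.DataFrame(rng.normal(size=(100, 50)), columns=np.arange(50, 100))

# Join all new columns in one operation instead of assigning them into df_x.
combined = pd.concat([df_x, df_y], axis=1)
```

Whether this avoids the per-column blocks depends on the pandas version, so it is worth checking against the version in use.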

We’re investigating this 🕵️‍♂️ 🔦
