Can we avoid having a cell with a list?

See original GitHub issue

As we know, storing a list in a Pandas cell is really not recommended. TokenSeries and VectorSeries, two of the core ideas of (current) Texthero, both do exactly that. Can this be avoided?

Need to discuss:

Alternatives using sub-columns (which is still a MultiIndex). We need to understand how complex and flexible this solution is. In 99% of cases, the standard Pandas/Texthero user does not really know how to work with a MultiIndex …
Can we just use RepresentationSeries? Probably not, as we cannot merge it into a DataFrame with a single index. Are there alternatives other than data alignment with reindex (which is too complicated)?
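
To make the trade-off concrete, here is a minimal sketch of the two layouts under discussion: one list per cell, and the same data as sub-columns under a MultiIndex. The column names (`text`, `pca`) and the values are made up for illustration and are not actual Texthero output.

```python
import pandas as pd

df = pd.DataFrame({"text": ["first doc", "second doc"]})

# Layout 1: one list per cell (the approach the issue wants to avoid).
df["pca"] = pd.Series([[0.1, 0.2], [0.3, 0.4]], index=df.index)
first_vector = df.loc[0, "pca"]  # a plain Python list stored in an object column

# Layout 2: one value per cell, grouped under a top-level "pca" label.
vectors = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=df.index,
    columns=pd.MultiIndex.from_product([["pca"], [0, 1]]),
)

# To combine both, the left frame needs two column levels as well.
left = df[["text"]].copy()
left.columns = pd.MultiIndex.from_product([["text"], [""]])
wide = pd.concat([left, vectors], axis=1)

# Selecting the top-level label returns the sub-columns as a DataFrame.
pca_block = wide["pca"]
```

The extra step of giving the left frame a second column level is the kind of MultiIndex handling that the average user is unlikely to be comfortable with.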

@mk2510 @henrifroese

Issue Analytics

State:
Created 3 years ago
Comments: 18 (5 by maintainers)

Top GitHub Comments

henrifroese commented, Aug 18, 2020

After some more discussion, we have come to the following conclusions:

Keep VectorSeries and TokenSeries as-is. Reason: inserting the sub-column alternative into existing DataFrames is too expensive for users.
Change the RepresentationSeries output of tfidf, count, and term_frequency to a new sparse VectorDF, which we will henceforth call DocumentTermDF. Reason: it looks nicer and is just as performant.
All dimensionality reduction functions and clustering functions will support both DocumentTermDF and VectorSeries input.
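
As a rough illustration of what a sparse document-term DataFrame could look like, here is a minimal sketch using pandas' built-in sparse dtype. This is an assumption about the general shape only: the toy documents are invented, and the actual DocumentTermDF layout in Texthero may differ (for example in how its columns are labelled).

```python
import pandas as pd

# Toy tokenized documents, invented for this example.
docs = [["hello", "world"], ["hello", "pandas", "pandas"]]
vocab = sorted({tok for doc in docs for tok in doc})

# Dense term counts first (fine for a toy example), one row per document.
counts = pd.DataFrame(
    [[doc.count(term) for term in vocab] for doc in docs],
    columns=vocab,
)

# Store every column as a sparse array so that zeros are not materialized.
doc_term_df = counts.astype(pd.SparseDtype("int64", fill_value=0))

# Fraction of values that are explicitly stored (non-zero here).
density = doc_term_df.sparse.density

# Convert back to a dense NumPy matrix when a downstream function needs one.
dense_matrix = doc_term_df.sparse.to_dense().to_numpy()
```

Each cell holds a single number, so the frame prints like an ordinary table while only the non-zero counts take up memory.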

henrifroese commented, Aug 16, 2020

(Not so positive) update:

We have sadly now noticed the following, which means that what we’re doing in this issue overall, not just the sparse part, may not be a viable solution after all 😕:

So our main issue is that we want to

store a matrix in a DataFrame in a way that looks nice, i.e. with one matrix entry per cell rather than a whole matrix row per cell (which we can achieve through the “Subcolumns” approach above)
and allow users to place this in their DataFrame with df["pca"] = ....

The problem we’re now facing with our implementation:

When inserting a big matrix, i.e. a DataFrame with maybe 1000 subcolumns, pandas starts behaving strangely because of its block manager. See HERE for a great introduction to the topic. We’re basically looking for a performant way to add a very large number of columns to a DataFrame.

Two things happen:

Pandas tries to consolidate columns of the same dtype into “blocks”, which requires copying data around. If we now insert 5000 new columns, all the data has to be copied instead of just referenced.
Weirdly, when doing
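```
>>> import numpy as np
>>> import pandas as pd
>>> x = np.random.normal(size=(10000, 5000))
>>> df_x = pd.DataFrame(x)
>>> y = np.random.normal(size=(10000, 5000))
>>> # Label the new columns 5000..9999 so they do not collide with df_x's 0..4999.
>>> df_y = pd.DataFrame(y, columns=np.arange(5000, 10000))

>>> # Insert all 5000 new columns into df_x in one assignment.
>>> df_x[df_y.columns] = df_y
```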
pandas internally keeps one block for the first 5k columns and then one separate block for each of the next 5k columns, so 5k additional blocks (we can see this by inspecting df_x._data).

So our actual issue seems to be the block manager, which is not designed for a use case with thousands of columns and forces pandas to copy data around.
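
For anyone who wants to look at the block layout themselves, here is a small sketch that counts blocks after column assignment and after building a new frame with `pd.concat`. It is an exploration aid under stated assumptions, not a fix: the `n_blocks` helper is hypothetical and relies on private pandas attributes (`_mgr`, or `_data` in older versions), and the number of blocks it reports can differ between pandas versions. The sizes are kept small so it runs quickly.

```python
import numpy as np
import pandas as pd


def n_blocks(df):
    """Hypothetical helper: count internal blocks via private pandas attributes."""
    mgr = getattr(df, "_mgr", None)  # name used by newer pandas versions
    if mgr is None:
        mgr = df._data  # older name, as used in the comment above
    return mgr.nblocks


rng = np.random.default_rng(0)
df_x = pd.DataFrame(rng.normal(size=(100, 20)))
df_y = pd.DataFrame(rng.normal(size=(100, 20)), columns=np.arange(20, 40))

# Variant A: assign the new columns into a copy of the existing frame.
assigned = df_x.copy()
assigned[df_y.columns] = df_y

# Variant B: build a new frame in one step instead of inserting columns.
combined = pd.concat([df_x, df_y], axis=1)

print("blocks after assignment:", n_blocks(assigned))
print("blocks after concat:    ", n_blocks(combined))
```

Both variants hold the same values; only the internal layout may differ. Note that `pd.concat` returns a new DataFrame, so it does not by itself give users the `df["pca"] = ...` workflow described above.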

We’re investigating this 🕵️‍♂️ 🔦