Can we avoid having a cell with a list?

As we know, it's really not recommended to store a list in a Pandas cell. `TokenSeries` and `VectorSeries`, two of the core ideas of (current) Texthero, actually do exactly that. Can this be avoided?
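For context, the discouraged pattern looks like this (a minimal sketch of what a `TokenSeries`-style column stores):

```python
import pandas as pd

# A TokenSeries-style column: every cell holds a Python list of tokens.
tokens = pd.Series([["hello", "world"], ["texthero", "is", "great"]])

# The dtype is plain `object`: pandas cannot vectorize operations on it,
# so everything falls back to slow Python-level loops over the lists.
print(tokens.dtype)  # object
```

Because the cells are opaque Python objects, no pandas operation on such a column can use fast vectorized code paths.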
Need to discuss:
- Alternatives using sub-columns (it's still a MultiIndex). Understand how complex and flexible this solution is. In 99% of the cases, the standard Pandas/Texthero user does not really know how to work with a MultiIndex…
- Can we just use `RepresentationSeries`? Probably not, as we cannot merge it into a DataFrame with a single index. Are there alternatives other than data alignment with `reindex` (too complicated)?
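A minimal sketch of the sub-column idea (the "pca" label here is made up for illustration): each vector dimension becomes a regular numeric column grouped under a columns MultiIndex, instead of a list in a cell.

```python
import pandas as pd

# Sub-column alternative: one numeric column per vector dimension,
# grouped under a top-level "pca" label via a columns MultiIndex.
df = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    columns=pd.MultiIndex.from_product([["pca"], [0, 1, 2]]),
)

# Each dimension is now an ordinary float column instead of an object cell...
print(df["pca"].dtypes.unique())  # float64 only

# ...but selecting a single dimension requires MultiIndex syntax,
# which many casual pandas users are not familiar with.
print(df[("pca", 1)].tolist())  # [0.2, 0.5]
```

This is what makes the approach performant (vectorized numeric columns) and, at the same time, what makes it hard to sell to users who have never touched a MultiIndex.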
Issue Analytics
- State:
- Created 3 years ago
- Comments: 18 (5 by maintainers)
After some more discussions, we have come to the following conclusions:

Conclusions:
- Keep `VectorSeries` and `TokenSeries` as-is. Reason: inserting into existing DataFrames is too expensive for users.
- Change the `RepresentationSeries` output of `tfidf`, `count`, `term_frequency` to a new sparse `VectorDF` that we will henceforth call `DocumentTermDF`. Reason: it looks nicer and is just as performant.
- Support both `DocumentTermDF` and `VectorSeries` as input.

(Not so positive) update:
We have sadly now noticed this (so what we’re doing not just with Sparse stuff but overall in this issue is maybe not a viable solution after all 😕):
So our main issue is that we want to support `df["pca"] = ...`.

The problem we're now facing with our implementation:
When inserting a big matrix, i.e. a DataFrame with maybe 1000 sub-columns, pandas starts acting weird due to its block manager. See HERE for a great introduction to the topic. We're basically looking for a way to performantly add many, many columns to a DataFrame.
Two things happen:
1. Pandas tries to consolidate columns of the same dtype into "blocks", which requires copying data around. If we now insert 5000 new columns, all the data has to be copied instead of just referenced.
2. Weirdly, when doing this, pandas internally ends up with one block for the first 5k columns and then one block for each single column in the next 5k columns, so 5k additional blocks (we can see this by looking at `df_x._data`).

So our actual issue seems to be the block manager: it is not designed for a use case with thousands of columns and forces pandas to copy data around.
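The fragmentation is easy to reproduce. A sketch below, noting that `_mgr` (the newer spelling of `_data`) is a private pandas attribute, so the exact block counts are version-dependent:

```python
import numpy as np
import pandas as pd

# Building a frame from one array gives a single consolidated block.
df_x = pd.DataFrame(np.zeros((10, 100)))
print(len(df_x._mgr.blocks))  # 1

# Inserting columns one by one appends a new block per column;
# pandas only consolidates them again later, by copying everything.
for i in range(100, 150):
    df_x[i] = np.zeros(10)
print(len(df_x._mgr.blocks))  # many blocks now, not 1
```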
We’re investigating this 🕵️♂️ 🔦
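One candidate mitigation we could explore (a sketch, not a decided API; the column names are made up): build all new columns as a single DataFrame first and attach it with one `pd.concat`, so the data arrives as one consolidated block instead of thousands of per-column blocks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": ["doc one", "doc two", "doc three"]})

# Build the whole 1000-column matrix up front as a single DataFrame
# (one block), instead of assigning 1000 columns in a loop.
wide = pd.DataFrame(
    np.random.rand(3, 1000),
    columns=[f"pca_{i}" for i in range(1000)],
    index=df.index,
)

# A single concat attaches everything without per-column block churn.
out = pd.concat([df, wide], axis=1)
print(out.shape)  # (3, 1001)
```

This avoids the repeated copy/consolidate cycle because the block manager only ever sees one insertion of one homogeneous block.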