Slow Performance of Swifter for Text Preprocessing
Hi @jmcarpenter2,
Dear Swifter Folks,
Recently I found that swifter is 5-10x slower than a vanilla pandas apply when the function being applied is not vectorized (in my case, text preprocessing).
The experiment is like this:
import pandas as pd
import swifter

def clean_text(text):
    text = text.strip()
    text = text.replace(' ', '_')
    return text

N_rows = 7000000
df_data = pd.DataFrame([["i want to break free"]] * N_rows, columns=["text"])
%time df_data['text'] = df_data['text'].swifter.apply(clean_text)
%time df_data['text'] = df_data['text'].apply(clean_text)
Is this expected? Let’s discuss, to make sure I’m not missing something. Thank you!
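As a side note for readers, this particular clean_text happens to have a vectorized equivalent in the pandas string methods (Series.str.strip and Series.str.replace), which can serve as a baseline when comparing against apply. The sketch below is not part of the original report: the timed helper and the smaller row count are assumptions chosen so it runs quickly outside a notebook, and no timings are claimed here.

```python
import time

import pandas as pd


def clean_text(text):
    # Same cleaning steps as in the experiment above.
    text = text.strip()
    text = text.replace(" ", "_")
    return text


def timed(label, fn):
    # Hypothetical helper standing in for the %time notebook magic.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result


n_rows = 100_000  # smaller than the 7,000,000 rows in the report, to keep the run short
texts = pd.Series(["i want to break free"] * n_rows, name="text")

# Row-by-row Python function call.
applied = timed("apply", lambda: texts.apply(clean_text))

# Same transformation expressed with pandas string methods.
vectorized = timed(
    "str methods",
    lambda: texts.str.strip().str.replace(" ", "_", regex=False),
)
```

Both approaches should produce identical values, so any difference is in runtime only; actual numbers will depend on the machine and the pandas version.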
Issue Analytics
- State:
- Created 5 years ago
- Reactions:4
- Comments:26 (10 by maintainers)
Top GitHub Comments
For anyone reading this issue –
If you are processing text data and want to try to increase speed with swifter, try adding allow_dask_on_strings() to your command chain. For example,
df.swifter.allow_dask_on_strings().apply(foo)
will allow swifter to attempt using dask on your text data, which is not allowed by default.
Please see the discussion above for why this is the default. Long story short: it can actually run slower than a pandas apply.
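Applied to the clean_text function from the original report, that chain might look like the sketch below. The apply_clean wrapper and the fallback to a plain pandas apply when swifter is not installed are assumptions added for illustration, and calling allow_dask_on_strings() on a Series accessor is assumed to behave the same way as the DataFrame form shown above.

```python
import pandas as pd

try:
    import swifter  # noqa: F401  (importing registers the .swifter accessor)
    HAVE_SWIFTER = True
except ImportError:
    HAVE_SWIFTER = False


def clean_text(text):
    # Cleaning function from the original report.
    return text.strip().replace(" ", "_")


def apply_clean(series):
    # Hypothetical wrapper: opt in to dask on string data when swifter is
    # available, otherwise fall back to a plain pandas apply.
    if HAVE_SWIFTER:
        return series.swifter.allow_dask_on_strings().apply(clean_text)
    return series.apply(clean_text)


df_data = pd.DataFrame([[" i want to break free "]] * 1000, columns=["text"])
cleaned = apply_clean(df_data["text"])
```

Whether this is faster than a plain apply has to be measured on your own data, since enabling dask on strings can be slower.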
So if you are not seeing a performance boost from swifter and you have text data in your dataframe, try allow_dask_on_strings(). It is more likely to increase speed if the text column of the dataframe is only used as a lookup rather than being mutated by the function call itself.

I haven’t been using the library for a while, but it’s really great to see things resolved with such awesome bug-closing notes. Folks like you make the OSS world the magical place that it is. Thank you so much, Jason!