Swifter incorrectly comparing results of pandas and dask applies


Wrong comparison of 2 pandas Series

swifter==0.302
dask==2.14.0 
pandas==1.0.3

file swifter.py:289

self._validate_apply(
   tmp_df.equals(meta), error_message="Dask apply sample does not match pandas apply sample."
)

It compares two Series (in my case):

meta = (0, 0.002689316907153639) (1, 0.0020169299881876556) (2, 0.0021525252276455888) (0, 0.0023806118812282305) (1, 0.002263126767581785) (2, 0.0023925398049665803) (0, 0.002505102859306909) (1, 0.0019670994913228703) (2, 0.0020991911781853417) (0, 0.0023227029

tmp_df = (0, 0.002689316907153639) (0, 0.001975845345081718) (0, 0.002583021140662563) (0, 0.0022234801671238607) (0, 0.0021956561811590993) (0, 0.0023227029344862734) (0, 0.0027618463320199546) (0, 0.0023806118812282305) (0, 0.002505102859306909) (1, 0.00196709949

tmp_df.equals(meta) - False

while

tmp_df.sort_index().equals(meta.sort_index()) - True
tmp_df.sort_values().equals(meta.sort_values()) - True
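The three results above are consistent with how pandas `Series.equals` behaves: it compares index labels and values position by position, so two Series that hold the same (index, value) pairs in a different row order are unequal until both are sorted. A minimal sketch with made-up values follows; the names `pd_sample` and `dask_sample` are hypothetical and do not come from swifter.

```python
import pandas as pd

# Hypothetical samples: the same (index, value) pairs in a different row order.
pd_sample = pd.Series([0.1, 0.2, 0.3, 0.4], index=[0, 0, 1, 1])
dask_sample = pd.Series([0.1, 0.3, 0.2, 0.4], index=[0, 1, 0, 1])

# equals() is order-sensitive, so the unsorted comparison fails.
same_as_is = pd_sample.equals(dask_sample)

# Sorting both sides by value lines the rows up again.
same_sorted_values = pd_sample.sort_values().equals(dask_sample.sort_values())

# Sorting by index also works here; a stable sort (mergesort) keeps the
# original relative order of rows that share an index label.
same_sorted_index = pd_sample.sort_index(kind="mergesort").equals(
    dask_sample.sort_index(kind="mergesort")
)

print(same_as_is, same_sorted_values, same_sorted_index)
```

With duplicate index labels, an index sort alone does not always guarantee matching row order, which is one reason a unique index makes this kind of check more reliable.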

Is it correct?

PS. I suggest renaming tmp_df and meta to more descriptive names that show which one came from which step (e.g. dask_smpl_apply_df and pd_smpl_apply_df).

PPS. For some reason, Dask reduces the number of partitions for the dataframe from 16 to 5.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

jmcarpenter2 commented, Apr 27, 2020

Hi @sann05 ,

I figured out what the issue was. Basically, the equality between dask and pandas was lost because of an indexing issue in the original csv. See the image below with repeated 0,1,2 indices on load into pandas.

[Screenshot: Screen Shot 2020-04-27 at 7 53 54 AM]

A simple solution that works currently is to reset the index before using pandas or swifter.

[Screenshot: Screen Shot 2020-04-27 at 7 55 38 AM]

[Screenshot: Screen Shot 2020-04-27 at 7 58 07 AM]
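Since the screenshots are not reproduced here, the reset-index workaround can be sketched with pandas alone. This assumes a frame whose index repeats 0, 1, 2, built here by concatenating three small frames; the column name `x` and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose index repeats 0, 1, 2 (as in the report).
chunks = [pd.DataFrame({"x": [1.0, 2.0, 3.0]}) for _ in range(3)]
df = pd.concat(chunks)  # index: 0, 1, 2, 0, 1, 2, 0, 1, 2

has_duplicates_before = df.index.has_duplicates

# Replace the repeated labels with a fresh 0..n-1 index.
df = df.reset_index(drop=True)

has_duplicates_after = df.index.has_duplicates
print(has_duplicates_before, has_duplicates_after)
```

Passing `drop=True` discards the old labels instead of keeping them as a new column. The apply call would then run on the re-indexed frame.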

However, you raise a really important point. Swifter silently fails to improve performance when there are duplicate indices, i.e. the indices haven’t been reset. This needs to be handled better. I am curious what your opinion is. My thought is to raise a warning if someone tries to use swifter on a dataframe with duplicate indices, reminding them to reset before calling swifter. Let me know what kind of solution you think would be best, thanks!

Thanks for your input, Jason
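The warning proposed in this comment could look roughly like the sketch below. The helper `warn_if_duplicate_index` is hypothetical: it is not part of swifter and is not necessarily how the issue was eventually resolved. It only illustrates the idea of checking index uniqueness up front.

```python
import warnings

import pandas as pd


def warn_if_duplicate_index(obj):
    """Hypothetical helper (not swifter API): warn if the index has duplicates.

    Returns True when a warning was issued, False otherwise.
    """
    if not obj.index.is_unique:
        warnings.warn(
            "Index contains duplicate labels; consider calling "
            "reset_index(drop=True) before applying.",
            UserWarning,
        )
        return True
    return False


# Usage with made-up data: the first Series has a repeated label 0.
duplicated = pd.Series([1.0, 2.0], index=[0, 0])
unique = pd.Series([1.0, 2.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_for_duplicated = warn_if_duplicate_index(duplicated)
    warned_for_unique = warn_if_duplicate_index(unique)

print(warned_for_duplicated, warned_for_unique, len(caught))
```

A warning, unlike an exception, still lets the apply run, which matches the "remind them to reset" intent described above.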

jmcarpenter2 commented, Apr 28, 2020

Follow-up: resolving this issue in #108


