Swifter incorrectly comparing results of pandas and dask applies

Wrong comparison of 2 pandas Series

swifter==0.302
dask==2.14.0 
pandas==1.0.3

file swifter.py:289

self._validate_apply(
   tmp_df.equals(meta), error_message="Dask apply sample does not match pandas apply sample."
)

It compares 2 Series (in my case):

meta =
(0, 0.002689316907153639)
(1, 0.0020169299881876556)
(2, 0.0021525252276455888)
(0, 0.0023806118812282305)
(1, 0.002263126767581785)
(2, 0.0023925398049665803)
(0, 0.002505102859306909)
(1, 0.0019670994913228703)
(2, 0.0020991911781853417)
(0, 0.0023227029

tmp_df =
(0, 0.002689316907153639)
(0, 0.001975845345081718)
(0, 0.002583021140662563)
(0, 0.0022234801671238607)
(0, 0.0021956561811590993)
(0, 0.0023227029344862734)
(0, 0.0027618463320199546)
(0, 0.0023806118812282305)
(0, 0.002505102859306909)
(1, 0.00196709949

tmp_df.equals(meta) returns False, while both of the following return True:

tmp_df.sort_index().equals(meta.sort_index())
tmp_df.sort_values().equals(meta.sort_values())

Is this behavior correct?
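The behavior reported above can be reproduced with plain pandas. The sketch below is only an illustration: the values and the names pandas_result and dask_result are made up, and it assumes the two samples hold the same (index, value) pairs in a different row order. Series.equals compares index and values position by position, so row order matters. A stable sort kind is used for sort_index because the index labels repeat.

```python
import pandas as pd

# Hypothetical stand-ins for the two samples: the same (index, value) pairs,
# with a repeated 0, 1, 2 index, but in a different row order.
pandas_result = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=[0, 1, 2, 0, 1, 2])
dask_result = pandas_result.iloc[[0, 3, 1, 4, 2, 5]]  # index becomes 0, 0, 1, 1, 2, 2

# equals() is positional, so a different row order makes it fail.
strict_equal = pandas_result.equals(dask_result)

# After sorting both sides the same way, the comparison succeeds.
equal_by_value = pandas_result.sort_values().equals(dask_result.sort_values())
equal_by_index = pandas_result.sort_index(kind="mergesort").equals(
    dask_result.sort_index(kind="mergesort")
)

print(strict_equal, equal_by_value, equal_by_index)
```

Sorting before comparing hides the ordering difference, but it does not explain where the difference comes from.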

PS. I suggest renaming tmp_df and meta to more descriptive names that show which one came from which part (e.g. dask_smpl_apply_df and pd_smpl_apply_df).

PPS. For some reason, Dask reduces the number of partitions for the dataframe from 16 to 5.

Issue Analytics

State:
Created 3 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

jmcarpenter2 commented, Apr 27, 2020

Hi @sann05,

I figured out what the issue was. Basically, the equality between dask and pandas was lost because of an indexing issue in the original CSV. See the image below, which shows the repeated 0, 1, 2 indices on load into pandas.

[Image: Screen Shot 2020-04-27 at 7 53 54 AM]

A simple solution that works currently is to reset the index before using pandas or swifter.

[Image: Screen Shot 2020-04-27 at 7 55 38 AM]

[Image: Screen Shot 2020-04-27 at 7 58 07 AM]
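Since the screenshots are not reproduced here, the following is a minimal sketch of the reset-index workaround. The data is made up: it simulates a repeated 0, 1, 2 index by concatenating small frames, which is only an assumption about how such an index might arise.

```python
import pandas as pd

# Made-up data: concatenating chunks without ignore_index repeats the labels 0, 1, 2.
chunk = pd.DataFrame({"value": [0.1, 0.2, 0.3]})
df = pd.concat([chunk, chunk + 1, chunk + 2])

has_duplicates_before = not df.index.is_unique

# Replace the repeated labels with a fresh 0..n-1 index and drop the old one.
df = df.reset_index(drop=True)

has_duplicates_after = not df.index.is_unique
print(has_duplicates_before, has_duplicates_after)
```

With swifter installed, the apply would then be called on the re-indexed frame, e.g. df["value"].swifter.apply(func) for some function func.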

However, you raise a really important point. Swifter silently fails to improve performance when there are duplicate indices, i.e. the indices haven’t been reset. This needs to be handled better. I am curious what your opinion is. My thought is to raise a warning if someone tries to use swifter on a dataframe with duplicate indices, reminding them to reset before calling swifter. Let me know what kind of solution you think would be best, thanks!
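One way the proposed warning could look is sketched below. This is not swifter's actual implementation: warn_if_duplicate_index is a hypothetical helper, and the message wording is an assumption.

```python
import warnings

import pandas as pd


def warn_if_duplicate_index(obj):
    """Hypothetical helper: warn if a Series/DataFrame has repeated index labels.

    Returns True when duplicates were found, False otherwise.
    """
    if obj.index.is_unique:
        return False
    # duplicated() marks every repeat after the first occurrence of a label.
    n_repeats = int(obj.index.duplicated().sum())
    warnings.warn(
        f"Index has {n_repeats} duplicated label(s); "
        "consider calling reset_index(drop=True) first.",
        UserWarning,
        stacklevel=2,
    )
    return True


df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, index=[0, 1, 0, 1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    had_duplicates = warn_if_duplicate_index(df)
    clean_after_reset = not warn_if_duplicate_index(df.reset_index(drop=True))
```

A warning keeps existing code running while still pointing the user at the fix, whereas raising an error would be stricter.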

Thanks for your input,
Jason

jmcarpenter2 commented, Apr 28, 2020

Follow-up: resolving this issue in #108