Swifter incorrectly comparing results of pandas and dask applies
Wrong comparison of 2 pandas Series
swifter==0.302
dask==2.14.0
pandas==1.0.3
file swifter.py:289
self._validate_apply(
tmp_df.equals(meta), error_message="Dask apply sample does not match pandas apply sample."
)
It compares 2 Series (in my case), shown here as (index, value) pairs and cut off in the output:
meta = (0, 0.002689316907153639) (1, 0.0020169299881876556) (2, 0.0021525252276455888) (0, 0.0023806118812282305) (1, 0.002263126767581785) (2, 0.0023925398049665803) (0, 0.002505102859306909) (1, 0.0019670994913228703) (2, 0.0020991911781853417) (0, 0.0023227029
tmp_df = (0, 0.002689316907153639) (0, 0.001975845345081718) (0, 0.002583021140662563) (0, 0.0022234801671238607) (0, 0.0021956561811590993) (0, 0.0023227029344862734) (0, 0.0027618463320199546) (0, 0.0023806118812282305) (0, 0.002505102859306909) (1, 0.00196709949
tmp_df.equals(meta) is False, while:
tmp_df.sort_index().equals(meta.sort_index()) is True
tmp_df.sort_values().equals(meta.sort_values()) is True
Is it correct?
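A minimal sketch of the behaviour described above, using made-up values rather than the data from this issue: two Series hold the same (index, value) pairs in a different row order, so a plain `equals` fails while the sorted comparison succeeds. The names `pandas_sample` and `dask_sample` are illustrative only and are not swifter's variable names.

```python
import pandas as pd

# Hypothetical data: identical (index, value) pairs, different row order,
# with a duplicated index label (0) as in the report above.
pandas_sample = pd.Series([0.1, 0.2, 0.3, 0.4], index=[0, 1, 2, 0])
dask_sample = pd.Series([0.1, 0.4, 0.2, 0.3], index=[0, 0, 1, 2])

# equals() compares position by position, so row order matters.
order_sensitive = pandas_sample.equals(dask_sample)

# Sorting both sides by value makes the comparison order-insensitive
# (the values here are distinct, so the sorted order is unambiguous).
by_value = pandas_sample.sort_values().equals(dask_sample.sort_values())

# Sorting by index also works here; a stable sort (mergesort) keeps rows
# that share an index label in their original relative order.
by_index = pandas_sample.sort_index(kind="mergesort").equals(
    dask_sample.sort_index(kind="mergesort")
)

print(order_sensitive, by_value, by_index)
```

With duplicate index labels, the result of sorting by index depends on how ties are ordered, so sorting by value (or resetting the index first) is likely the more robust check.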
PS. I suggest renaming tmp_df and meta to more descriptive names that show which result came from which part (e.g. dask_smpl_apply_df and pd_smpl_apply_df).
PPS. For some reason, Dask reduces the number of partitions for the dataframe from 16 to 5.
Issue Analytics
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Hi @sann05 ,
I figured out what the issue was. Basically, the equality between the dask and pandas results was lost because of an indexing issue in the original CSV: on load into pandas, the index repeats 0, 1, 2 (see the image below).
A simple solution that works currently is to reset the index before using pandas or swifter.
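As a rough illustration of that workaround, assuming a small made-up frame that mimics the repeated 0, 1, 2 index (the column name `x` is hypothetical), resetting the index could look like this:

```python
import pandas as pd

# Hypothetical frame mimicking a CSV that loads with a repeated 0,1,2 index.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=[0, 1, 2, 0, 1, 2],
)

had_duplicates = df.index.has_duplicates  # True for the frame above

# drop=True discards the old labels and installs a fresh 0..n-1 index.
df = df.reset_index(drop=True)

# At this point the apply (pandas or swifter) would be called on df;
# swifter itself is not imported in this sketch.
print(had_duplicates, df.index.is_unique)
```

Using `drop=True` avoids adding the old, duplicated labels back as an extra column.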
However, you raise a really important point. Swifter silently fails to improve performance when there are duplicate indices, i.e. the indices haven’t been reset. This needs to be handled better. I am curious what your opinion is. My thought is to raise a warning if someone tries to use swifter on a dataframe with duplicate indices, reminding them to reset before calling swifter. Let me know what kind of solution you think would be best, thanks!
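One way the proposed warning might look is sketched below. This is not swifter's actual implementation; `warn_on_duplicate_index` is a hypothetical helper name and the message text is only an example.

```python
import warnings

import pandas as pd


def warn_on_duplicate_index(obj):
    """Hypothetical helper: warn when a Series/DataFrame index has duplicates.

    Returns True if a warning was issued, False otherwise.
    """
    if not obj.index.is_unique:
        warnings.warn(
            "Index contains duplicate labels; consider calling "
            "reset_index(drop=True) before using swifter.",
            UserWarning,
        )
        return True
    return False


# Usage: one Series with a duplicated label, one with the default index.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_dup = warn_on_duplicate_index(pd.Series([1.0, 2.0], index=[0, 0]))
    warned_ok = warn_on_duplicate_index(pd.Series([1.0, 2.0]))

print(warned_dup, warned_ok, len(caught))
```

A warning rather than an exception keeps existing code running while still pointing the user at the index problem.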
Thanks for your input, Jason
Follow-up: resolving this issue in #108