[BUG] drop_duplicates does not drop all duplicates
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
- Modin version (`modin.__version__`): 0.8.0
- Python version: 3.6.9
- Code we can use to reproduce:
import modin.pandas as pd
import pandas
pdf = pandas.DataFrame(
    [[5, 'ssssss0'], [0, 'ssssss4'], [3, 'ssssss5'], [3, 'ssssss5'], [7, 'ssssss6'],
     [9, 'ssssss8'], [3, 'ssssss4'], [5, 'ssssss1'], [2, 'ssssss4'], [4, 'ssssss9'],
     [7, 'ssssss8'], [6, 'ssssss1'], [8, 'ssssss1'], [8, 'ssssss7'], [1, 'ssssss9'],
     [6, 'ssssss9'], [7, 'ssssss3'], [7, 'ssssss6'], [8, 'ssssss7'], [1, 'ssssss2'],
     [5, 'ssssss0'], [9, 'ssssss3'], [8, 'ssssss5'], [9, 'ssssss9'], [4, 'ssssss4'],
     [3, 'ssssss4'], [0, 'ssssss6'], [3, 'ssssss4'], [5, 'ssssss4'], [0, 'ssssss3'],
     [2, 'ssssss4'], [3, 'ssssss4'], [8, 'ssssss8'], [1, 'ssssss4'], [3, 'ssssss3'],
     [3, 'ssssss7'], [3, 'ssssss5'], [7, 'ssssss5'], [0, 'ssssss0'], [1, 'ssssss1'],
     [9, 'ssssss5'], [9, 'ssssss9'], [0, 'ssssss3'], [4, 'ssssss0'], [7, 'ssssss5'],
     [3, 'ssssss0'], [2, 'ssssss1'], [7, 'ssssss2'], [2, 'ssssss4'], [0, 'ssssss2']],
    columns=['a', 'b'],
)
mdf = pd.DataFrame(pdf)
print(pdf.drop_duplicates().shape)
print(mdf.drop_duplicates().shape)
Describe the problem
With the Ray backend, Modin's drop_duplicates does not drop all duplicates: for the same data, pandas returns 37 rows while Modin returns 41.
Source code / logs
(37, 2)
(41, 2)
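To check which of the two shapes is right without relying on drop_duplicates itself, the distinct rows can be counted independently. The sketch below uses plain pandas only; `count_unique_rows` is a hypothetical helper (not part of pandas or Modin), and `sample` is just the first few rows of the report plus one repeated row, not the full data.

```python
import pandas

def count_unique_rows(df):
    # Hypothetical helper: turn each row into a plain tuple and let a set
    # discard the repeats, so drop_duplicates is not involved at all.
    return len(set(df.itertuples(index=False, name=None)))

# A few rows in the same shape as the report (mixed int and str columns).
sample = pandas.DataFrame(
    [[5, 'ssssss0'], [0, 'ssssss4'], [3, 'ssssss5'],
     [3, 'ssssss5'], [7, 'ssssss6'], [5, 'ssssss0']],
    columns=['a', 'b'],
)

expected_rows = count_unique_rows(sample)
# duplicated() marks every repeat of an earlier row, so this counts the
# rows that drop_duplicates should remove.
n_duplicates = int(sample.duplicated().sum())

print(expected_rows, n_duplicates)
print(sample.drop_duplicates().shape)
```

Running the same helper on the full `pdf` from the report gives a number to compare against both `pdf.drop_duplicates().shape[0]` and `mdf.drop_duplicates().shape[0]`.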
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (4 by maintainers)
@devin-petersohn Everything works when the `dtype` of the columns is numeric or boolean (our tests check this), but when columns have different dtypes (numeric and string, for example) we hit the problem described here (our tests don't check this). I will fix this in #1994.

There should be a reduction because it is happening with a call to `apply`, which handles the whole axis together.
https://github.com/modin-project/modin/blob/f6b60404f66a98c1371791178f86613e362e52f9/modin/pandas/dataframe.py#L288-L292
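As a toy illustration of why the whole axis has to be handled together, here is a plain-pandas sketch. It is not Modin's code and does not claim to be the root cause of this report; `dedup_per_partition` is a hypothetical helper that models what would happen if each row block were deduplicated on its own with no reduction across blocks.

```python
import pandas

def dedup_per_partition(df, n_parts):
    # Hypothetical model: split the rows into blocks, deduplicate each block
    # independently, and concatenate the results with no cross-block step.
    if df.empty:
        return df
    size = max(1, -(-len(df) // n_parts))  # ceiling division
    parts = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    return pandas.concat([part.drop_duplicates() for part in parts])

df = pandas.DataFrame(
    {'a': [3, 3, 7, 3, 7],
     'b': ['ssssss5', 'ssssss5', 'ssssss6', 'ssssss5', 'ssssss6']}
)

per_partition = dedup_per_partition(df, n_parts=2)
whole_axis = df.drop_duplicates()

# Rows repeated across blocks survive the per-block pass.
print(per_partition.shape)
# Looking at the whole axis at once removes them.
print(whole_axis.shape)
```

In this model, a final pass over the concatenated result (the reduction) is what brings the per-block output down to the whole-axis answer.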