New Feature Request: Add support for drop_duplicates()
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.04
- Modin installed from (source or binary): binary, pip install modin
- Modin version: 0.5.0
- Python version: 3.7.3
- Exact command to reproduce: Use drop_duplicates()
Describe the problem
drop_duplicates() is not supported natively today; calling it produces the following warnings:
UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
UserWarning: `DataFrame.duplicated` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: `Series.__array__` defaulting to pandas implementation.
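The warnings above suggest that drop_duplicates() is built on top of DataFrame.duplicated, which is the call that falls back to pandas. As a rough illustration of that relationship only (this is not Modin's or pandas' code; the function names and the list-of-tuples row representation are assumptions made for the sketch), the pandas-style keep='first' / 'last' / False semantics can be written in plain Python as a boolean mask followed by a filter:

```python
from collections import Counter


def duplicated(rows, keep="first"):
    """Return one boolean per row: True if the row is considered a duplicate.

    Hypothetical helper mirroring the pandas-style `keep` options:
    'first' keeps the first occurrence, 'last' keeps the last one,
    and False marks every occurrence of a repeated row.
    """
    keys = [tuple(r) for r in rows]
    if keep is False:
        counts = Counter(keys)
        return [counts[k] > 1 for k in keys]
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first', 'last' or False")

    # Walk forwards for 'first', backwards for 'last'; the first time a key
    # is seen in that direction it is the occurrence that survives.
    if keep == "first":
        order = range(len(keys))
    else:
        order = range(len(keys) - 1, -1, -1)

    seen = set()
    mask = [False] * len(keys)
    for i in order:
        if keys[i] in seen:
            mask[i] = True
        else:
            seen.add(keys[i])
    return mask


def drop_duplicates(rows, keep="first"):
    """Filter out the rows that `duplicated` marks, preserving row order."""
    mask = duplicated(rows, keep)
    return [r for r, is_dup in zip(rows, mask) if not is_dup]


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
first = drop_duplicates(rows)               # keep first occurrences
last = drop_duplicates(rows, keep="last")   # keep last occurrences
none = drop_duplicates(rows, keep=False)    # drop every repeated row
```

A distributed implementation would also have to coordinate the "seen" state across partitions, which this single-process sketch does not attempt.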
Source code / logs
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (8 by maintainers)
Top GitHub Comments
We may want to put some of this logic into the PandasQueryCompiler so that it can be used by other implementations.
What about this:
in dataframe.py:
in series.py:
The base.py code in your answer assumed a dataframe input, so that code probably belongs in dataframe.py.
Feature added via #892