
New Feature Request: Add support for drop_duplicates()


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.04
  • Modin installed from (source or binary): binary, pip install modin
  • Modin version: 0.5.0
  • Python version: 3.7.3
  • Exact command to reproduce: Use drop_duplicates()

Describe the problem

drop_duplicates() is not supported today, resulting in the following warnings when it is used:

UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
UserWarning: `DataFrame.duplicated` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: `Series.__array__` defaulting to pandas implementation.
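For reference, this is the behavior being requested, sketched in plain pandas (assuming pandas is installed); the issue asks for the same call to run natively in Modin instead of defaulting to the pandas implementation:

```python
import pandas as pd

# Two of the three rows are identical.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# drop_duplicates keeps the first occurrence of each duplicated row by default.
deduped = df.drop_duplicates()
print(len(deduped))  # 2 unique rows remain
```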

Source code / logs

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

1 reaction
devin-petersohn commented, Dec 2, 2019

We may want to put some of this logic into the PandasQueryCompiler so that it can be used by other implementations.

What about this:

in dataframe.py:

    def duplicated(self, subset=None, keep="first"):
        df = self[subset] if subset is not None else self
        # If we are checking more than one column for duplicates, hash each
        # row to a single value that can be compared across rows.
        if len(df.columns) > 1:
            hashed = df.apply(
                lambda s: hash(s.to_numpy().data.tobytes()), axis=1
            ).to_frame()
        else:
            hashed = df
        return hashed.apply(lambda s: s.duplicated(keep=keep)).squeeze(axis=1)

in series.py:

    def duplicated(self, keep="first"):
        return self.to_frame().duplicated(keep=keep)

The base.py code in your answer assumed a dataframe input, so that code probably belongs in dataframe.py.
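The core idea in the snippet above can be shown without any Modin or pandas machinery. This is a pure-Python sketch (it assumes nothing about Modin's internals): reduce each multi-column row to a single hashable value, then mark every occurrence after the first (or before the last) as a duplicate.

```python
def duplicated(rows, keep="first"):
    """Return a list of flags, True where a row repeats an earlier (or
    later, for keep="last") row. Each row is a tuple of column values."""
    # Hash each row down to one value that can be compared across rows.
    hashed = [hash(tuple(row)) for row in rows]
    seen = set()
    flags = []
    # For keep="last", walk the rows in reverse so the last occurrence
    # is the one that survives.
    order = hashed if keep == "first" else list(reversed(hashed))
    for h in order:
        flags.append(h in seen)
        seen.add(h)
    return flags if keep == "first" else flags[::-1]

rows = [(1, "x"), (2, "y"), (1, "x")]
print(duplicated(rows))               # [False, False, True]
print(duplicated(rows, keep="last"))  # [True, False, False]
```

Dropping the duplicates is then just keeping the rows whose flag is False, which is what drop_duplicates() builds on top of duplicated().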

0 reactions
devin-petersohn commented, Dec 9, 2019

Feature added via #892
