
New Feature Request: Add support for drop_duplicates()


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.04
  • Modin installed from (source or binary): binary, pip install modin
  • Modin version: 0.5.0
  • Python version: 3.7.3
  • Exact command to reproduce: Use drop_duplicates()

Describe the problem

drop_duplicates() is not supported today, resulting in the following warnings when it is used:

UserWarning: User-defined function verification is still under development in Modin. The function provided is not verified.
UserWarning: `DataFrame.duplicated` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: `Series.__array__` defaulting to pandas implementation.
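For reference, this is the behavior being requested, sketched in plain pandas (assuming pandas is installed); the issue asks for the same call to run natively in Modin instead of defaulting to the pandas implementation:

```python
import pandas as pd

# Two of the three rows are identical.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# drop_duplicates keeps the first occurrence of each duplicated row by default.
deduped = df.drop_duplicates()
print(len(deduped))  # 2 unique rows remain
```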

Source code / logs

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

1 reaction
devin-petersohn commented, Dec 2, 2019

We may want to put some of this logic into the PandasQueryCompiler so that it can be used by other implementations.

What about this:

in dataframe.py:

    def duplicated(self, subset=None, keep="first"):
        df = self[subset] if subset is not None else self
        # If we are checking more than one column for duplicates, hash each
        # row to a single value that can be compared across rows.
        if len(df.columns) > 1:
            hashed = df.apply(
                lambda s: hash(s.to_numpy().data.tobytes()), axis=1
            ).to_frame()
        else:
            hashed = df
        return hashed.apply(lambda s: s.duplicated(keep=keep)).squeeze(axis=1)

in series.py:

    def duplicated(self, keep="first"):
        return self.to_frame().duplicated(keep=keep)

The base.py code in your answer assumed a dataframe input, so that code probably belongs in dataframe.py.
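The core idea in the snippet above can be shown without any Modin or pandas machinery. This is a pure-Python sketch (it assumes nothing about Modin's internals): reduce each multi-column row to a single hashable value, then mark every occurrence after the first (or before the last) as a duplicate.

```python
def duplicated(rows, keep="first"):
    """Return a list of flags, True where a row repeats an earlier (or
    later, for keep="last") row. Each row is a tuple of column values."""
    # Hash each row down to one value that can be compared across rows.
    hashed = [hash(tuple(row)) for row in rows]
    seen = set()
    flags = []
    # For keep="last", walk the rows in reverse so the last occurrence
    # is the one that survives.
    order = hashed if keep == "first" else list(reversed(hashed))
    for h in order:
        flags.append(h in seen)
        seen.add(h)
    return flags if keep == "first" else flags[::-1]

rows = [(1, "x"), (2, "y"), (1, "x")]
print(duplicated(rows))               # [False, False, True]
print(duplicated(rows, keep="last"))  # [True, False, False]
```

Dropping the duplicates is then just keeping the rows whose flag is False, which is what drop_duplicates() builds on top of duplicated().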

0 reactions
devin-petersohn commented, Dec 9, 2019

Feature added via #892
