
dask dataframe drop_duplicates support?

See original GitHub issue

Edit:

This works:

>>> df = pd.DataFrame({"A": [1, 2] * 4, "B": [1] * 4 + [2] * 4})
>>> a = dd.from_pandas(df, 2)
>>> a.drop_duplicates(subset=['A']).compute()

But this fails when subset is specified positionally.

>>> a.drop_duplicates(['A']).compute()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-ad2309a8f442> in <module>
      3 df = dd.from_pandas(pd.DataFrame({'a': [0, 0, 1], 'b': range(3)}), npartitions=1)
      4
----> 5 df.drop_duplicates(['a']).compute()

~/sandbox/dask/dask/dataframe/core.py in drop_duplicates(self, split_every, split_out, **kwargs)
    506             split_out_setup=split_out_setup,
    507             split_out_setup_kwargs=split_out_setup_kwargs,
--> 508             **kwargs
    509         )
    510

~/sandbox/dask/dask/dataframe/core.py in apply_concat_apply(args, chunk, aggregate, combine, meta, token, chunk_kwargs, aggregate_kwargs, combine_kwargs, split_every, split_out, split_out_setup, split_out_setup_kwargs, **kwargs)
   4597     elif split_every is False:
   4598         split_every = npartitions
-> 4599     elif split_every < 2 or not isinstance(split_every, Integral):
   4600         raise ValueError("split_every must be an integer >= 2")
   4601

TypeError: '<' not supported between instances of 'list' and 'int'
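The cause is visible in the traceback: because subset is not a named parameter, the positional list binds to split_every and then hits the `split_every < 2` comparison. A minimal, hypothetical stand-in for the signature (not the real dask code) reproduces the failure:

```python
from numbers import Integral

# Hypothetical, simplified stand-in for the dask signature at the time:
# the first positional slot is split_every, not subset, so a positional
# list binds to split_every and later reaches the `split_every < 2` check.
def drop_duplicates(split_every=None, split_out=1, **kwargs):
    if split_every is None:
        split_every = 8
    elif split_every is False:
        split_every = 1
    elif split_every < 2 or not isinstance(split_every, Integral):
        raise ValueError("split_every must be an integer >= 2")
    return split_every

try:
    drop_duplicates(['A'])  # ['A'] lands in split_every
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'list' and 'int'
```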

We need to make subset a proper keyword in

https://github.com/dask/dask/blob/96bbf636c50a753b76ba4758d03aebb0a42386db/dask/dataframe/core.py#L485

and update lines like

https://github.com/dask/dask/blob/96bbf636c50a753b76ba4758d03aebb0a42386db/dask/dataframe/core.py#L488

to just be

if subset is not None:
    ....
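A sketch of what that change might look like (illustrative only, not the actual dask patch; the validation shown here is simplified):

```python
from numbers import Integral

# With subset as an explicit keyword, a positional list binds to it
# rather than to split_every.
def drop_duplicates(subset=None, split_every=None, split_out=1, **kwargs):
    if subset is not None:
        kwargs['subset'] = subset  # forwarded on to the pandas call
    if split_every is not None and (
        not isinstance(split_every, Integral) or split_every < 2
    ):
        raise ValueError("split_every must be an integer >= 2")
    return kwargs

print(drop_duplicates(['A']))  # {'subset': ['A']} -- no TypeError
```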

The original df is a Dask DataFrame.

This works:

df = df.compute()
columns = ['a', 'b', 'c']
df = df.drop_duplicates(columns)

This does not; is there a better approach?

columns=['a','b','c']
(Pdb) df = df.drop_duplicates(columns)
*** TypeError: unorderable types: list() < int()

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

5 reactions
dsevero commented, Jan 29, 2019

Actually, I think this works fine and the docs are just outdated.

In [11]: import dask.dataframe as dd                
    ...: import pandas as pd
    ...: df = dd.from_pandas(pd.DataFrame({'a': [0, 0, 1], 'b': range(3)}), npartitions=1) 

In [12]: df.drop_duplicates(subset=['a']).compute() 
Out[12]:                     
   a  b                                                                                    
0  0  0                    
2  1  2 
1 reaction
WesRoach commented, Sep 9, 2019

Interested. Reading contribution guidelines (previous contribution was docs).
