
dask dataframe drop_duplicates support?

See original GitHub issue

Edit:

This works:

>>> df = pd.DataFrame({"A": [1, 2] * 4, "B": [1] * 4 + [2] * 4})
>>> a = dd.from_pandas(df, 2)
>>> a.drop_duplicates(subset=['A']).compute()

But this fails when subset is specified positionally.

>>> a.drop_duplicates(['A']).compute()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-ad2309a8f442> in <module>
      3 df = dd.from_pandas(pd.DataFrame({'a': [0, 0, 1], 'b': range(3)}), npartitions=1)
      4
----> 5 df.drop_duplicates(['a']).compute()

~/sandbox/dask/dask/dataframe/core.py in drop_duplicates(self, split_every, split_out, **kwargs)
    506             split_out_setup=split_out_setup,
    507             split_out_setup_kwargs=split_out_setup_kwargs,
--> 508             **kwargs
    509         )
    510

~/sandbox/dask/dask/dataframe/core.py in apply_concat_apply(args, chunk, aggregate, combine, meta, token, chunk_kwargs, aggregate_kwargs, combine_kwargs, split_every, split_out, split_out_setup, split_out_setup_kwargs, **kwargs)
   4597     elif split_every is False:
   4598         split_every = npartitions
-> 4599     elif split_every < 2 or not isinstance(split_every, Integral):
   4600         raise ValueError("split_every must be an integer >= 2")
   4601

TypeError: '<' not supported between instances of 'list' and 'int'
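The cause is visible in the traceback: because subset is not a named parameter, the positional list binds to split_every and then hits the `split_every < 2` comparison. A minimal, hypothetical stand-in for the signature (not the real dask code) reproduces the failure:

```python
from numbers import Integral

# Hypothetical, simplified stand-in for the dask signature at the time:
# the first positional slot is split_every, not subset, so a positional
# list binds to split_every and later reaches the `split_every < 2` check.
def drop_duplicates(split_every=None, split_out=1, **kwargs):
    if split_every is None:
        split_every = 8
    elif split_every is False:
        split_every = 1
    elif split_every < 2 or not isinstance(split_every, Integral):
        raise ValueError("split_every must be an integer >= 2")
    return split_every

try:
    drop_duplicates(['A'])  # ['A'] lands in split_every
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'list' and 'int'
```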

We need to make subset a proper keyword in

https://github.com/dask/dask/blob/96bbf636c50a753b76ba4758d03aebb0a42386db/dask/dataframe/core.py#L485

and update lines like

https://github.com/dask/dask/blob/96bbf636c50a753b76ba4758d03aebb0a42386db/dask/dataframe/core.py#L488

to just be

if subset is not None:
    ....
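A sketch of what that change might look like (illustrative only, not the actual dask patch; the validation shown here is simplified):

```python
from numbers import Integral

# With subset as an explicit keyword, a positional list binds to it
# rather than to split_every.
def drop_duplicates(subset=None, split_every=None, split_out=1, **kwargs):
    if subset is not None:
        kwargs['subset'] = subset  # forwarded on to the pandas call
    if split_every is not None and (
        not isinstance(split_every, Integral) or split_every < 2
    ):
        raise ValueError("split_every must be an integer >= 2")
    return kwargs

print(drop_duplicates(['A']))  # {'subset': ['A']} -- no TypeError
```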

The original df is a Dask DataFrame.

This works:

df = df.compute()
columns = ['a', 'b', 'c']
df = df.drop_duplicates(columns)

This does not; is there a better approach?

columns=['a','b','c']
(Pdb) df = df.drop_duplicates(columns)
*** TypeError: unorderable types: list() < int()

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

5 reactions
dsevero commented, Jan 29, 2019

Actually, I think this works fine and the docs are just outdated.

In [11]: import dask.dataframe as dd                
    ...: import pandas as pd
    ...: df = dd.from_pandas(pd.DataFrame({'a': [0, 0, 1], 'b': range(3)}), npartitions=1) 

In [12]: df.drop_duplicates(subset=['a']).compute() 
Out[12]:                     
   a  b                                                                                    
0  0  0                    
2  1  2 
1 reaction
WesRoach commented, Sep 9, 2019

Interested. Reading contribution guidelines (previous contribution was docs).
