dask dataframe drop_duplicates support?
Edit: this works:
>>> df = pd.DataFrame({"A": [1, 2] * 4, "B": [1] * 4 + [2] * 4})
>>> a = dd.from_pandas(df, 2)
>>> a.drop_duplicates(subset=['A']).compute()
but it fails when subset is specified positionally:
>>> a.drop_duplicates(['A']).compute()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-ad2309a8f442> in <module>
3 df = dd.from_pandas(pd.DataFrame({'a': [0, 0, 1], 'b': range(3)}), npartitions=1)
4
----> 5 df.drop_duplicates(['a']).compute()
~/sandbox/dask/dask/dataframe/core.py in drop_duplicates(self, split_every, split_out, **kwargs)
506 split_out_setup=split_out_setup,
507 split_out_setup_kwargs=split_out_setup_kwargs,
--> 508 **kwargs
509 )
510
~/sandbox/dask/dask/dataframe/core.py in apply_concat_apply(args, chunk, aggregate, combine, meta, token, chunk_kwargs, aggregate_kwargs, combine_kwargs, split_every, split_out, split_out_setup, split_out_setup_kwargs, **kwargs)
4597 elif split_every is False:
4598 split_every = npartitions
-> 4599 elif split_every < 2 or not isinstance(split_every, Integral):
4600 raise ValueError("split_every must be an integer >= 2")
4601
TypeError: '<' not supported between instances of 'list' and 'int'
We need to make subset a proper keyword argument in drop_duplicates, and update the affected checks to just be:

if subset is not None:
    ...
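A minimal, hypothetical sketch of the mechanism and the proposed fix (plain Python stand-ins, not the real Dask source): in the current signature a positional list binds to split_every and trips the `split_every < 2` validation, while an explicit subset parameter would route it correctly:

```python
from numbers import Integral

# Simplified stand-in for the current signature: `subset` is only
# accepted via **kwargs, so a positional list binds to `split_every`.
def drop_duplicates_old(split_every=None, split_out=1, **kwargs):
    if split_every is not None and split_every is not False:
        if split_every < 2 or not isinstance(split_every, Integral):
            raise ValueError("split_every must be an integer >= 2")
    return kwargs.get("subset")

# Proposed shape of the fix: make `subset` an explicit parameter so
# positional use matches the pandas signature.
def drop_duplicates_fixed(subset=None, split_every=None, split_out=1, **kwargs):
    if subset is not None:
        kwargs["subset"] = subset  # forwarded to the per-partition pandas call
    if split_every is not None and split_every is not False:
        if split_every < 2 or not isinstance(split_every, Integral):
            raise ValueError("split_every must be an integer >= 2")
    return kwargs.get("subset")

try:
    drop_duplicates_old(["A"])              # positional subset hits split_every
except TypeError as e:
    print("old signature:", e)              # '<' not supported between list and int

print("fixed signature:", drop_duplicates_fixed(["A"]))  # subset bound correctly
```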
The original df is a Dask dataframe. This works:

df = df.compute()
columns = ['a', 'b', 'c']
df = df.drop_duplicates(columns)

This does not; is there a better approach?
columns=['a','b','c']
(Pdb) df = df.drop_duplicates(columns)
*** TypeError: unorderable types: list() < int()
Issue Analytics
- Created 6 years ago
- Comments: 11 (10 by maintainers)
Top GitHub Comments
Actually, I think this works fine and the docs are just outdated.
Interested. Reading contribution guidelines (previous contribution was docs).