Can't drop duplicates on dask dataframe index

Please note my Stack Overflow question.

I am using dask dataframe with python 2.7 and want to drop duplicated index values from my df.

With pandas I would use

df = df[~df.index.duplicated(keep="first")]

and it works.
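
For reference, here is a minimal sketch of that pandas idiom on a small frame. The data is made up for illustration and is not from the original question.

```python
import pandas as pd

# Made-up data: index values 0 and 1 each appear twice.
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=[0, 0, 1, 1])

# Index.duplicated(keep="first") marks every repeat after the first
# occurrence of an index value as True.
mask = df.index.duplicated(keep="first")

# Negating the mask keeps only the first row for each index value.
deduped = df[~mask]
print(deduped)
```

This works because a pandas Index exposes duplicated directly, which is the method the dask Index in the error below does not provide.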

When trying to do the same with a dask dataframe I get

AttributeError: 'Index' object has no attribute 'duplicated'

I could reset the index and then use the column that was the index to drop duplicates, but I would like to avoid that if possible.

I could use df.compute() and then drop the duplicated index values, but this df is too big for memory.

I tried the following code, as suggested by jezrael on Stack Overflow:

rxTable[~rxTable.index.to_Series().duplicated()]

and got

AttributeError: 'Index' object has no attribute 'to_Series'

It worked a few days ago and just stopped; I can’t find any difference in the code or the data.

How can I drop the duplicated index values from my dataframe using dask dataframe?

Thanks

Top GitHub Comments

jsignell commented, Jun 3, 2020

@3ggaurav you can read about split_every and split_out here:

    split_every : int, optional
        Group partitions into groups of this size while performing a
        tree-reduction. If set to False, no tree-reduction will be used,
        and all intermediates will be concatenated and passed to ``aggregate``.
        Default is 8.
    split_out : int, optional
        Number of output partitions. Split occurs after first chunk reduction.

@Demirrr this is a very old issue. I think you’d be better off setting your index as a column and using drop_duplicates:

import pandas as pd
import dask.dataframe as dd

a = pd.DataFrame({"A": [1, 2, 2, 3, 2, 2]},
                 index=[0, 0, 1, 1, 2, 2])
b = dd.from_pandas(a, npartitions=2)
# move the index into a column named "index", deduplicate on it,
# then restore it as the index
out = b.reset_index().drop_duplicates(["index"]).set_index("index")
out.compute()

I am closing this. But if you are still having issues, please reopen.

TomAugspurger commented, Dec 4, 2017

Reminder, it’s helpful to have reproducible examples 😃

This may be your best shot right now:

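import pandas as pd
import dask.dataframe as dd

a = pd.DataFrame({"A": [1, 2, 3, 4]},
                 index=[0, 0, 1, 1])
b = dd.from_pandas(a, npartitions=2)

# pandas reference: first row for each index value
a.groupby(a.index).first()

# per-partition step: first row of each group within one partition
def chunk(x):
    return x.first()


# combine step: first row of each group across the partition results
def agg(x):
    return x.first()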

A PR implementing first and last on groupby objects would be helpful if you have time.
