can't drop duplicated on dask dataframe index


Please note my Stack Overflow question.

I am using Dask DataFrame with Python 2.7 and want to drop duplicated index values from my df.

With pandas I would use

df = df[~df.index.duplicated(keep = "first")]

And it works
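
For reference, here is a minimal sketch of what that pandas idiom does. The small frame is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Illustrative data only: index values 0 and 1 each appear twice.
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=[0, 0, 1, 1])

# Index.duplicated(keep="first") flags every repeat after the first
# occurrence, so negating the mask keeps one row per index value.
mask = df.index.duplicated(keep="first")
deduped = df[~mask]
print(deduped)
```

The mask is a plain boolean array aligned with the rows, which is why it can be used directly for row selection in pandas.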

When I try the same thing with a Dask DataFrame, I get

AttributeError: 'Index' object has no attribute 'duplicated'

I could reset the index and then use the column that was the index to drop duplicates, but I would like to avoid that if possible.

I could call df.compute() and then drop the duplicated index values, but this df is too big for memory.

I tried the following code, as suggested by jezrael on Stack Overflow: rxTable[~rxTable.index.to_Series().duplicated()] and got

AttributeError: 'Index' object has no attribute 'to_Series'

It worked a few days ago and then just stopped. I can't find any difference in the code or the data.

How can I drop the duplicated index values from my dataframe using Dask DataFrame?

Thanks

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
jsignell commented, Jun 3, 2020

@3ggaurav you can read about split_every and split_out here:

    split_every : int, optional
        Group partitions into groups of this size while performing a
        tree-reduction. If set to False, no tree-reduction will be used,
        and all intermediates will be concatenated and passed to ``aggregate``.
        Default is 8.
    split_out : int, optional
        Number of output partitions. Split occurs after first chunk reduction.

@Demirrr this is a very old issue. I think you'd be better off setting your index as a column and using drop_duplicates:

import pandas as pd
import dask.dataframe as dd

a = pd.DataFrame({"A": [1, 2, 2, 3, 2, 2]},
                 index=[0, 0, 1, 1, 2, 2])
b = dd.from_pandas(a, 2)
out = b.reset_index().drop_duplicates(["index"]).set_index("index")
out.compute()

I am closing this. But if you are still having issues, please reopen.

1 reaction
TomAugspurger commented, Dec 4, 2017

Reminder, it’s helpful to have reproducible examples 😃

This may be your best shot right now:

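import pandas as pd
import dask.dataframe as dd

# Small example with duplicated index values, split into 2 partitions.
a = pd.DataFrame({"A": [1, 2, 3, 4]},
                 index=[0, 0, 1, 1])
b = dd.from_pandas(a, 2)

# pandas equivalent: keep the first row for each index value.
a.groupby(a.index).first()

# Building blocks for a custom groupby reduction: both steps take
# the first value of each group.
def chunk(x):
    return x.first()


def agg(x):
    return x.first()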

A PR implementing first and last on groupby objects would be helpful if you have time.


