can't drop duplicates on dask dataframe index
Please note my Stack Overflow question.
I am using a dask dataframe with Python 2.7 and want to drop duplicated index values from my df.
When using pandas I would use
df = df[~df.index.duplicated(keep="first")]
and it works.
When trying to do the same with a dask dataframe I get
AttributeError: 'Index' object has no attribute 'duplicated'
I could reset the index and then use the column that was the index to drop duplicates, but I would like to avoid that if possible.
I could use df.compute() and then drop the duplicated index values, but this df is too big for memory.
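For reference, a minimal sketch of those two workarounds on a toy dask dataframe (the sample data and variable names are illustrative, not from the original question):

```python
import pandas as pd
import dask.dataframe as dd

# Toy dask dataframe with duplicated index values (illustrative data).
pdf = pd.DataFrame({"value": [1, 2, 3, 4]}, index=[10, 10, 20, 30])
ddf = dd.from_pandas(pdf, npartitions=2)

# Workaround 1: move the index into a column, deduplicate on it, restore it.
# Resetting an unnamed index produces a column literally named "index".
deduped = ddf.reset_index().drop_duplicates(subset="index").set_index("index")
print(deduped.compute())

# Workaround 2: compute to pandas first -- only viable when the data fits in memory.
in_memory = ddf.compute()
print(in_memory[~in_memory.index.duplicated(keep="first")])
```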
I tried using the following code, as suggested by jezrael on Stack Overflow:
rxTable[~rxTable.index.to_Series().duplicated()]
and got
AttributeError: 'Index' object has no attribute 'to_Series'
It worked a few days ago and just stopped; I can't find any difference in the code or data.
How can I drop the duplicated index values from my dataframe using a dask dataframe?
Thanks
Issue Analytics
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 9 (4 by maintainers)
Top GitHub Comments
@3ggaurav you can read about split_every and split_out here:

@Demirrr this is a very old issue. I think you'd be better off setting your index as a column and using drop_duplicates:

I am closing this. But if you are still having issues, please reopen.
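A sketch of what that suggestion might look like, holding the former index as an ordinary column named key and passing the split_every / split_out options mentioned above (the data and parameter values are illustrative):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"key": [1, 1, 2, 2, 3], "value": range(5)})
ddf = dd.from_pandas(pdf, npartitions=3)

# split_every caps how many intermediate partition results are combined at each
# level of the tree reduction; split_out sets how many partitions the final
# deduplicated result is spread across (useful when that result is itself large).
deduped = ddf.drop_duplicates(subset="key", split_every=8, split_out=2)
print(deduped.compute())
```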
Reminder, it’s helpful to have reproducible examples 😃
This may be your best shot right now:
A PR implementing first and last on groupby objects would be helpful if you have time.
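For context, the requested behavior already exists in pandas; a groupby-based dedup there looks like this (illustrative data, and note the caveat about non-null values in the comment):

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, None, 3.0]}, index=[10, 10, 20])

# groupby(level=0).first() keeps one row per index value. Caveat: .first()
# returns the first non-null value per column within each group, so it matches
# df[~df.index.duplicated(keep="first")] exactly only when there are no NaNs.
print(df.groupby(level=0).first())
```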