Repartition randomly dropping some rows

Bug report

When I call set_index(..., sorted=True), I get a warning that partition indices overlap, which is fine. But when I then call repartition with the same divisions, I lose a row: one of the two rows with key 4 disappears. This looks like a bug in the repartition code, since I don't see why repartition should ever drop rows.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> A = pd.DataFrame({'key': [1,2,3,4,4,5,6,7], 'value': list('abcd'*2)})
>>> a = dd.from_pandas(A, npartitions=2)
>>> a = a.set_index('key', sorted=True)
/Users/ctj/Documents/dask/dask/dataframe/shuffle.py:600: UserWarning: Partition indices have overlap.
  warnings.warn("Partition indices have overlap.")
>>> a.compute()
    value
key      
1       a
2       b
3       c
4       d
4       a
5       b
6       c
7       d
>>> a = a.repartition(divisions=a.divisions)
>>> a.compute()
    value
key      
1       a
2       b
3       c
4       a
5       b
6       c
7       d

Top GitHub Comments

mrocklin commented, Jul 1, 2019

Is ddf.divisions necessarily a strictly increasing sequence?

Yes

How would Dask handle when a ton of the rows have the same index?

Poorly

On Mon, Jul 1, 2019 at 3:27 PM Cody Johnson notifications@github.com wrote:

In that case, how would Dask handle when a ton of the rows have the same index? Is ddf.divisions necessarily a strictly increasing sequence? (Except the last one)
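The requirement discussed in this comment can be checked before repartitioning. What follows is a minimal sketch in plain Python, not part of Dask's API: `is_strictly_increasing` is a hypothetical helper, and the `allow_equal_last` flag is an assumption based on the "(Except the last one)" caveat in the quoted question.

```python
def is_strictly_increasing(divisions, allow_equal_last=True):
    """Return True if each division is greater than the one before it.

    Hypothetical helper, not a Dask function. With allow_equal_last=True the
    final division may equal its predecessor (an assumption taken from the
    "(Except the last one)" caveat in the question).
    """
    pairs = list(zip(divisions, divisions[1:]))
    for i, (lo, hi) in enumerate(pairs):
        is_last = i == len(pairs) - 1
        if lo < hi:
            continue
        if lo == hi and is_last and allow_equal_last:
            continue
        return False
    return True


print(is_strictly_increasing((1, 4, 7)))     # True
print(is_strictly_increasing((1, 4, 4, 7)))  # False: 4 repeats in the middle
print(is_strictly_increasing((1, 4, 4)))     # True: only the last pair repeats
```

A check like this only validates the division values themselves. It says nothing about whether the rows inside each partition actually respect those bounds.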

codercody commented, Jul 3, 2019

On second thought, it does cause the strictness of the upper bound to be violated. If you deem that error-worthy then sure.
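One way to read this comment: if each division interval is half-open, [lower, upper), and only the last interval includes its upper bound, then a partition that holds a row equal to its own upper bound already breaks that contract. The sketch below is a simplified plain-Python model of that reading, not Dask's implementation. The partition contents and the divisions (1, 4, 7) are assumptions inferred from the example in the bug report, and `slice_to_divisions` is a hypothetical helper.

```python
def slice_to_divisions(partitions, divisions):
    """Trim each partition to its division interval.

    Simplified model (not Dask code): partition i keeps keys in
    [divisions[i], divisions[i + 1]), except the last partition,
    which also keeps keys equal to the final division.
    """
    last = len(partitions) - 1
    result = []
    for i, rows in enumerate(partitions):
        lower, upper = divisions[i], divisions[i + 1]
        if i == last:
            kept = [(k, v) for k, v in rows if lower <= k <= upper]
        else:
            kept = [(k, v) for k, v in rows if lower <= k < upper]
        result.append(kept)
    return result


# Assumed layout: the 8 rows split evenly, so key 4 appears in both partitions.
partitions = [
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    [(4, "a"), (5, "b"), (6, "c"), (7, "d")],
]
# Assumed divisions: minimum key of each partition, then the overall maximum.
divisions = (1, 4, 7)

trimmed = slice_to_divisions(partitions, divisions)
before = [row for part in partitions for row in part]
after = [row for part in trimmed for row in part]
dropped = [row for row in before if row not in after]

print(len(before), len(after))  # 8 7
print(dropped)                  # [(4, 'd')]
```

In this model the (4, d) row is the one that disappears, which matches the output shown in the bug report.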
