Repartition randomly dropping some rows

Bug report

When I called set_index(..., sorted=True), I got a warning that partition indices overlap, which is fine. When I then called repartition, I lost a row. This must be a bug in the repartition code, since I don't see why repartition would ever drop rows.

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> A = pd.DataFrame({'key': [1,2,3,4,4,5,6,7], 'value': list('abcd'*2)})
>>> a = dd.from_pandas(A, npartitions=2)
>>> a = a.set_index('key', sorted=True)
/Users/ctj/Documents/dask/dask/dataframe/shuffle.py:600: UserWarning: Partition indices have overlap.
  warnings.warn("Partition indices have overlap.")
>>> a.compute()
    value
key      
1       a
2       b
3       c
4       d
4       a
5       b
6       c
7       d
>>> a = a.repartition(divisions=a.divisions)
>>> a.compute()
    value
key      
1       a
2       b
3       c
4       a
5       b
6       c
7       d
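
One way to see where the missing row might come from is to compare each partition's keys against the range that its divisions imply. The sketch below is plain Python, not Dask's code. It assumes that from_pandas split the eight rows into two halves of four, that set_index(..., sorted=True) therefore produced divisions (1, 4, 7), and that every partition except the last covers a half-open range. The helper name find_out_of_range_keys is hypothetical.

```python
def find_out_of_range_keys(partitions, divisions):
    """Return (partition_number, key) pairs that fall outside the partition's range.

    Assumption: partition i covers divisions[i] <= key < divisions[i + 1],
    except the last partition, whose upper bound is inclusive.
    """
    violations = []
    last = len(partitions) - 1
    for i, keys in enumerate(partitions):
        lower, upper = divisions[i], divisions[i + 1]
        for key in keys:
            if i == last:
                in_range = lower <= key <= upper
            else:
                in_range = lower <= key < upper
            if not in_range:
                violations.append((i, key))
    return violations


# Assumed layout of the example above: two partitions of four rows each.
partitions = [[1, 2, 3, 4], [4, 5, 6, 7]]
divisions = (1, 4, 7)

print(find_out_of_range_keys(partitions, divisions))  # [(0, 4)]
```

Under these assumptions, the key 4 at the end of the first partition sits exactly on that partition's upper bound. That is consistent with the row (4, d) being the one that disappears after repartition.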

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Jul 1, 2019

Replying inline to Cody Johnson's questions (Mon, Jul 1, 2019, 3:27 PM):

Is ddf.divisions necessarily a strictly increasing sequence? (Except the last one)

Yes

how would Dask handle when a ton of the rows have the same index?

Poorly
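
The strictly increasing requirement mentioned above can be checked before repartitioning. The following is a small sketch in plain Python with hypothetical helper names. It assumes divisions are built from the minimum key of each sorted partition plus the maximum key of the last one, and it does not model the "except the last one" case raised in the question.

```python
def divisions_from_sorted_partitions(partitions):
    """Build divisions as (min of each partition..., max of the last partition).

    Assumption: each partition is already sorted and non-empty. This is an
    illustration of the idea, not Dask's implementation.
    """
    return tuple(keys[0] for keys in partitions) + (partitions[-1][-1],)


def is_strictly_increasing(divisions):
    """True if every division is strictly greater than the one before it."""
    return all(a < b for a, b in zip(divisions, divisions[1:]))


# Keys from the bug report: the divisions come out strictly increasing.
report = [[1, 2, 3, 4], [4, 5, 6, 7]]
print(divisions_from_sorted_partitions(report))  # (1, 4, 7)
print(is_strictly_increasing(divisions_from_sorted_partitions(report)))  # True

# Many rows share one key, so two partitions start with the same value.
repeated = [[1, 1, 1], [1, 1, 2], [2, 3, 3]]
print(divisions_from_sorted_partitions(repeated))  # (1, 1, 2, 3)
print(is_strictly_increasing(divisions_from_sorted_partitions(repeated)))  # False
```

Note that the divisions from the bug report pass this check even though the key 4 appears in both partitions, so a strictly increasing sequence alone does not rule out overlapping partitions.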

0 reactions
codercody commented, Jul 3, 2019

On second thought, it does cause the strictness of the upper bound to be violated. If you deem that error-worthy then sure.
