Repartition randomly dropping some rows
Bug report
When I called set_index(..., sorted=True), I got a warning saying that partition indices have overlap, which is fine. After then calling repartition, I lost a row. This must be a bug in the repartition code, since I don’t see why repartition would make me lose rows.
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> A = pd.DataFrame({'key': [1,2,3,4,4,5,6,7], 'value': list('abcd'*2)})
>>> a = dd.from_pandas(A, npartitions=2)
>>> a = a.set_index('key', sorted=True)
/Users/ctj/Documents/dask/dask/dataframe/shuffle.py:600: UserWarning: Partition indices have overlap.
  warnings.warn("Partition indices have overlap.")
>>> a.compute()
    value
key
1       a
2       b
3       c
4       d
4       a
5       b
6       c
7       d
>>> a = a.repartition(divisions=a.divisions)
>>> a.compute()
    value
key
1       a
2       b
3       c
4       a
5       b
6       c
7       d
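One possible reading of why the row with key 4 and value d disappears, sketched in plain Python rather than with dask's actual code. It assumes that the divisions here are (1, 4, 7), that every partition except the last covers a half-open key range, and that repartitioning re-slices each existing partition by those bounds. The helpers slice_partition and reslice are hypothetical names for illustration only, not dask functions.

```python
# Plain-Python model of the two partitions from the session above.
# Each partition is a list of (key, value) rows, sorted by key.
partitions = [
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],  # ends with key 4
    [(4, "a"), (5, "b"), (6, "c"), (7, "d")],  # also starts with key 4
]
divisions = (1, 4, 7)  # assumed: first key of each partition, plus the last key


def slice_partition(rows, lower, upper, last):
    """Hypothetical helper: keep rows in [lower, upper), or in
    [lower, upper] when this is the last partition."""
    if last:
        return [row for row in rows if lower <= row[0] <= upper]
    return [row for row in rows if lower <= row[0] < upper]


def reslice(partitions, divisions):
    """Hypothetical model of repartitioning onto the same divisions."""
    last_index = len(partitions) - 1
    return [
        slice_partition(rows, divisions[i], divisions[i + 1], last=(i == last_index))
        for i, rows in enumerate(partitions)
    ]


result = reslice(partitions, divisions)
flat = [row for part in result for row in part]
print(flat)
```

Under these assumptions, the first partition is cut to keys in [1, 4), so its trailing row with key 4 is discarded, while the row with key 4 in the second partition survives. That matches the output shown above, but it is only a model of the behaviour, not a confirmed trace of the repartition code.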
Issue Analytics
- Created 4 years ago
- Comments:11 (11 by maintainers)
Yes
Poorly
On Mon, Jul 1, 2019 at 3:27 PM Cody Johnson notifications@github.com wrote:
On second thought, it does cause the strictness of the upper bound to be violated. If you deem that error-worthy then sure.
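To make the "strictness of the upper bound" point concrete, here is a minimal plain-Python sketch of the kind of check that could flag it. It assumes that every partition except the last must hold only keys strictly below the next division. The function find_overlaps is a hypothetical name, not part of dask.

```python
def find_overlaps(partition_keys, divisions):
    """Return (partition_number, key) pairs where a key in a non-final
    partition is not strictly below that partition's upper division."""
    violations = []
    last_index = len(partition_keys) - 1
    for i, keys in enumerate(partition_keys):
        if i == last_index:
            continue  # the final partition's upper bound is taken as inclusive
        upper = divisions[i + 1]
        violations.extend((i, key) for key in keys if key >= upper)
    return violations


# Keys per partition as in the example from this report.
print(find_overlaps([[1, 2, 3, 4], [4, 5, 6, 7]], (1, 4, 7)))
```

For the data in this report the check flags key 4 in partition 0. If both rows with key 4 sat in the second partition, it would report nothing.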