Standardizing type for `divisions`


Looking through Dask’s codebase, there doesn’t seem to be a consistent type for a Dask object’s divisions: in some places (like set_sorted_index) we return an object whose divisions is a tuple, while in others (such as set_partition) we return an object whose divisions is a list. This becomes an issue when we compare divisions between different objects, because a list and a tuple never compare equal in Python, so two objects’ divisions can contain identical elements and still not be seen as equal.
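To make the comparison problem concrete, here is a minimal sketch in plain Python, with no Dask involved. The division values are made up for illustration; the only point is how `==` behaves across container types.

```python
# Two divisions values with identical elements but different container types.
list_divisions = [0, 10, 20, 30, 39]
tuple_divisions = (0, 10, 20, 30, 39)

# A list and a tuple never compare equal, whatever they contain.
same_as_is = list_divisions == tuple_divisions

# Converting both sides to one type compares the elements only.
same_after_normalizing = tuple(list_divisions) == tuple(tuple_divisions)

print(same_as_is)              # False
print(same_after_normalizing)  # True
```

Normalizing both sides at every comparison site would work as a stopgap, but it has to be remembered everywhere divisions are compared.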

Some questions that come to mind:

  • Is there an ideal type for divisions? I would assume tuples since divisions is generally treated as immutable even in the list case, but list functionality is used in several places in the codebase to assemble divisions.
  • If there is an ideal type for divisions, how can we enforce it? One reason this problem seems to exist is that in most places list and tuple divisions behave exactly the same; it is typically only when they are compared that issues arise. One potential solution would be to make divisions a property with a setter method that either:
    • Implicitly sets the input value to whatever type we desire divisions to be
    • Raises an error if the input value is not the proper divisions type
  • If there is no ideal type for divisions, is there a workaround for comparisons?
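A rough sketch of the property-with-setter idea from the list above, assuming tuple is the desired type. The `Frame` class and its `strict` flag are hypothetical stand-ins invented for this example; this is not how Dask implements divisions.

```python
class Frame:
    """Hypothetical stand-in for a Dask collection; not Dask's actual class."""

    def __init__(self, divisions, strict=False):
        self._strict = strict
        self.divisions = divisions  # goes through the setter below

    @property
    def divisions(self):
        return self._divisions

    @divisions.setter
    def divisions(self, value):
        if self._strict and not isinstance(value, tuple):
            # Second option: reject anything that is not already a tuple.
            raise TypeError(
                f"divisions must be a tuple, got {type(value).__name__}"
            )
        # First option: implicitly coerce any iterable to a tuple.
        self._divisions = tuple(value)


a = Frame([None] * 5)
b = Frame((None,) * 5)
print(a.divisions == b.divisions)  # True
```

With coercion, existing code that assembles divisions as a list keeps working, while the strict variant would surface every such call site as an error.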

cc @jsignell as I notice you are doing some work on divisions in #8379

EDIT:

To give additional context, I encountered this issue while debugging some breakage in dask-sql:

In some cases, when performing JOIN operations, dask-sql implicitly calls single_partition_join through dd.merge. Recently, #8341 did some refactoring to this function which, among other things, changed the divisions of the merged result from a tuple to a list (I don’t think this was an intention of the PR, just a side effect).

This causes breakage later on in dask-sql if we attempt to subscript the result of this merge operation with a Series (something like df[df[col].where(...)]) whose divisions contain the same elements but are stored as a tuple, because DataFrame.__getitem__ does a divisions check to decide whether or not to call _maybe_align_partitions:

https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/core.py#L4108-L4113

And _maybe_align_partitions does a divisions equality check to decide whether or not to actually align_partitions (which fails if not all divisions are known):

https://github.com/dask/dask/blob/f5881891505b9a2ba2da195befb11ad7b4c7bb23/dask/dataframe/multi.py#L166-L169
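The chain of checks described above can be modeled in a few lines. This is a simplified sketch written from the description in this issue, not a copy of the linked Dask source: `FakeFrame`, `getitem`, and the function bodies are assumptions made for illustration, and the error message is invented.

```python
class FakeFrame:
    """Hypothetical stand-in that only carries a divisions attribute."""

    def __init__(self, divisions):
        self.divisions = divisions


def align_partitions(*frames):
    # Modeled on the description above: alignment needs known divisions.
    if any(None in f.divisions for f in frames):
        raise ValueError("not all divisions are known, cannot align")
    return frames


def maybe_align_partitions(frames):
    first = frames[0].divisions
    # Equality check: a list never equals a tuple, so this can fail even
    # when the elements are identical.
    if not all(f.divisions == first for f in frames):
        return list(align_partitions(*frames))
    return frames


def getitem(frame, key):
    # Divisions check that decides whether to attempt alignment at all.
    if frame.divisions != key.divisions:
        frame, key = maybe_align_partitions([frame, key])
    return frame, key


ddf = FakeFrame([None] * 5)    # list divisions
cond = FakeFrame((None,) * 5)  # tuple divisions, same elements
try:
    getitem(ddf, cond)
    outcome = "ok"
except ValueError as exc:
    outcome = f"ValueError: {exc}"
print(outcome)
```

In this model, giving both objects tuple divisions makes the first check pass, so alignment is never attempted and nothing is raised.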

Here’s a minimal reproducer of that particular issue:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"a": list(range(40))})
ddf = dd.from_pandas(df, npartitions=4)

cond = ddf.a > 20

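# set divisions that are all unknown but differ in type (list vs. tuple)
ddf.divisions = [None] * 5
cond.divisions = (None,) * 5

# fails: the divisions compare unequal, so alignment is attempted on unknown divisions
ddf[cond]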

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

ian-r-rose commented, Nov 17, 2021 (2 reactions)

It might also be useful to link this up with the work being done in #8295: a lot of this should be catchable statically.

gjoseph92 commented, Nov 17, 2021 (1 reaction)

“make divisions a property with a setter method that implicitly sets the input value to whatever type we desire divisions to be”

+1. For how critical it is, I’ve always been surprised that it gets no validation. This would probably fix most of the problems?

It could also be reasonable for assert_eq and assert_divisions to check the type of divisions. This would have caught https://github.com/dask/dask/pull/8389, for example.
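As a sketch of what such a type check might look like in a test helper: `assert_divisions_type` and `Result` are hypothetical names invented here, and this is not the code of Dask's assert_eq or assert_divisions. It also assumes tuple is the type that ends up being chosen.

```python
def assert_divisions_type(obj, expected_type=tuple):
    """Hypothetical helper; not part of Dask's testing utilities."""
    divisions = obj.divisions
    assert isinstance(divisions, expected_type), (
        f"divisions should be {expected_type.__name__}, "
        f"got {type(divisions).__name__}"
    )


class Result:
    """Stand-in for whatever object a test produces."""

    def __init__(self, divisions):
        self.divisions = divisions


# Passes silently for tuple divisions.
assert_divisions_type(Result((0, 5, 9)))

# Fails for list divisions, even though the elements are the same.
try:
    assert_divisions_type(Result([0, 5, 9]))
    caught = None
except AssertionError as exc:
    caught = str(exc)
print(caught)  # divisions should be tuple, got list
```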

