"ValueError: Not all divisions are known, can't align partitions" when performing math on dataframe column

See original GitHub issue

I am trying to perform the following on a dask dataframe:

newcolumn = np.log(df[column_name])
return df.assign(newcolumn_name=newcolumn)

I expected it to work according to the example given in the documentation. However, I’m getting the following error:

File ".../dask/dataframe/core.py", line 2321, in assign
    return elemwise(methods.assign, self, *pairs, meta=df2)
  File ".../dask/dataframe/core.py", line 2796, in elemwise
    args = _maybe_align_partitions(args)
  File ".../dask/dataframe/multi.py", line 147, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
  File ".../dask/dataframe/multi.py", line 103, in align_partitions
    raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

I’m using dask 0.15.2.

I do have a workaround using df.assign(newcolumn_name=lambda...), but in that case the lambda function has to operate on the whole record rather than on a specific set of columns.

PS. In fact I’d like to do df[newcolumn_name] = np.log(df[column_name]) directly - is it supposed to work?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
jcrist commented, Oct 8, 2018

You can’t easily, and you possibly shouldn’t.

dask.dataframe partitions data along the index. Information about the partition boundaries (the divisions) is kept, but not the number of elements in each partition. This is useful for many operations, but it makes it hard to assign a column from an unindexed object such as a NumPy array, because we don’t know where to split your array y_pred. You can do the assignment if you determine the index values in both df and y_pred (following the error message’s advice to use set_index), but this comes with some performance cost.

I recommend trying to rethink your problem to avoid this entirely. If you can’t, you can call set_index, but it has a cost:

In [51]: df = dd.from_pandas(pd.DataFrame({'a': range(100)}), npartitions=6, sort=False)

In [52]: df.known_divisions
Out[52]: False

In [53]: df2 = df.reset_index().set_index('index')  # Determine divisions. Here we re-index the existing indices

In [54]: df2.known_divisions
Out[54]: True

In [55]: y_pred = np.arange(100)

In [56]: df3 = df2.assign(e=dd.from_array(y_pred))

In [57]: df3.head()
Out[57]:
       a  e
index
0      0  0
1      1  1
2      2  2
3      3  3
4      4  4

0 reactions
TomAugspurger commented, Nov 20, 2018

Dask-ML has some utilities for cross validation: http://ml.dask.org/modules/api.html#module-dask_ml.model_selection

If you need actual permutation (with no repeated / omitted rows), then yes that’s going to be expensive.
