"ValueError: Not all divisions are known, can't align partitions" when performing math on dataframe column

I am trying to perform the following on a dask dataframe:

newcolumn = np.log(df[column_name])
return df.assign(newcolumn_name=newcolumn)

I expected this to work, as in the example given in the documentation. However, I’m getting the following error:

File ".../dask/dataframe/core.py", line 2321, in assign
    return elemwise(methods.assign, self, *pairs, meta=df2)
  File ".../dask/dataframe/core.py", line 2796, in elemwise
    args = _maybe_align_partitions(args)
  File ".../dask/dataframe/multi.py", line 147, in _maybe_align_partitions
    dfs2 = iter(align_partitions(*dfs)[0])
  File ".../dask/dataframe/multi.py", line 103, in align_partitions
    raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

I’m using dask 0.15.2.

I do have a workaround: using df.assign(newcolumn_name=lambda...). However, the lambda function in this case has to operate on the whole record rather than on a specific set of columns.

PS. In fact, I’d like to do df[newcolumn_name] = np.log(df[column_name]) directly. Is it supposed to work?

Issue Analytics

Created: 6 years ago
Reactions: 1
Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions

jcrist commented, Oct 8, 2018

You can’t easily, and you possibly shouldn’t.

dask.dataframe partitions data along the index. Information about the partition boundaries is kept, but not the number of elements in each partition. This is useful for many operations, but makes it hard to assign a column from a non-indexed type, as we don’t know where to split your array y_pred. You can do the assignment if you determine the indices in both df and y_pred (following the error message’s advice to use set_index), but this comes with some performance cost.
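To make the point about partition boundaries concrete, here is a toy sketch in plain Python. It is not dask's implementation; the divisions tuple and the helpers partition_for_label and split_by_lengths are hypothetical names invented for illustration. It assumes the usual convention that a frame with N partitions has N + 1 division values, with the last one inclusive.

```python
from bisect import bisect_right

# Toy model (NOT dask's code): a partitioned frame remembers only the index
# boundaries of its partitions, not how many rows each partition holds.
# Hypothetical divisions for 3 partitions: [0, 40), [40, 80), [80, 99].
divisions = (0, 40, 80, 99)


def partition_for_label(divisions, label):
    """Find which partition holds an index label, using known divisions."""
    if not divisions[0] <= label <= divisions[-1]:
        raise KeyError(label)
    # The final division is inclusive, so clamp to the last partition.
    return min(bisect_right(divisions, label) - 1, len(divisions) - 2)


def split_by_lengths(values, lengths):
    """Split a positional (non-indexed) array into per-partition chunks.

    This needs the row count of every partition, which the divisions
    alone do not provide.
    """
    if sum(lengths) != len(values):
        raise ValueError("lengths do not add up to len(values)")
    chunks, start = [], 0
    for n in lengths:
        chunks.append(values[start:start + n])
        start += n
    return chunks


# Labels can be routed to a partition from the divisions alone.
print(partition_for_label(divisions, 45))

# A plain array can only be routed once the partition lengths are known.
print(split_by_lengths(list(range(10)), [3, 3, 4]))
```

With known divisions, a labelled value can be sent to the right partition without touching the data. A plain array such as y_pred carries no labels, so splitting it needs the per-partition row counts, and those are not tracked.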

I recommend trying to rethink your problem to avoid this entirely. If you can’t, you can call set_index, but it has a cost:

In [51]: df = dd.from_pandas(pd.DataFrame({'a': range(100)}), npartitions=6, sort=False)

In [52]: df.known_divisions
Out[52]: False

In [53]: df2 = df.reset_index().set_index('index')  # Determine divisions. Here we re-index the existing indices

In [54]: df2.known_divisions
Out[54]: True

In [55]: y_pred = np.arange(100)

In [56]: df3 = df2.assign(e=dd.from_array(y_pred))

In [57]: df3.head()
Out[57]:
       a  e
index
0      0  0
1      1  1
2      2  2
3      3  3
4      4  4

0 reactions

TomAugspurger commented, Nov 20, 2018

Dask-ML has some utilities for cross validation: http://ml.dask.org/modules/api.html#module-dask_ml.model_selection

If you need an actual permutation (with no repeated or omitted rows), then yes, that’s going to be expensive.
