"ValueError: Not all divisions are known, can't align partitions" when performing math on dataframe column
I am trying to perform the following on a dask dataframe:
newcolumn = np.log(df[column_name])
return df.assign(newcolumn_name=newcolumn)
I expected it to work according to the example given in the documentation. However, I’m getting the following error:
File ".../dask/dataframe/core.py", line 2321, in assign
return elemwise(methods.assign, self, *pairs, meta=df2)
File ".../dask/dataframe/core.py", line 2796, in elemwise
args = _maybe_align_partitions(args)
File ".../dask/dataframe/multi.py", line 147, in _maybe_align_partitions
dfs2 = iter(align_partitions(*dfs)[0])
File ".../dask/dataframe/multi.py", line 103, in align_partitions
raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I’m using dask 0.15.2.
I do have a workaround for this: using df.assign(newcolumn_name=lambda...),
but the lambda function in this case has to operate on the whole record rather than on a specific set of columns.
PS. In fact I’d like to do df[newcolumn_name] = np.log(df[column_name])
directly - is it supposed to work?
Issue Analytics
- State:
- Created 6 years ago
- Reactions:1
- Comments:10 (5 by maintainers)
Top GitHub Comments
You can’t easily, and you possibly shouldn’t.
dask.dataframe partitions data along the index. Information about the partition boundaries is kept, but not how many elements are in each partition. This is useful for many operations, but makes it hard to assign a column from a non-indexed type, as we don’t know where to split your array y_pred. You can do the assignment if you determine the indices in both df and y_pred (following the error message advice to use set_index), but this comes with some performance cost.

I recommend trying to rethink your problem to avoid this entirely. If you can’t, you can call set_index, but it has a cost.

Dask-ML has some utilities for cross validation: http://ml.dask.org/modules/api.html#module-dask_ml.model_selection

If you need actual permutation (with no repeated / omitted rows), then yes that’s going to be expensive.
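To illustrate the point about partition boundaries, here is a toy model in plain Python. It is not dask's implementation; the names divisions, partition_for and route_by_index are hypothetical. It shows why a value that carries an index can be sent to the right partition, while a bare array cannot be split without knowing how many rows each partition holds.

```python
from bisect import bisect_right

# Toy model: sorted index boundaries for three partitions,
# [0, 10), [10, 20) and [20, 30] (the last boundary is inclusive).
divisions = (0, 10, 20, 30)


def partition_for(index_value, divisions):
    """Return the number of the partition that holds index_value."""
    if not divisions[0] <= index_value <= divisions[-1]:
        raise KeyError(index_value)
    # Clamp so that the final, inclusive boundary maps to the last partition.
    return min(bisect_right(divisions, index_value) - 1, len(divisions) - 2)


def route_by_index(indexed_values, divisions):
    """Group values by partition number using only their index labels."""
    routed = {}
    for index_value, value in indexed_values.items():
        routed.setdefault(partition_for(index_value, divisions), []).append(value)
    return routed


# Values that come with an index can be placed without knowing row counts.
routed = route_by_index({3: "a", 12: "b", 30: "c"}, divisions)

# A plain list such as ["a", "b", "c"] has no index labels, and the
# boundaries alone do not say how many rows each partition holds, so in
# this model there is no way to decide where to cut it.
```

When the boundaries themselves are unknown, even the indexed routing above has nothing to work with, which is the situation the error message describes.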