
create a new column on existing dataframe

See original GitHub issue

I wonder if one can create a new column on an existing dask dataframe.

In [1]: %paste
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: t0=pd.DataFrame({'v0':[i for i in range(1000000)]})

In [4]: t0['v1']=[i for i in range(1000000)]

In [5]: t0.head()

## -- End pasted text --
Out[1]:
   v0  v1
0   0   0
1   1   1
2   2   2
3   3   3
4   4   4

In [2]: t0_d = dd.from_pandas(t0, npartitions=10)

In [3]: t0_d['v2']=[i for i in range(1000000)]

In [4]: t0_d.head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-4b106fba7c25> in <module>()
----> 1 t0_d.head()

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
    380
    381         if compute:
--> 382             result = result.compute()
    383         return result
    384

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     84             Extra keywords to forward to the scheduler ``get`` function.
     85         """
---> 86         return compute(self, **kwargs)[0]
     87
     88     @classmethod

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    177         dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180
    181     results_iter = iter(results)

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58
     59     return results

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    482                 _execute_task(task, data)  # Re-execute locally
    483             else:
--> 484                 raise(remote_exception(res, tb))
    485         state['cache'][key] = res
    486         finish_task(dsk, key, state, results, keyorder.get)

ValueError: Length of values does not match length of index

Traceback
---------
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 47, in apply
    return func(*args, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/array/core.py", line 2087, in partial_by_order
    return function(*args2, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 2047, in _assign
    return df.assign(**kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2527, in assign
    data[k] = v
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2357, in __setitem__
    self._set_item(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2423, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2578, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2770, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
sheetalanns-blume commented, May 5, 2021

Convert the list to a pd.Series, and then assign the series. The code below should work:

t0_d['v2'] = pd.Series([i for i in range(1000000)])
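If the plain pandas Series assignment doesn't align for you, a minimal alternative sketch (assuming the dataframe still carries the default RangeIndex it got from from_pandas) is to partition the new column the same way before assigning it:

import pandas as pd
import dask.dataframe as dd

t0_d = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)

# Partition the new column like the frame itself, then assign it.
# Dask aligns the two by index, so the divisions must match.
v2 = dd.from_pandas(pd.Series(range(1000000)), npartitions=10)
t0_d['v2'] = v2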

1 reaction
jcrist commented, Jul 29, 2016

stack_of_files_dd['global_id'] = [i for i in range(stack_of_files_dd.divisions+1)]

divisions is a tuple, so that won’t actually work. If you mean npartitions, then the added column won’t have the proper length (it’d be as long as the number of partitions, not the length of the data). Since there’s no way to know the total length of a dataframe (and dask.dataframe is intended for data that you wouldn’t want to all have in memory on a single computer anyway) I don’t think we should support assignment of an in-memory column.
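To make the mismatch concrete, a minimal sketch (recreating the setup from the issue): divisions holds the npartitions + 1 boundary values of the index, nowhere near the number of rows.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)
print(ddf.npartitions)     # 10
print(len(ddf.divisions))  # 11 -- index boundary values, not row counts
# ddf.divisions + 1 raises TypeError, since divisions is a tuple, not a count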

I assume what you want is a different id for data in each file in the glob. One way to do this would be to assign an id by broadcasting a scalar (which is supported in both dask and pandas):

df['global_id'] = 12345

So your code would then be:

from glob import glob

import dask.dataframe as dd

frames = []
for i, file in enumerate(glob('stack_of_files_*.bcolz')):
    # each file becomes its own dask dataframe, tagged with its position in the glob
    df = dd.from_bcolz(file, chunksize=1000000, lock=False)
    df['global_id'] = i  # scalar broadcast: one id for every row from this file
    frames.append(df)

df = dd.concat(frames)

This is currently supported, and should work fine.
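As a quick usage check (hypothetical, not from the original thread), counting rows per id on the concatenated frame confirms that each file's rows received their own id:

# rows per global_id; one entry per source file
print(df['global_id'].value_counts().compute())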

P.S. If a single new column does not fit in-memory, could we append it (to existing dask dataframe) chunk by chunk from disk or just throw a good error message?

If you have a single new column that doesn’t fit in memory, create it as a dask.Series, and assign it the normal way (df['col'] = some_dask_series).
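As a minimal sketch of that last point, assuming the column can be produced lazily (here from a dask array via da.arange, a stand-in for whatever on-disk source you actually have):

import dask.array as da
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)

# the new column is never fully materialized; chunks are computed lazily
col = dd.from_dask_array(da.arange(1000000, chunks=100000))
df['col'] = col  # alignment is by index, so the divisions need to line up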
