Create a new column on an existing dataframe

I wonder if one could create a new column on an existing dask dataframe:
```python
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: t0 = pd.DataFrame({'v0': [i for i in range(1000000)]})

In [4]: t0['v1'] = [i for i in range(1000000)]

In [5]: t0.head()
Out[5]:
   v0  v1
0   0   0
1   1   1
2   2   2
3   3   3
4   4   4

In [6]: t0_d = dd.from_pandas(t0, npartitions=10)

In [7]: t0_d['v2'] = [i for i in range(1000000)]

In [8]: t0_d.head()
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-4b106fba7c25> in <module>()
----> 1 t0_d.head()

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
    380
    381         if compute:
--> 382             result = result.compute()
    383         return result
    384

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     84             Extra keywords to forward to the scheduler ``get`` function.
     85         """
---> 86         return compute(self, **kwargs)[0]
     87
     88     @classmethod

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    177     dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180
    181     results_iter = iter(results)

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58
     59     return results

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    482                     _execute_task(task, data)  # Re-execute locally
    483                 else:
--> 484                     raise(remote_exception(res, tb))
    485                 state['cache'][key] = res
    486                 finish_task(dsk, key, state, results, keyorder.get)

ValueError: Length of values does not match length of index

Traceback
---------
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 47, in apply
    return func(*args, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/array/core.py", line 2087, in partial_by_order
    return function(*args2, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 2047, in _assign
    return df.assign(**kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2527, in assign
    data[k] = v
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2357, in __setitem__
    self._set_item(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2423, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2578, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2770, in _sanitize_index
    raise ValueError('Length of values does not match length of index')
```
Issue analytics: created 7 years ago; 12 comments (5 by maintainers).
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Convert the list to a `pd.Series`, then assign the Series; pandas aligns a Series by index rather than requiring an exact length match. The code below should work:

```python
t0_d['v2'] = pd.Series([i for i in range(1000000)])
```
`divisions` is a tuple, so that won't actually work. If you mean `npartitions`, then the added column won't have the proper length (it'd be as long as the number of partitions, not the length of the data). Since there's no way to know the total length of a dataframe up front (and dask.dataframe is intended for data that you wouldn't want to hold entirely in memory on a single machine anyway), I don't think we should support assignment of an in-memory column.

I assume what you want is a different id for the data in each file in the glob. One way to do that is to assign the id by broadcasting a scalar, which is supported in both dask and pandas. This is currently supported and should work fine.
If you have a single new column that doesn't fit in memory, create it as a dask `Series` and assign it the normal way (`df['col'] = some_dask_series`).