
create a new column on existing dataframe

See original GitHub issue

I wonder if one can create a new column on an existing dask dataframe.

In [1]: %paste
In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: t0=pd.DataFrame({'v0':[i for i in range(1000000)]})

In [4]: t0['v1']=[i for i in range(1000000)]

In [5]: t0.head()

## -- End pasted text --
Out[1]:
   v0  v1
0   0   0
1   1   1
2   2   2
3   3   3
4   4   4

In [2]: t0_d = dd.from_pandas(t0, npartitions=10)

In [3]: t0_d['v2']=[i for i in range(1000000)]

In [4]: t0_d.head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-4b106fba7c25> in <module>()
----> 1 t0_d.head()

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py in head(self, n, compute)
    380
    381         if compute:
--> 382             result = result.compute()
    383         return result
    384

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
     84             Extra keywords to forward to the scheduler ``get`` function.
     85         """
---> 86         return compute(self, **kwargs)[0]
     87
     88     @classmethod

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
    177         dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180
    181     results_iter = iter(results)

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/threaded.py in get(dsk, result, cache, num_workers, **kwargs)
     55     results = get_async(pool.apply_async, len(pool._pool), dsk, result,
     56                         cache=cache, queue=queue, get_id=_thread_get_id,
---> 57                         **kwargs)
     58
     59     return results

/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    482                 _execute_task(task, data)  # Re-execute locally
    483             else:
--> 484                 raise(remote_exception(res, tb))
    485         state['cache'][key] = res
    486         finish_task(dsk, key, state, results, keyorder.get)

ValueError: Length of values does not match length of index

Traceback
---------
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/compatibility.py", line 47, in apply
    return func(*args, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/array/core.py", line 2087, in partial_by_order
    return function(*args2, **kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/dask/dataframe/core.py", line 2047, in _assign
    return df.assign(**kwargs)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2527, in assign
    data[k] = v
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2357, in __setitem__
    self._set_item(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2423, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 2578, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/SECDEV.LOCAL/akravchenko/anaconda3/lib/python3.5/site-packages/pandas/core/series.py", line 2770, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
sheetalanns-blume commented, May 5, 2021

Convert the list to a pd.Series, and then assign the series. The code below should work:

t0_d['v2'] = pd.Series([i for i in range(1000000)])
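If the plain pandas Series assignment doesn't align for you, a minimal alternative sketch (assuming the dataframe still carries the default RangeIndex it got from from_pandas) is to partition the new column the same way before assigning it:

import pandas as pd
import dask.dataframe as dd

t0_d = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)

# Partition the new column like the frame itself, then assign it.
# Dask aligns the two by index, so the divisions must match.
v2 = dd.from_pandas(pd.Series(range(1000000)), npartitions=10)
t0_d['v2'] = v2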

1 reaction
jcrist commented, Jul 29, 2016

stack_of_files_dd['global_id'] = [i for i in range(stack_of_files_dd.divisions+1)]

divisions is a tuple, so that won’t actually work. If you mean npartitions, then the added column won’t have the proper length (it’d be as long as the number of partitions, not the length of the data). Since there’s no way to know the total length of a dataframe (and dask.dataframe is intended for data that you wouldn’t want to all have in memory on a single computer anyway) I don’t think we should support assignment of an in-memory column.
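To make the mismatch concrete, a minimal sketch (recreating the setup from the issue): divisions holds the npartitions + 1 boundary values of the index, nowhere near the number of rows.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)
print(ddf.npartitions)     # 10
print(len(ddf.divisions))  # 11 -- index boundary values, not row counts
# ddf.divisions + 1 raises TypeError, since divisions is a tuple, not a count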

I assume what you want is a different id for data in each file in the glob. One way to do this would be to assign an id by broadcasting a scalar (which is supported in both dask and pandas):

df['global_id'] = 12345

So your code would then be:

from glob import glob

import dask.dataframe as dd

frames = []
for i, file in enumerate(glob('stack_of_files_*.bcolz')):
    # each file becomes its own dask dataframe, tagged with its position in the glob
    df = dd.from_bcolz(file, chunksize=1000000, lock=False)
    df['global_id'] = i  # scalar broadcast: one id for every row from this file
    frames.append(df)

df = dd.concat(frames)

This is currently supported, and should work fine.
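As a quick usage check (hypothetical, not from the original thread), counting rows per id on the concatenated frame confirms that each file's rows received their own id:

# rows per global_id; one entry per source file
print(df['global_id'].value_counts().compute())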

P.S. If a single new column does not fit in-memory, could we append it (to existing dask dataframe) chunk by chunk from disk or just throw a good error message?

If you have a single new column that doesn’t fit in memory, create it as a dask.Series, and assign it the normal way (df['col'] = some_dask_series).
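As a minimal sketch of that last point, assuming the column can be produced lazily (here from a dask array via da.arange, a stand-in for whatever on-disk source you actually have):

import dask.array as da
import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({'v0': range(1000000)}), npartitions=10)

# the new column is never fully materialized; chunks are computed lazily
col = dd.from_dask_array(da.arange(1000000, chunks=100000))
df['col'] = col  # alignment is by index, so the divisions need to line up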
