Low performance when assigning to multiple columns

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((200000, 9)), columns=list('hijabcfde'))
# mix datatypes
df['z'] = 'u'
data = np.random.random((200000, 3))
# explicit loop is fast
%timeit for i, k in enumerate('hjf'): df[k] = data[:, i]  # 2.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setting through a list of columns is slow

%timeit df[list('hjf')] = data # 144 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[:, list('hjf')] = data # 137 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.iloc[:, df.columns.get_indexer_for(list('hjf'))] = data # 141 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that without the df['z'] = 'u' line, all the assignments are similarly fast, around 5 ms.
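For readers without IPython, the comparison above can be approximated with the standard `timeit` module. This is only a sketch: the helper names (`make_frame`, `assign_loop`, `assign_list`, `time_assignments`) are made up for illustration, the row count is reduced to keep it quick, and the timings will depend heavily on the pandas version (the report was made against pandas 0.23.3).

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=20_000, mixed=True):
    """Build a frame like the one in the report; `mixed` adds the string column."""
    df = pd.DataFrame(np.zeros((n_rows, 9)), columns=list("hijabcfde"))
    if mixed:
        df["z"] = "u"  # mixes dtypes (float columns plus one object column)
    return df


def assign_loop(df, cols, data):
    """Assign one existing column at a time (the fast path in the report)."""
    for i, k in enumerate(cols):
        df[k] = data[:, i]


def assign_list(df, cols, data):
    """Assign all columns at once through a list key (the slow path in the report)."""
    df[list(cols)] = data


def time_assignments(n_rows=20_000, mixed=True, number=5):
    """Return the mean seconds per call for both assignment styles."""
    data = np.random.random((n_rows, 3))
    results = {}
    for name, func in [("loop", assign_loop), ("list", assign_list)]:
        df = make_frame(n_rows, mixed)
        total = timeit.timeit(lambda: func(df, "hjf", data), number=number)
        results[name] = total / number
    return results


if __name__ == "__main__":
    for mixed in (True, False):
        print("mixed dtypes:", mixed, time_assignments(mixed=mixed))
```

Running it with `mixed=True` and `mixed=False` separates the effect of the extra string column from the effect of the assignment style.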

Problem description

This difference in performance is strange and doesn’t seem justified. Running df[list('hjf')] = data through snakeviz gives the following output:

[Image: snakeviz output]

Most of the time is spent in the _sanitize_columns method.
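Since the snakeviz image is not reproduced here, a similar breakdown can be obtained as text with the standard library's `cProfile` and `pstats`. The sketch below assumes the same frame as in the code sample; the function name `profile_list_assignment` is hypothetical, and which pandas internals show up at the top depends on the installed version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_list_assignment(n_rows=20_000, top=10):
    """Profile `df[list('hjf')] = data` and return the report as a string."""
    df = pd.DataFrame(np.zeros((n_rows, 9)), columns=list("hijabcfde"))
    df["z"] = "u"  # mixed dtypes, as in the report
    data = np.random.random((n_rows, 3))

    profiler = cProfile.Profile()
    profiler.enable()
    df[list("hjf")] = data  # the statement under investigation
    profiler.disable()

    # Sort by cumulative time so the most expensive call chains come first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_list_assignment())
```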

Expected Output

The assignment should not take much longer with mixed datatypes (at least if the columns assigned to are of homogeneous dtype).

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.23.3
pytest: 3.2.2
pip: 18.0
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: 0.10.8
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Comments: 11 (11 by maintainers)

Top GitHub Comments

jreback commented, Jul 30, 2018

This fixes it, though it causes a couple of test failures, which look like easy adjustments at first glance. I don’t really remember the rationale for this; it was an edge case, though.

diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
index 13c019dea..7bf1b07d8 100755
--- a/pandas/core/indexing.py
+++ b/pandas/core/indexing.py
@@ -590,7 +590,7 @@ class _NDFrameIndexer(_NDFrameIndexerBase):
 
                     # note that this coerces the dtype if we are mixed
                     # GH 7551
-                    value = np.array(value, dtype=object)
+                    value = np.asarray(value)
                     if len(labels) != value.shape[1]:
                         raise ValueError('Must have equal len keys and value '
                                          'when setting with an ndarray')
@@ -598,7 +598,7 @@ class _NDFrameIndexer(_NDFrameIndexerBase):
                     for i, item in enumerate(labels):
 
                         # setting with a list, recoerces
-                        setter(item, value[:, i].tolist())
+                        setter(item, value[:, i])
 
                 # we have an equal len list/ndarray
                 elif can_do_equal_len():
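To make the effect of the two changed lines concrete, here is a small NumPy-only sketch. It does not call any pandas internals; it only contrasts what the removed lines (`np.array(value, dtype=object)` followed by `.tolist()`) and the added lines (`np.asarray(value)` followed by plain slicing) produce for a float array. The variable names are illustrative.

```python
import numpy as np

data = np.random.random((1000, 3))

# Removed lines: copy into an object array, then turn each column into a
# Python list, so every element becomes a separate Python float.
as_object = np.array(data, dtype=object)
column_as_list = as_object[:, 0].tolist()

# Added lines: np.asarray returns the input unchanged when it is already an
# ndarray, and slicing a column gives a view on the same memory.
as_is = np.asarray(data)
column_view = as_is[:, 0]

print(as_object.dtype, type(column_as_list[0]))
print(as_is is data, np.shares_memory(column_view, data))
```

The object conversion is presumably where the extra time goes for a 200000-row frame, since each column then has to be converted back to a float array on assignment; that reading is an inference from the diff, not something stated in the thread.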

phofl commented, Dec 17, 2020

No, I don’t think so. I missed something previously. The PR was about assigning new columns, not modifying existing ones.