Low performance when assigning to multiple columns

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((200000, 9)), columns=list('hijabcfde'))
# mix datatypes
df['z'] = 'u'
data = np.random.random((200000, 3))
# explicit loop is fast
%timeit for i, k in enumerate('hjf'): df[k] = data[:, i]  # 2.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setting through a list of columns is slow
%timeit df[list('hjf')] = data # 144 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[:, list('hjf')] = data # 137 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.iloc[:, df.columns.get_indexer_for(list('hjf'))] = data # 141 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that without the df['z'] = 'u' line, all the assignments are similarly fast, at around 5 ms.
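
While the list-based path remains slow on mixed-dtype frames, one possible workaround is to wrap the explicit loop from the sample above in a small helper. The following is a minimal sketch, not a pandas API: `assign_columns` is a hypothetical name, and it assumes `values` is a 2-D array with one column per label.

```python
import numpy as np
import pandas as pd


def assign_columns(df, labels, values):
    """Assign values[:, i] to df[labels[i]], one column at a time.

    Hypothetical helper (not part of pandas): it mirrors the fast explicit
    loop from the code sample instead of using df[labels] = values.
    """
    values = np.asarray(values)
    if values.ndim != 2 or values.shape[1] != len(labels):
        raise ValueError("values must be 2-D with one column per label")
    for i, label in enumerate(labels):
        df[label] = values[:, i]
    return df


# Smaller frame than in the report, with the same mixed-dtype layout.
df = pd.DataFrame(np.zeros((1000, 9)), columns=list("hijabcfde"))
df["z"] = "u"
data = np.random.random((1000, 3))

assign_columns(df, list("hjf"), data)
```

The helper modifies `df` in place and returns it for convenience. Whether it is actually faster than `df[labels] = values` depends on the installed pandas version, so it is worth timing both on your own data.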

Problem description

This difference in performance is strange and doesn’t seem justified. Running df[list('hjf')] = data through snakeviz gives the following output:

[Image: snakeviz output]

Most of the time is spent in the _sanitize_columns method.
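
A similar profile can be collected without snakeviz by using the standard-library profiler. The sketch below is only a rough reproduction on a smaller frame; the function names and timings it reports depend on the installed pandas version and may not match the ones seen in the original report.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((20000, 9)), columns=list("hijabcfde"))
df["z"] = "u"  # mix datatypes, as in the code sample
data = np.random.random((20000, 3))

profiler = cProfile.Profile()
profiler.enable()
df[list("hjf")] = data  # the list-of-columns assignment from the report
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```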

Expected Output

The assignment should not take much longer with mixed datatypes (at least if the columns being assigned to share a homogeneous dtype).

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0.23.3
pytest: 3.2.2
pip: 18.0
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: 0.10.8
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
jreback commented, Jul 30, 2018

This fixes it, though it causes a couple of test failures, which look like easy adjustments at first glance. I don’t really remember the rationale for this; it was an edge case, though.

diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
index 13c019dea..7bf1b07d8 100755
--- a/pandas/core/indexing.py
+++ b/pandas/core/indexing.py
@@ -590,7 +590,7 @@ class _NDFrameIndexer(_NDFrameIndexerBase):
 
                     # note that this coerces the dtype if we are mixed
                     # GH 7551
-                    value = np.array(value, dtype=object)
+                    value = np.asarray(value)
                     if len(labels) != value.shape[1]:
                         raise ValueError('Must have equal len keys and value '
                                          'when setting with an ndarray')
@@ -598,7 +598,7 @@ class _NDFrameIndexer(_NDFrameIndexerBase):
                     for i, item in enumerate(labels):
 
                         # setting with a list, recoerces
-                        setter(item, value[:, i].tolist())
+                        setter(item, value[:, i])
 
                 # we have an equal len list/ndarray
                 elif can_do_equal_len():
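
To see why the two changed lines in the diff above matter, the sketch below compares the original conversions with the patched ones on a plain NumPy array, outside pandas. It only illustrates the cost of the round-trip through Python objects; it does not reproduce the indexing code, and the array shape is simply taken from the code sample in the report.

```python
import timeit

import numpy as np

value = np.random.random((200000, 3))

# Before the patch: coerce every element to a Python object, then to a list.
as_object = np.array(value, dtype=object)
column_list = as_object[:, 0].tolist()

# After the patch: keep the float64 data; selecting a column is just a view.
as_float = np.asarray(value)
column_view = as_float[:, 0]

print(as_object.dtype, type(column_list[0]))
print(as_float.dtype, np.shares_memory(column_view, value))

# Rough timing of both conversions (numbers vary by machine).
old = timeit.timeit(lambda: np.array(value, dtype=object)[:, 0].tolist(), number=5)
new = timeit.timeit(lambda: np.asarray(value)[:, 0], number=5)
print(f"object + tolist: {old:.4f} s, asarray + slice: {new:.6f} s")
```

The object path allocates one Python float per element and then builds a list from them, whereas `np.asarray` returns the existing array unchanged when it is already an ndarray.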
0 reactions
phofl commented, Dec 17, 2020

No, I don’t think so. I missed something previously. The PR was about assigning new columns, not modifying existing ones.
