Low performance when assigning to multiple columns
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((200000, 9)), columns=list('hijabcfde'))
# mix datatypes
df['z'] = 'u'
data = np.random.random((200000, 3))
# explicit loop is fast
%timeit for i, k in enumerate('hjf'): df[k] = data[:, i] # 2.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# setting through a list of columns is slow
%timeit df[list('hjf')] = data # 144 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[:, list('hjf')] = data # 137 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.iloc[:, df.columns.get_indexer_for(list('hjf'))] = data # 141 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that without the df['z'] = 'u' line, all the assignments are similarly fast, around 5 ms.
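For readers without IPython, the comparison above can be sketched as a plain script using the standard timeit module. The helper names make_frame, assign_loop and assign_list are made up for this illustration, the frame is smaller than in the report to keep the run short, and the numbers will depend on the pandas version installed, so no timings are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 20_000  # smaller than the 200000 rows in the report, to keep runs short


def make_frame(n_rows=N_ROWS, mixed=True):
    """Build the frame from the report; mixed=True adds the string column 'z'."""
    df = pd.DataFrame(np.zeros((n_rows, 9)), columns=list("hijabcfde"))
    if mixed:
        df["z"] = "u"
    return df


def assign_loop(df, cols, data):
    """Assign one column at a time (the fast path in the report)."""
    for i, k in enumerate(cols):
        df[k] = data[:, i]


def assign_list(df, cols, data):
    """Assign all columns in one statement (the slow path in the report)."""
    df[cols] = data


cols = list("hjf")
data = np.random.random((N_ROWS, 3))

for mixed in (False, True):
    for fn in (assign_loop, assign_list):
        df = make_frame(mixed=mixed)
        # Best of 3 repeats, 5 calls each, reported per call in milliseconds.
        best = min(timeit.repeat(lambda: fn(df, cols, data), number=5, repeat=3))
        print(f"mixed={mixed!s:5} {fn.__name__:12} {best / 5 * 1e3:8.2f} ms")
```

Both helpers leave the frame with the same contents, so the loop can stand in for the list assignment as a workaround wherever the slowdown shows up.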
Problem description
This difference in performance is strange and doesn’t seem justified. Profiling df[list('hjf')] = data with snakeviz shows that most of the time is spent in the _sanitize_columns method.
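The profile can be approximated without snakeviz using the standard cProfile and pstats modules. This is a minimal sketch on a smaller frame than the one in the report. Which internal functions appear at the top of the listing depends on the pandas version, so the output may not match the observation above.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

n_rows = 20_000  # smaller than in the report, to keep the run short
df = pd.DataFrame(np.zeros((n_rows, 9)), columns=list("hijabcfde"))
df["z"] = "u"  # the mixed-dtype column that triggers the slowdown in the report
data = np.random.random((n_rows, 3))

# Profile only the list-of-columns assignment.
profiler = cProfile.Profile()
profiler.enable()
df[list("hjf")] = data
profiler.disable()

# Show the 15 entries with the largest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(15)
```

Calling profiler.dump_stats with a file path writes the same data to disk in a format that the snakeviz command-line tool can open.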
Expected Output
The assignment should not be much slower with mixed datatypes (at least if the columns being assigned to are of homogeneous dtype).
Output of pd.show_versions()
pandas: 0.23.3
pytest: 3.2.2
pip: 18.0
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: 1.1.0
pyarrow: None
xarray: 0.10.8
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (11 by maintainers)
Top GitHub Comments
This fixes it, though there are a couple of test failures, which look like easy adjustments at first glance. I don’t really remember the rationale for this; it was an edge case, though.
No, I don’t think so. I missed something previously. The PR was about assigning new columns, not modifying existing ones.