read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2
Code Sample, a copy-pastable example if possible
import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)
Problem description
pd.read_csv() using the _libs.parsers.TextReader read() method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.
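A profile of this kind can be collected with the standard-library cProfile and pstats modules. The sketch below is one way to do that, not necessarily how the listings in this report were produced; profile_top and profile_read_csv are hypothetical helper names introduced here for illustration.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=6, **kwargs):
    """Run func under cProfile and return the top entries by internal time as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # "tottime" sorts by time spent inside each function itself,
    # which pstats labels "Ordered by: internal time".
    stats.sort_stats("tottime").print_stats(limit)
    return out.getvalue()


def profile_read_csv(n_rows=1000000, n_cols=10):
    """Build the CSV from the code sample in memory and profile pd.read_csv on it."""
    # Imported here so profile_top stays usable without pandas installed.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        np.random.randn(n_rows, n_cols),
        columns=["COL{}".format(i) for i in range(n_cols)],
    )
    csv = io.StringIO(df.to_csv(index=False))
    return profile_top(pd.read_csv, csv)
```

Calling print(profile_read_csv()) on each installation gives a listing sorted by internal time that can be compared side by side; pass a smaller n_rows for a quicker run.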
4244 function calls (4210 primitive calls) in 10.273 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 10.202 10.202 10.204 10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.039 0.039 0.039 0.039 internals.py:5017(_stack_arrays)
1 0.011 0.011 10.262 10.262 parsers.py:414(_read)
1 0.011 0.011 10.273 10.273 <string>:1(<module>)
1 0.004 0.004 0.004 0.004 parsers.py:1685(__init__)
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
Expected Output
3229 function calls (3222 primitive calls) in 2.944 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.881 2.881 2.882 2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.045 0.045 0.045 0.045 internals.py:4801(_stack_arrays)
1 0.010 0.010 2.944 2.944 parsers.py:423(_read)
1 0.004 0.004 0.004 0.004 parsers.py:1677(__init__)
320 0.001 0.000 0.001 0.000 common.py:777(is_integer_dtype)
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.p
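As a quick check of the "3.5X" figure, the ratio can be recomputed from the numbers in the two listings above. The variable names are introduced here for illustration; the values are copied from the profiles.

```python
# Internal time (tottime) of TextReader.read() and total run time, in seconds,
# copied from the two profile listings.
slow_read, slow_total = 10.202, 10.273   # Pandas 0.23.4 on Python 3.7.1
fast_read, fast_total = 2.881, 2.944     # Pandas 0.22.0 on Python 3.5.2

read_ratio = slow_read / fast_read       # slowdown of read() alone
total_ratio = slow_total / fast_total    # slowdown of the whole statement
read_share = slow_read / slow_total      # fraction of the slow run spent in read()

print("read() slowdown: {:.2f}x".format(read_ratio))
print("overall slowdown: {:.2f}x".format(total_ratio))
print("share of slow run in read(): {:.1%}".format(read_share))
```

Both ratios come out close to 3.5, and nearly all of the slow run is spent inside read(), so the regression is confined to the TextReader rather than the surrounding Python code.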
Output of pd.show_versions() -- Latest Python 3.7.1 Pandas 0.23.4 : Slow Read CSV
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
Output of pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 5 years ago
- Comments:56 (24 by maintainers)
Top GitHub Comments
I compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% of samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time…
Redone everything forcing --channel anaconda, same results.
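The profiler finding above points at per-character digit classification that goes through locale handling. As a purely illustrative sketch (in Python rather than the C of the parsers extension, and not a description of any fix applied in pandas), a digit test that only needs to recognise ASCII 0-9 can be written as a plain range comparison with no locale or character-table lookup. is_ascii_digit is a hypothetical name.

```python
def is_ascii_digit(ch):
    """Return True only for a single character in the ASCII range '0'..'9'."""
    # A range comparison involves no locale state or Unicode table lookup.
    return len(ch) == 1 and "0" <= ch <= "9"


# Python's str.isdigit() consults Unicode data, so it accepts more than ASCII:
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(is_ascii_digit("7"), "7".isdigit())
print(is_ascii_digit(arabic_three), arabic_three.isdigit())
```

For a CSV tokenizer parsing numeric fields, only the ASCII case matters, which is why a general-purpose classification call in the inner loop can cost far more than the check it performs.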