question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

read_csv() is 3.5X Slower in Pandas 0.23.4 on Python 3.7.1 vs Pandas 0.22.0 on Python 3.5.2

See original GitHub issue

Code Sample, a copy-pastable example if possible

import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)

Problem description

pd.read_csv() using _libs.parsers.TextReader read() method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.

 4244 function calls (4210 primitive calls) in 10.273 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   10.202   10.202   10.204   10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.039    0.039    0.039    0.039 internals.py:5017(_stack_arrays)
        1    0.011    0.011   10.262   10.262 parsers.py:414(_read)
        1    0.011    0.011   10.273   10.273 <string>:1(<module>)
        1    0.004    0.004    0.004    0.004 parsers.py:1685(__init__)
      321    0.001    0.000    0.002    0.000 common.py:811(is_integer_dtype)

Expected Output

3229 function calls (3222 primitive calls) in 2.944 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.881    2.881    2.882    2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
        1    0.045    0.045    0.045    0.045 internals.py:4801(_stack_arrays)
        1    0.010    0.010    2.944    2.944 parsers.py:423(_read)
        1    0.004    0.004    0.004    0.004 parsers.py:1677(__init__)
      320    0.001    0.000    0.001    0.000 common.py:777(is_integer_dtype)
        1    0.001    0.001    0.001    0.001 {method 'close' of 'pandas._libs.p

Output of pd.show_versions() -- Latst Python 3.7.1 Pandas 0.23.4 : Slow Read CSV

INSTALLED VERSIONS

commit: None python: 3.7.1.final.0 python-bits: 64 OS: Windows OS-release: 2008ServerR2 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.23.4 pytest: 3.9.2 pip: 18.1 setuptools: 40.4.3 Cython: 0.29 numpy: 1.15.3 scipy: 1.1.0 pyarrow: 0.11.0 xarray: 0.10.9 IPython: 7.0.1 sphinx: 1.8.1 patsy: 0.5.0 dateutil: 2.7.3 pytz: 2018.5 blosc: 1.6.1 bottleneck: 1.2.1 tables: 3.4.4 numexpr: 2.6.8 feather: None matplotlib: 3.0.0 openpyxl: 2.5.9 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: None lxml: 4.2.5 bs4: 4.6.3 html5lib: 1.0.1 sqlalchemy: 1.2.12 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: 0.1.6 pandas_gbq: None pandas_datareader: None

Output of pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV

INSTALLED VERSIONS

commit: None python: 3.5.2.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.22.0 pytest: 3.5.0 pip: 9.0.3 setuptools: 20.10.1 Cython: 0.28.1 numpy: 1.14.2 scipy: 1.0.1 pyarrow: 0.9.0 xarray: 0.10.2 IPython: 6.3.0 sphinx: 1.7.2 patsy: 0.5.0 dateutil: 2.7.2 pytz: 2018.3 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.4 feather: 0.4.0 matplotlib: 2.2.2 openpyxl: 2.5.1 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: None lxml: 4.2.1 bs4: 4.6.0 html5lib: 0.9999999 sqlalchemy: 1.2.6 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: 0.1.5 pandas_gbq: None pandas_datareader: None

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Comments:56 (24 by maintainers)

github_iconTop GitHub Comments

2reactions
cgohlkecommented, Nov 10, 2018

I compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time…

3.7.0a3:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	705	347	28.52%	14.04%	parsers.cp37-win_amd64.pyd
   isdigit	207	207	8.37%	8.37%	ucrtbase.dll
 - _errno	105	39	4.25%	1.58%	ucrtbase.dll
   toupper	24	24	0.97%	0.97%	ucrtbase.dll
   isspace	21	21	0.85%	0.85%	ucrtbase.dll
   [python37.dll]	1	1	0.04%	0.04%	python37.dll
3.7.0a4:
Function Name	Inclusive Samples	Exclusive Samples	Inclusive Samples %	Exclusive Samples %	Module Name
 + [parsers.cp37-win_amd64.pyd]	8,613	478	83.04%	4.61%	parsers.cp37-win_amd64.pyd
 + isdigit	6,642	208	64.04%	2.01%	ucrtbase.dll
 + _isdigit_l	6,434	245	62.03%	2.36%	ucrtbase.dll
 + _LocaleUpdate::_LocaleUpdate	5,806	947	55.98%	9.13%	ucrtbase.dll
 + __acrt_getptd	2,121	1,031	20.45%	9.94%	ucrtbase.dll
   FlsGetValue	647	647	6.24%	6.24%	KernelBase.dll
 - RtlSetLastWin32Error	296	235	2.85%	2.27%	ntdll.dll
   _guard_dispatch_icall_nop	101	101	0.97%	0.97%	ucrtbase.dll
   GetLastError	46	46	0.44%	0.44%	KernelBase.dll
 + __acrt_update_multibyte_info	1,475	246	14.22%	2.37%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,229	513	11.85%	4.95%	ucrtbase.dll
 + __acrt_update_locale_info	1,263	235	12.18%	2.27%	ucrtbase.dll
 - __crt_state_management::get_current_state_index	1,028	429	9.91%	4.14%	ucrtbase.dll
   _ischartype_l	383	383	3.69%	3.69%	ucrtbase.dll
1reaction
toniatopcommented, Nov 8, 2018

Redone everything forcing --channel anaconda, same results.

Read more comments on GitHub >

github_iconTop Results From Across the Web

faster pandas, pandas alternative for big data, pandas read_csv slow,
pd.read_csv () using _libs.parsers.TextReader read () method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2....
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found