PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
```python
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

driver = "{ODBC Driver 17 for SQL Server}"
url, database, uid, pwd = ...  # connection details omitted

params = urllib.parse.quote(
    "DRIVER=" + driver + ";"
    "SERVER=" + url + ";"
    "DATABASE=" + database + ";"
    "UID=" + uid + ";"
    "PWD=" + pwd
)
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect()

df = pd.read_sql("SELECT * FROM Table", conn)
```
After upgrading to 1.4.0rc, I noticed that the execution time of one of my scripts increased from a few seconds to 2-3 minutes. The performance regression comes from a simple pd.read_sql call. After downgrading back to 1.3.5, the execution time is again on the order of seconds.
I’m sorry I cannot provide more details; I’m not an expert in debugging at this level.
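One way to narrow down where the time goes is to profile the single read_sql call. This is a minimal sketch assuming the engine/conn setup from the example above; the profile file name is arbitrary:

```python
import cProfile
import pstats

# Profile only the pd.read_sql call; pd and conn come from the setup above.
cProfile.run('df = pd.read_sql("SELECT * FROM Table", conn)', "read_sql.prof")

# Show the 20 most expensive calls by cumulative time.
pstats.Stats("read_sql.prof").sort_stats("cumulative").print_stats(20)
```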
Installed Versions
```
INSTALLED VERSIONS
commit            : d023ba755322e09b95fd954bbdc43f5be224688e
python            : 3.10.1.final.0
python-bits       : 64
OS                : Linux
OS-release        : 5.15.12-200.fc35.x86_64
Version           : #1 SMP Wed Dec 29 15:03:38 UTC 2021
machine           : x86_64
processor         : x86_64
byteorder         : little
LC_ALL            : None
LANG              : it_IT.UTF-8
LOCALE            : it_IT.UTF-8

pandas            : 1.4.0rc0
numpy             : 1.21.5
pytz              : 2021.3
dateutil          : 2.8.1
pip               : 21.2.3
setuptools        : 57.4.0
Cython            : 0.29.24
pytest            : 6.2.4
hypothesis        : None
sphinx            : 4.1.2
blosc             : None
feather           : None
xlsxwriter        : 3.0.2
lxml.etree        : 4.6.3
html5lib          : 1.1
pymysql           : None
psycopg2          : None
jinja2            : 3.0.1
IPython           : 7.26.0
pandas_datareader : 0.10.0
bs4               : 4.9.3
bottleneck        : 1.3.2
fsspec            : None
fastparquet       : None
gcsfs             : None
matplotlib        : 3.5.1
numexpr           : 2.7.1
odfpy             : None
openpyxl          : 3.0.3
pandas_gbq        : None
pyarrow           : None
pyxlsb            : 1.0.9
s3fs              : None
scipy             : 1.7.3
sqlalchemy        : 1.4.29
tables            : 3.6.1
tabulate          : None
xarray            : None
xlrd              : 2.0.1
xlwt              : 1.3.0
numba             : None
zstandard         : None
```
Prior Performance
pandas : 1.3.5 (with 1.3.5 the same query completes in a few seconds)
OK, that gives some more insight: in the pandas 1.4 version, the time is completely taken by the “table reflection” when initializing the underlying class (and not the actual sql query). I should actually also have seen that in the text summary output …
So the time is all taken by this init:
https://github.com/pandas-dev/pandas/blob/d023ba755322e09b95fd954bbdc43f5be224688e/pandas/io/sql.py#L1376-L1381
This code was touched by https://github.com/pandas-dev/pandas/pull/43116, which replaced the previous bound `MetaData` construction with an unbound `MetaData` followed by an explicit `reflect()` call.
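Roughly, the two patterns being contrasted look like the sketch below. This is approximate, not the exact pandas diff; the placeholder engine and the `connectable`/`schema` names are illustrative only.

```python
from sqlalchemy import MetaData, create_engine

connectable = create_engine("sqlite://")  # placeholder engine for illustration
schema = None

# Old style (pandas 1.3.x, approximate): bound MetaData, no eager reflection.
meta_old = MetaData(connectable, schema=schema)

# New style (pandas 1.4.0rc, approximate): unbound MetaData plus an explicit
# reflect(), which loads every table definition from the database up front.
meta_new = MetaData(schema=schema)
meta_new.reflect(bind=connectable)
```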
cc @fangchenli do you remember why the `reflect(..)` was needed? The `reflect` method will load all available table definitions from the database, which can be expensive (as illustrated by this report), and is also not needed generally, I think (e.g. when only executing a SQL query).

The old usage of MetaData will be removed in sqlalchemy 2.0. See https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#implicit-and-connectionless-execution-bound-metadata-removed for details.
Instead of reflecting all tables in init, #45371 delays the reflection step to the `get_table` method.
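A minimal sketch of that delayed-reflection idea (not the literal #45371 change; the engine and function name here are placeholders): reflect only the table that is actually requested, at the moment it is requested.

```python
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("sqlite://")  # placeholder engine for illustration
meta = MetaData()

def get_table(table_name, schema=None):
    # Reflect just this one table on demand, instead of loading every
    # table definition from the database during initialization.
    return Table(table_name, meta, autoload_with=engine, schema=schema)
```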