PERF: 1.4.0rc1 Execution time of pd.read_sql increased from seconds to minutes
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
```python
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

driver = "{ODBC Driver 17 for SQL Server}"
url, database, uid, pwd = ...  # connection details omitted

params = urllib.parse.quote(
    "DRIVER=" + driver + ";"
    "SERVER=" + url + ";"
    "DATABASE=" + database + ";"
    "UID=" + uid + ";"
    "PWD=" + pwd
)
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect()

df = pd.read_sql("SELECT * FROM Table", conn)
```
After upgrading to 1.4.0rc, I noticed that the execution time of one of my scripts increased from a few seconds to 2-3 minutes. The performance regression comes from a simple pd.read_sql call. After downgrading back to 1.3.5, the execution time is again on the order of seconds.
I’m sorry I cannot provide more details; I’m not an expert in debugging at this level.
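One way to narrow down where the time goes is to profile the single read_sql call. This is a minimal sketch assuming the engine/conn setup from the example above; the profile file name is arbitrary:

```python
import cProfile
import pstats

# Profile only the pd.read_sql call; pd and conn come from the setup above.
cProfile.run('df = pd.read_sql("SELECT * FROM Table", conn)', "read_sql.prof")

# Show the 20 most expensive calls by cumulative time.
pstats.Stats("read_sql.prof").sort_stats("cumulative").print_stats(20)
```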
Installed Versions
```
INSTALLED VERSIONS
commit            : d023ba755322e09b95fd954bbdc43f5be224688e
python            : 3.10.1.final.0
python-bits       : 64
OS                : Linux
OS-release        : 5.15.12-200.fc35.x86_64
Version           : #1 SMP Wed Dec 29 15:03:38 UTC 2021
machine           : x86_64
processor         : x86_64
byteorder         : little
LC_ALL            : None
LANG              : it_IT.UTF-8
LOCALE            : it_IT.UTF-8

pandas            : 1.4.0rc0
numpy             : 1.21.5
pytz              : 2021.3
dateutil          : 2.8.1
pip               : 21.2.3
setuptools        : 57.4.0
Cython            : 0.29.24
pytest            : 6.2.4
hypothesis        : None
sphinx            : 4.1.2
blosc             : None
feather           : None
xlsxwriter        : 3.0.2
lxml.etree        : 4.6.3
html5lib          : 1.1
pymysql           : None
psycopg2          : None
jinja2            : 3.0.1
IPython           : 7.26.0
pandas_datareader : 0.10.0
bs4               : 4.9.3
bottleneck        : 1.3.2
fsspec            : None
fastparquet       : None
gcsfs             : None
matplotlib        : 3.5.1
numexpr           : 2.7.1
odfpy             : None
openpyxl          : 3.0.3
pandas_gbq        : None
pyarrow           : None
pyxlsb            : 1.0.9
s3fs              : None
scipy             : 1.7.3
sqlalchemy        : 1.4.29
tables            : 3.6.1
tabulate          : None
xarray            : None
xlrd              : 2.0.1
xlwt              : 1.3.0
numba             : None
zstandard         : None
```
Prior Performance
pandas : 1.3.5 (with 1.3.5 the same query completes in a few seconds)
OK, that gives some more insight: in the pandas 1.4 version, the time is completely taken by the “table reflection” when initializing the underlying class (and not the actual sql query). I should actually also have seen that in the text summary output …
So the time is all taken by this init:
https://github.com/pandas-dev/pandas/blob/d023ba755322e09b95fd954bbdc43f5be224688e/pandas/io/sql.py#L1376-L1381
This code was touched by https://github.com/pandas-dev/pandas/pull/43116, which replaced the previous bound `MetaData` construction with an unbound `MetaData` followed by an explicit `reflect()` call.
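Roughly, the two patterns being contrasted look like the sketch below. This is approximate, not the exact pandas diff; the placeholder engine and the `connectable`/`schema` names are illustrative only.

```python
from sqlalchemy import MetaData, create_engine

connectable = create_engine("sqlite://")  # placeholder engine for illustration
schema = None

# Old style (pandas 1.3.x, approximate): bound MetaData, no eager reflection.
meta_old = MetaData(connectable, schema=schema)

# New style (pandas 1.4.0rc, approximate): unbound MetaData plus an explicit
# reflect(), which loads every table definition from the database up front.
meta_new = MetaData(schema=schema)
meta_new.reflect(bind=connectable)
```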
cc @fangchenli do you remember why the `reflect(..)` was needed? The `reflect` method will load all available table definitions from the database, which can be expensive (as illustrated by this report), and is also not needed generally, I think (e.g. when only executing a SQL query).

The old usage of MetaData will be removed in sqlalchemy 2.0. See https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#implicit-and-connectionless-execution-bound-metadata-removed for details.
Instead of reflecting all tables in init, #45371 delays the reflection step to the `get_table` method.
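A minimal sketch of that delayed-reflection idea (not the literal #45371 change; the engine and function name here are placeholders): reflect only the table that is actually requested, at the moment it is requested.

```python
from sqlalchemy import MetaData, Table, create_engine

engine = create_engine("sqlite://")  # placeholder engine for illustration
meta = MetaData()

def get_table(table_name, schema=None):
    # Reflect just this one table on demand, instead of loading every
    # table definition from the database during initialization.
    return Table(table_name, meta, autoload_with=engine, schema=schema)
```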