consider disabling #4645 for empty in and generate different cache key instead
Describe the bug
Somewhere between 1.3.24 and 1.4.0 the treatment of empty tuples used with in_() changed. In 1.3.x, an empty tuple was omitted from the compiled query. In 1.4.x, I see that the new SQLCompiler (or, in my specific case, the PGCompiler subclass) creates a placeholder with the POSTCOMPILE prefix. I think it does this in https://github.com/sqlalchemy/sqlalchemy/blob/91562f56185e1e69676712ccb109865a6622a538/lib/sqlalchemy/sql/compiler.py#L1920
After the upgrade to 1.4.x, I see a massive CPU spike on my DB when one of these queries is executed. I'm guessing that this may be due to a new value being created for each comparison, but I don't know enough about PG or SQLAlchemy internals to say for certain.
In my case, if an empty tuple is provided, the value of the POSTCOMPILE placeholder resolves to something like (SELECT CAST(NULL AS VARCHAR(40)) WHERE 1!=1). I'm not entirely sure why this causes a massive increase in query time and CPU usage, but it does. The table being queried has over 400 million records in it (NSRL dataset).
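For illustration, a minimal sketch (assuming SQLAlchemy 1.4+; the table and column names here are invented for the example) showing the POSTCOMPILE placeholder that an empty in_() leaves in the compiled SQL string:

```python
# Sketch (SQLAlchemy 1.4+ assumed): an empty in_() list produces an
# "expanding" bind parameter that appears as a POSTCOMPILE marker in the
# compiled string; it is only expanded (e.g. to the SELECT CAST(...) form
# on PostgreSQL) at execution time.
from sqlalchemy import Column, Integer, MetaData, String, Table, select
from sqlalchemy.dialects import postgresql

metadata = MetaData()
t = Table(
    "hashes", metadata,
    Column("id", Integer, primary_key=True),
    Column("sha1", String(40)),
)

stmt = select(t.c.id).where(t.c.sha1.in_([]))
compiled = str(stmt.compile(dialect=postgresql.dialect()))
print(compiled)  # WHERE clause contains a POSTCOMPILE-style marker for sha1
```

Compiling without executing, as above, is enough to see the placeholder; the expensive expansion only happens when the statement is sent to the database.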
With 1.3.x, the empty tuple was actually dropped from the query entirely:
SELECT hashes.id, hashes.md5, hashes.sha1
FROM hashes
WHERE hashes.md5 IN (%(md5_1)s) OR 1 != 1
In 1.4.x, the empty tuple is replaced with a POSTCOMPILE placeholder (which expands to the cast at execution time):
SELECT hashes.id
FROM hashes
WHERE hashes.md5 IN ([POSTCOMPILE_md5_1]) OR hashes.sha1 IN ([POSTCOMPILE_sha1_1])
Some issues I think may be related:
- https://github.com/sqlalchemy/sqlalchemy/issues/6290
- https://github.com/sqlalchemy/sqlalchemy/issues/6222
- https://github.com/sqlalchemy/sqlalchemy/issues/4271
Expected behavior
My queries should not take minutes.
To Reproduce
import sqlalchemy
from sqlalchemy import Table, Column, String, Integer, or_

metadata = sqlalchemy.MetaData()
hashes_table = Table(
    'hashes', metadata,
    Column('id', Integer, primary_key=True),
    Column('md5', String(32)),
    Column('sha1', String(40)),
)

sel = sqlalchemy.select(
    hashes_table.columns.id  # 1.4.x
    # hashes_table.columns  # 1.3.x (select() takes a list of columns)
).where(
    or_(
        hashes_table.columns['md5'].in_(['some_md5_sum']),
        hashes_table.columns['sha1'].in_([])
    )
)

engine = sqlalchemy.create_engine('postgresql://...')
q = sel.compile(dialect=engine.dialect)
print(q)

connection = engine.connect()
results = connection.execute(q).fetchall()
# This will take a really long time on large tables
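As a possible client-side workaround (a sketch under the assumption that skipping the clause is acceptable; hash_filter is an invented helper, not part of any API), the empty in_() can be avoided entirely by substituting a constant-false term, which restores the practical effect of the 1.3 behavior:

```python
# Workaround sketch: never build in_() over an empty list; use false()
# instead, so no empty-set expansion reaches the database.
from sqlalchemy import Column, Integer, MetaData, String, Table, false, or_, select
from sqlalchemy.dialects import postgresql

metadata = MetaData()
hashes_table = Table(
    'hashes', metadata,
    Column('id', Integer, primary_key=True),
    Column('md5', String(32)),
    Column('sha1', String(40)),
)

def hash_filter(md5s, sha1s):
    # Each IN clause is built only when its value list is non-empty;
    # an empty list contributes a constant-false term instead.
    clauses = [
        col.in_(vals) if vals else false()
        for col, vals in ((hashes_table.c.md5, md5s),
                          (hashes_table.c.sha1, sha1s))
    ]
    return or_(*clauses)

stmt = select(hashes_table.c.id).where(hash_filter(['some_md5_sum'], []))
compiled = str(stmt.compile(dialect=postgresql.dialect()))
print(compiled)  # only the md5 IN clause survives; no empty sha1 IN is emitted
```

Since the sha1 list is empty, no sha1 IN clause is generated at all, so the problematic expansion never appears in the statement.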
Error
No stack trace available, as the query does complete eventually. The issue is the inefficiency of the generated query; specifically, I think, the nested SELECT CAST(...) subquery.
Versions
- OS: Any
- Python: 3.7
- SQLAlchemy: 1.4.x
- Database: PostgreSQL
- DBAPI:
Additional context
This issue seems to show up only with large tables. My specific use-case is with the NSRL File table from the NSRL RDS set. It contains over 400 million records of common binaries and their cryptographic hashes.
Have a nice day!
Issue Analytics
- Created 2 years ago
- Comments: 30 (23 by maintainers)
Confirmed that your fix seems to work fine.
Thanks for everyone’s work on this!
OK @CaselIT has another idea that is a lot more self contained if it works…