consider disabling #4645 for empty in and generate different cache key instead
Describe the bug
Somewhere between 1.3.24 and 1.4.0 the treatment of empty tuples used with in_() changed. In 1.3.x, an empty tuple was omitted from the compiled query. In 1.4.x, I see that the new SQLCompiler (or, in my specific case, the PGCompiler subclass) creates a placeholder with the POSTCOMPILE prefix. I think it does this in https://github.com/sqlalchemy/sqlalchemy/blob/91562f56185e1e69676712ccb109865a6622a538/lib/sqlalchemy/sql/compiler.py#L1920
After the upgrade to 1.4.x, I see a massive CPU spike on my DB when one of these queries is executed. I'm guessing that this may be due to a new value being created for each comparison, but I don't know enough about PG or SQLAlchemy internals to say for certain.
In my case, if an empty tuple is provided, the value of the POSTCOMPILE placeholder resolves to something like (SELECT CAST(NULL AS VARCHAR(40)) WHERE 1!=1). I'm not entirely sure why this causes a massive increase in query time and CPU usage, but it does. The table being queried has over 400 million records in it (NSRL dataset).
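For illustration, a minimal sketch (assuming SQLAlchemy 1.4+; the table and column names here are invented for the example) showing the POSTCOMPILE placeholder that an empty in_() leaves in the compiled SQL string:

```python
# Sketch (SQLAlchemy 1.4+ assumed): an empty in_() list produces an
# "expanding" bind parameter that appears as a POSTCOMPILE marker in the
# compiled string; it is only expanded (e.g. to the SELECT CAST(...) form
# on PostgreSQL) at execution time.
from sqlalchemy import Column, Integer, MetaData, String, Table, select
from sqlalchemy.dialects import postgresql

metadata = MetaData()
t = Table(
    "hashes", metadata,
    Column("id", Integer, primary_key=True),
    Column("sha1", String(40)),
)

stmt = select(t.c.id).where(t.c.sha1.in_([]))
compiled = str(stmt.compile(dialect=postgresql.dialect()))
print(compiled)  # WHERE clause contains a POSTCOMPILE-style marker for sha1
```

Compiling without executing, as above, is enough to see the placeholder; the expensive expansion only happens when the statement is sent to the database.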
With 1.3.x, the empty tuple was actually dropped from the query entirely:
SELECT hashes.id, hashes.md5, hashes.sha1
FROM hashes
WHERE hashes.md5 IN (%(md5_1)s) OR 1 != 1
In 1.4.x, the empty tuple is replaced with a POSTCOMPILE placeholder (which expands to the cast at execution time):
SELECT hashes.id
FROM hashes
WHERE hashes.md5 IN ([POSTCOMPILE_md5_1]) OR hashes.sha1 IN ([POSTCOMPILE_sha1_1])
Some issues I think may be related:
- https://github.com/sqlalchemy/sqlalchemy/issues/6290
- https://github.com/sqlalchemy/sqlalchemy/issues/6222
- https://github.com/sqlalchemy/sqlalchemy/issues/4271
Expected behavior
My queries should not take minutes.
To Reproduce
import sqlalchemy
from sqlalchemy import Table, Column, String, Integer, or_

metadata = sqlalchemy.MetaData()
hashes_table = Table(
    'hashes', metadata,
    Column('id', Integer, primary_key=True),
    Column('md5', String(32)),
    Column('sha1', String(40)),
)

sel = sqlalchemy.select(
    hashes_table.columns.id  # 1.4.x
    # hashes_table.columns  # 1.3.x (select() takes a list of columns)
).where(
    or_(
        hashes_table.columns['md5'].in_(['some_md5_sum']),
        hashes_table.columns['sha1'].in_([])
    )
)

engine = sqlalchemy.create_engine('postgresql://...')
q = sel.compile(dialect=engine.dialect)
print(q)

connection = engine.connect()
results = connection.execute(q).fetchall()
# This will take a really long time on large tables
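As a possible client-side workaround (a sketch under the assumption that skipping the clause is acceptable; hash_filter is an invented helper, not part of any API), the empty in_() can be avoided entirely by substituting a constant-false term, which restores the practical effect of the 1.3 behavior:

```python
# Workaround sketch: never build in_() over an empty list; use false()
# instead, so no empty-set expansion reaches the database.
from sqlalchemy import Column, Integer, MetaData, String, Table, false, or_, select
from sqlalchemy.dialects import postgresql

metadata = MetaData()
hashes_table = Table(
    'hashes', metadata,
    Column('id', Integer, primary_key=True),
    Column('md5', String(32)),
    Column('sha1', String(40)),
)

def hash_filter(md5s, sha1s):
    # Each IN clause is built only when its value list is non-empty;
    # an empty list contributes a constant-false term instead.
    clauses = [
        col.in_(vals) if vals else false()
        for col, vals in ((hashes_table.c.md5, md5s),
                          (hashes_table.c.sha1, sha1s))
    ]
    return or_(*clauses)

stmt = select(hashes_table.c.id).where(hash_filter(['some_md5_sum'], []))
compiled = str(stmt.compile(dialect=postgresql.dialect()))
print(compiled)  # only the md5 IN clause survives; no empty sha1 IN is emitted
```

Since the sha1 list is empty, no sha1 IN clause is generated at all, so the problematic expansion never appears in the statement.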
Error
No stack trace available, as the query does complete eventually. The issue is the inefficiency of the generated query; specifically, I think, the nested SELECT CAST(...) subquery.
Versions
- OS: Any
- Python: 3.7
- SQLAlchemy: 1.4.x
- Database: PostgreSQL
- DBAPI:
Additional context
This issue seems to show up only with large tables. My specific use-case is with the NSRL File table from the NSRL RDS set. It contains over 400 million records of common binaries and their cryptographic hashes.
Have a nice day!
Issue Analytics
- Created 2 years ago
- Comments: 30 (23 by maintainers)
Confirmed that your fix seems to work fine.
Thanks for everyone’s work on this!
OK @CaselIT has another idea that is a lot more self contained if it works…