consider disabling #4645 for empty in and generate different cache key instead

Describe the bug

Somewhere between 1.3.24 and 1.4.0, the treatment of empty tuples used with in_ changed. In 1.3.x, an empty tuple was omitted from the compiled query. In 1.4.x, I see that the new SQLCompiler (or in my specific case the PGCompiler subclass) creates a placeholder with the POSTCOMPILE prefix. I think it does this in https://github.com/sqlalchemy/sqlalchemy/blob/91562f56185e1e69676712ccb109865a6622a538/lib/sqlalchemy/sql/compiler.py#L1920

https://github.com/sqlalchemy/sqlalchemy/blob/7fdaac7b2910b5612420378519b9f60d4649daff/lib/sqlalchemy/dialects/postgresql/base.py#L2218

After the upgrade to 1.4.x, I see a massive CPU spike on my DB when one of these queries is executed. I’m guessing that this may be due to a new value being created for each comparison but I don’t know enough about PG or SQLA internals to say for certain.

In my case, if an empty tuple was provided, the value of the POSTCOMPILE placeholder resolves to something like (SELECT CAST(NULL AS VARCHAR(40)) WHERE 1!=1). I’m not entirely sure why this causes a massive increase in query time and CPU usage, but it does. The table being queried has over 400 million records in it (NSRL dataset).

With 1.3.x, the empty tuple never reached the database as an IN list; it was compiled to the constant false expression 1 != 1:

SELECT hashes.id, hashes.md5, hashes.sha1
FROM hashes
WHERE hashes.md5 IN (%(md5_1)s) OR 1 != 1

In 1.4.x, both IN lists compile to POSTCOMPILE placeholders, and the placeholder for the empty tuple is later replaced with the cast subquery:

SELECT hashes.id
FROM hashes
WHERE hashes.md5 IN ([POSTCOMPILE_md5_1]) OR hashes.sha1 IN ([POSTCOMPILE_sha1_1])

Some issues I think may be related: https://github.com/sqlalchemy/sqlalchemy/issues/6290 https://github.com/sqlalchemy/sqlalchemy/issues/6222 https://github.com/sqlalchemy/sqlalchemy/issues/4271

Expected behavior

My queries should not take minutes.

To Reproduce

import sqlalchemy
from sqlalchemy import Table, Column, String, Integer, or_

metadata = sqlalchemy.MetaData()

hashes_table = Table(
    'hashes', metadata,
    Column('id', Integer, primary_key=True),
    Column('md5', String(32)),
    Column('sha1', String(40)),
)

sel = sqlalchemy.select(
    hashes_table.columns.id  # 1.4.x
    # hashes_table.columns  # 1.3.x
).where(
    or_(
        hashes_table.columns['md5'].in_(['some_md5_sum']),
        hashes_table.columns['sha1'].in_([])
    )
)

engine = sqlalchemy.create_engine('postgresql://...')
q = sel.compile(dialect=engine.dialect)
print(q)

connection = engine.connect()
results = connection.execute(q).fetchall()
# This will take a really long time in large tables

Error

No stack trace is available, as the query does complete eventually. The issue is the inefficient behavior of the resulting query; I suspect the nested SELECT CAST(...) specifically.
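One client-side way to sidestep the nested SELECT CAST(...) is to never pass an empty list to in_() in the first place. The following is only a sketch of that idea, not the maintainers' fix: the helper name non_empty_in_clauses is hypothetical, and it assumes the filters are combined with or_, so a filter with no values can simply be dropped.

```python
def non_empty_in_clauses(columns, filters):
    """Return column.in_(values) clauses, skipping filters with no values.

    `columns` is any mapping of name -> object with an `in_()` method,
    such as `hashes_table.columns`; `filters` maps column names to
    iterables of candidate values. (Hypothetical helper, not SQLAlchemy API.)
    """
    clauses = []
    for name, values in filters.items():
        values = list(values)
        if values:  # an empty list would otherwise compile to an empty IN
            clauses.append(columns[name].in_(values))
    return clauses


# Hypothetical usage with the table from the reproduction above:
#
#   clauses = non_empty_in_clauses(
#       hashes_table.columns,
#       {"md5": ["some_md5_sum"], "sha1": []},
#   )
#   if clauses:
#       sel = sqlalchemy.select(hashes_table.columns.id).where(or_(*clauses))
#   else:
#       ...  # every list was empty, so no row can match; skip the query
```

Dropping a clause is only safe under OR. If the filters were combined with and_, a single empty list would mean no row can match, and the query should be skipped instead.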

Versions.

  • OS: Any
  • Python: Python3.7
  • SQLAlchemy: 1.4.x
  • Database: PostgreSQL
  • DBAPI:

Additional context

This issue seems to only show up with large tables. My specific use-case is with the NSRL File table from the NSRL RDS set. It contains over 400 million records of common binaries and their cryptographic hashes.

Have a nice day!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 30 (23 by maintainers)

Top GitHub Comments

3 reactions
b0urb0n commented, Apr 29, 2021

Confirmed that your fix seems to work fine:

EXPLAIN ANALYZE SELECT *
FROM nsrl_file
WHERE md5 IN ('6507c499f9f66673de194ecf2b1b0c0c', 'a87ff679a2f3e71d9181a67b7542122c') OR sha1 IN (NULL) AND 1 != 1;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on nsrl_file  (cost=512.70..20557.14 rows=5087 width=116) (actual time=0.292..0.444 rows=164 loops=1)
   Recheck Cond: ((md5)::text = ANY ('{6507c499f9f66673de194ecf2b1b0c0c,a87ff679a2f3e71d9181a67b7542122c}'::text[]))
   ->  Bitmap Index Scan on ix_nsrl_file_md5  (cost=0.00..511.42 rows=5087 width=0) (actual time=0.277..0.277 rows=164 loops=1)
         Index Cond: ((md5)::text = ANY ('{6507c499f9f66673de194ecf2b1b0c0c,a87ff679a2f3e71d9181a67b7542122c}'::text[]))
 Total runtime: 0.475 ms
(5 rows)

Thanks for everyone’s work on this!
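The three renderings of an empty IN that appear in this thread (the 1.3.x constant, the 1.4.0 cast subquery, and the form in the EXPLAIN above) should be logically equivalent; only their cost on PostgreSQL differs. Below is a minimal sketch that checks the equivalence, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in for PostgreSQL and a toy three-row hashes table. The labels in EMPTY_IN_FORMS and the helper matching_ids are illustrative names, and the sketch says nothing about planner behavior or timing.

```python
import sqlite3

# Toy stand-in for the real table; SQLite is assumed here, not PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hashes (id INTEGER PRIMARY KEY, md5 VARCHAR(32), sha1 VARCHAR(40))"
)
conn.executemany(
    "INSERT INTO hashes (md5, sha1) VALUES (?, ?)",
    [("aaa", "s1"), ("bbb", "s2"), ("ccc", None)],
)

# Ways of rendering `sha1 IN ()` that are quoted in this thread.
EMPTY_IN_FORMS = {
    "1.3.x constant": "1 != 1",
    "1.4.0 subquery": "sha1 IN (SELECT CAST(NULL AS VARCHAR(40)) WHERE 1!=1)",
    "after the fix": "sha1 IN (NULL) AND 1 != 1",
}


def matching_ids(empty_in_sql):
    """Run the OR query with one rendering of the empty IN; return matching ids."""
    sql = f"SELECT id FROM hashes WHERE md5 IN (?) OR ({empty_in_sql}) ORDER BY id"
    return [row[0] for row in conn.execute(sql, ("aaa",))]


results = {label: matching_ids(sql) for label, sql in EMPTY_IN_FORMS.items()}
for label, ids in results.items():
    print(label, ids)
```

Every form selects only the row matched by the md5 filter, which is why the choice between them is purely a performance question.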

1 reaction
zzzeek commented, Apr 28, 2021

OK @CaselIT has another idea that is a lot more self contained if it works…
