Massive overhead when iterating over 1k+ rows in postgres even with server side cursors
I'm seeing an inexplicably large overhead when iterating over a postgres table. I profiled the code, and also ran a smoke test with SQLAlchemy to make sure it wasn't a slow connection or the underlying driver (psycopg2).
I'm running this against a postgres table of ~1M records but fetching only a tiny fraction of that.
```python
import time

import peewee
import sqlalchemy
from playhouse import postgres_ext
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.engine.url import URL as AlchemyURL
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker as alchemy_sessionmaker

user = 'XXX'
password = 'XXX'
database = 'XXX'
host = 'XXX'
port = 5432

table = 'person'
limit = 1000

peewee_db = postgres_ext.PostgresqlExtDatabase(
    database=database,
    host=host, port=port,
    user=user, password=password,
    use_speedups=True,
    server_side_cursors=True,
    register_hstore=False,
)

alchemy_engine = sqlalchemy.create_engine(
    AlchemyURL('postgresql', username=user, password=password,
               database=database, host=host, port=port))
alchemy_session = alchemy_sessionmaker(bind=alchemy_engine)()


class PeeweePerson(peewee.Model):
    class Meta:
        database = peewee_db
        db_table = table

    id = peewee.CharField(primary_key=True, max_length=64)
    data = postgres_ext.BinaryJSONField(index=True, index_type='GIN')


class SQLAlchemyPerson(declarative_base()):
    __tablename__ = table

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    data = sqlalchemy.Column(JSONB)


def run_raw_query():
    ids = list(peewee_db.execute_sql(f"SELECT id from {table} order by id desc limit {limit}"))
    return ids


def run_peewee_query():
    query = PeeweePerson.select(PeeweePerson.id).order_by(PeeweePerson.id.desc()).limit(limit)
    ids = list(query.tuples())
    return ids


def run_sqlalchemy_query():
    query = alchemy_session.query(SQLAlchemyPerson.id).order_by(sqlalchemy.desc(SQLAlchemyPerson.id)).limit(limit)
    ids = list(query)
    return ids


if __name__ == '__main__':
    t0 = time.time()
    raw_result = run_raw_query()
    t1 = time.time()
    print(f'Raw: {t1 - t0}')

    t2 = time.time()
    sqlalchemy_result = run_sqlalchemy_query()
    t3 = time.time()
    print(f'SQLAlchemy: {t3 - t2}')

    t4 = time.time()
    peewee_result = run_peewee_query()
    t5 = time.time()
    print(f'peewee: {t5 - t4}')

    assert raw_result == sqlalchemy_result == peewee_result
```
Outputs

- With `limit = 1000`:

```
Raw: 0.02643609046936035
SQLAlchemy: 0.03697466850280762
peewee: 1.0509874820709229
```

- With `limit = 10000`:

```
Raw: 0.15931344032287598
SQLAlchemy: 0.07229042053222656
peewee: 10.82826042175293
```
Both examples use server side cursors.
I briefly profiled this, and it looks like 95%+ of the time is spent calling cursor.fetchone:
https://github.com/coleifer/peewee/blob/d8e34b0682d87bd56c1a3636445d9c0fccf2b1e2/peewee.py#L2340
I'll continue profiling this, but was wondering if you knew what was up?
Issue Analytics
- State:
- Created 6 years ago
- Comments:13 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments

- Leaving link to psycopg2 docs for handy reference: http://initd.org/psycopg/docs/usage.html#server-side-cursors
- Haven't had a chance to try it out yet - got pulled into other work.