Memory leak inserting with df.to_sql using nvarchar(max) and fast_executemany
Environment
- Python: 3.8.5
- pyodbc: 4.0.34; pandas 1.4.2; SQLAlchemy 1.4.32
- OS: Ubuntu 20 (WSL2) and a Docker container
- DB: Azure SQL Database
- Driver: ODBC Driver 18 for SQL Server (same behavior with 17)
Issue
Using nvarchar(max) with fast_executemany enabled, a memory leak is observed. There was a similar leak in the past (issue 854 and pull 832) that was fixed, but this looks to be a different issue.
Turning off fast_executemany, or using turbodbc instead, does not exhibit the problem. Other column types were also tried and no leak was observed.
Note: whether the leak occurs depends on how the DataFrame is created before the call to to_sql. If the DataFrame is built with csv.DictReader() or pd.read_sql(), the leak occurs; a DataFrame created with pd.read_csv() does not leak. The two construction paths are sketched side by side below.
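A minimal standalone sketch of the two construction paths (illustrative only; note that both frames report object dtype for the string column, so the dtype alone does not explain the difference):

import csv
import pandas as pd
from io import StringIO

data = "join_type\nexplicit\nexplicit\n"

# Path 1: csv.DictReader -> DataFrame (the path that leaks in the repro below)
rows = list(csv.DictReader(StringIO(data)))
df_dict = pd.DataFrame(rows)

# Path 2: pd.read_csv (no leak observed per the note above)
df_csv = pd.read_csv(StringIO(data))

print(df_dict.dtypes)  # join_type: object
print(df_csv.dtypes)   # join_type: object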
Code to reproduce:
import csv
import pandas as pd
import psutil
import sqlalchemy as sa
from io import StringIO
from urllib.parse import quote_plus
from sqlalchemy import (
    text,
    MetaData,
    Table,
    Column,
    Unicode,
)


def create_data_table_single(engine, table_name):
    # Unicode with no length maps to NVARCHAR(max) on SQL Server.
    metadata = MetaData(engine)
    mltable = Table(
        table_name,
        metadata,
        Column("join_type", Unicode, nullable=True),
    )
    try:
        metadata.create_all()
    except Exception:
        print("failed to create table")


def printmem(msg):
    rss = process.memory_info().rss / 1048576
    vms = process.memory_info().vms / 1048576
    print(f"{msg}: rss: {rss:0.1f} MiB, vms: {vms:0.1f} MiB")


def drop_table(conn, table):
    conn.execute(text("DROP TABLE IF EXISTS " + table))


def writedata(filename, num):
    # One header line plus `num` identical single-column rows.
    with open(filename, "w") as f:
        f.write("join_type\n")
        for i in range(num):
            f.write("explicit\n")


database_connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=tcp:xxx.database.windows.net,1433;Database=database;Uid=xxxx;Pwd=xxxx;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=45;"
engine = sa.create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % quote_plus(database_connection_string),
    fast_executemany=True,
)
process = psutil.Process()
printmem("startup")

filename = "data.txt"
writedata(filename, 10000)
printmem("data written")

table = "dftest1"
with engine.connect() as conn:
    drop_table(conn, table)
    create_data_table_single(engine, table)

for iteration in range(100):
    # Building the frame via csv.DictReader is the path that leaks;
    # see the pd.read_csv variant after this block.
    with open(filename, "r") as f:
        batch = f.read()
    reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader]
    df = pd.DataFrame(rows)
    df.to_sql(table, engine, index=False, if_exists="append")
    printmem(f"iteration {iteration}")
Table creation DDL (this is what the code above creates):
CREATE TABLE [dbo].[dftest1](
[join_type] [nvarchar](max) NULL
)
GO
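Since the report notes that other column types did not leak, one hedged workaround (an untested assumption, not a confirmed fix) is to give the column an explicit length so the table gets nvarchar(255) rather than nvarchar(max):

# Workaround sketch: bounded-length Unicode maps to NVARCHAR(255) instead
# of NVARCHAR(MAX). Assumption based on the report that non-(max) column
# types did not exhibit the leak; 'engine' is the engine from the repro.
metadata = MetaData(engine)
mltable = Table(
    "dftest1",
    metadata,
    Column("join_type", Unicode(255), nullable=True),
)
metadata.create_all()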
Results of execution (steady-state RSS growth is roughly 1.0-1.3 MiB per iteration, with no sign of leveling off):
(base) root@aadcfd1ba406:/opt/app# python testrepo.py
startup: rss: 114.1 MiB, vms: 501.8 MiB
iteration 0: rss: 125.5 MiB, vms: 518.4 MiB
iteration 1: rss: 127.5 MiB, vms: 520.5 MiB
iteration 2: rss: 129.5 MiB, vms: 522.3 MiB
iteration 3: rss: 130.9 MiB, vms: 523.8 MiB
iteration 4: rss: 132.8 MiB, vms: 525.7 MiB
iteration 5: rss: 134.2 MiB, vms: 527.0 MiB
iteration 6: rss: 135.8 MiB, vms: 528.7 MiB
iteration 7: rss: 137.0 MiB, vms: 530.0 MiB
iteration 8: rss: 138.0 MiB, vms: 531.0 MiB
iteration 9: rss: 139.0 MiB, vms: 532.2 MiB
iteration 10: rss: 140.3 MiB, vms: 533.2 MiB
iteration 11: rss: 141.4 MiB, vms: 534.5 MiB
iteration 12: rss: 142.4 MiB, vms: 535.5 MiB
iteration 13: rss: 143.7 MiB, vms: 536.5 MiB
iteration 14: rss: 144.7 MiB, vms: 537.7 MiB
iteration 15: rss: 145.8 MiB, vms: 538.7 MiB
iteration 16: rss: 146.8 MiB, vms: 539.7 MiB
iteration 17: rss: 147.8 MiB, vms: 541.0 MiB
iteration 18: rss: 149.1 MiB, vms: 542.0 MiB
iteration 19: rss: 150.1 MiB, vms: 543.2 MiB
iteration 20: rss: 151.4 MiB, vms: 544.2 MiB
iteration 21: rss: 152.5 MiB, vms: 545.2 MiB
iteration 22: rss: 153.5 MiB, vms: 546.5 MiB
iteration 23: rss: 154.5 MiB, vms: 547.5 MiB
iteration 24: rss: 155.8 MiB, vms: 548.5 MiB
iteration 25: rss: 156.8 MiB, vms: 549.7 MiB
iteration 26: rss: 157.9 MiB, vms: 550.7 MiB
iteration 27: rss: 158.9 MiB, vms: 552.0 MiB
iteration 28: rss: 160.2 MiB, vms: 553.0 MiB
iteration 29: rss: 161.2 MiB, vms: 554.0 MiB
iteration 30: rss: 162.3 MiB, vms: 555.2 MiB
iteration 31: rss: 163.5 MiB, vms: 556.4 MiB
iteration 32: rss: 164.6 MiB, vms: 557.7 MiB
iteration 33: rss: 165.6 MiB, vms: 558.7 MiB
iteration 34: rss: 166.6 MiB, vms: 559.7 MiB
iteration 35: rss: 167.9 MiB, vms: 560.9 MiB
iteration 36: rss: 169.0 MiB, vms: 561.9 MiB
iteration 37: rss: 170.0 MiB, vms: 562.9 MiB
iteration 38: rss: 171.0 MiB, vms: 564.2 MiB
iteration 39: rss: 172.3 MiB, vms: 565.2 MiB
iteration 40: rss: 173.3 MiB, vms: 566.4 MiB
iteration 41: rss: 174.4 MiB, vms: 567.4 MiB
iteration 42: rss: 175.4 MiB, vms: 568.4 MiB
iteration 43: rss: 176.7 MiB, vms: 569.7 MiB
iteration 44: rss: 177.7 MiB, vms: 570.7 MiB
iteration 45: rss: 178.8 MiB, vms: 571.7 MiB
iteration 46: rss: 180.0 MiB, vms: 572.9 MiB
iteration 47: rss: 181.1 MiB, vms: 573.9 MiB
iteration 48: rss: 182.1 MiB, vms: 575.2 MiB
iteration 49: rss: 183.1 MiB, vms: 576.2 MiB
iteration 50: rss: 184.4 MiB, vms: 577.2 MiB
iteration 51: rss: 185.5 MiB, vms: 578.4 MiB
iteration 52: rss: 186.5 MiB, vms: 579.4 MiB
iteration 53: rss: 187.7 MiB, vms: 580.4 MiB
iteration 54: rss: 188.8 MiB, vms: 581.7 MiB
iteration 55: rss: 189.8 MiB, vms: 582.7 MiB
iteration 56: rss: 190.8 MiB, vms: 583.9 MiB
iteration 57: rss: 192.1 MiB, vms: 584.9 MiB
iteration 58: rss: 193.1 MiB, vms: 585.9 MiB
iteration 59: rss: 194.2 MiB, vms: 587.2 MiB
iteration 60: rss: 195.2 MiB, vms: 588.2 MiB
iteration 61: rss: 196.5 MiB, vms: 589.4 MiB
iteration 62: rss: 197.5 MiB, vms: 590.4 MiB
iteration 63: rss: 198.6 MiB, vms: 591.4 MiB
iteration 64: rss: 199.6 MiB, vms: 592.7 MiB
iteration 65: rss: 200.9 MiB, vms: 593.7 MiB
iteration 66: rss: 201.9 MiB, vms: 594.7 MiB
iteration 67: rss: 202.9 MiB, vms: 595.9 MiB
iteration 68: rss: 204.0 MiB, vms: 596.9 MiB
iteration 69: rss: 205.3 MiB, vms: 598.2 MiB
iteration 70: rss: 206.3 MiB, vms: 599.2 MiB
iteration 71: rss: 207.3 MiB, vms: 600.2 MiB
iteration 72: rss: 208.4 MiB, vms: 601.4 MiB
iteration 73: rss: 209.4 MiB, vms: 602.4 MiB
iteration 74: rss: 210.7 MiB, vms: 603.4 MiB
iteration 75: rss: 211.8 MiB, vms: 604.7 MiB
iteration 76: rss: 212.8 MiB, vms: 605.7 MiB
iteration 77: rss: 213.8 MiB, vms: 606.9 MiB
iteration 78: rss: 215.1 MiB, vms: 608.1 MiB
iteration 79: rss: 216.2 MiB, vms: 609.1 MiB
iteration 80: rss: 217.2 MiB, vms: 610.3 MiB
iteration 81: rss: 218.5 MiB, vms: 611.3 MiB
iteration 82: rss: 219.5 MiB, vms: 612.6 MiB
iteration 83: rss: 220.5 MiB, vms: 613.6 MiB
iteration 84: rss: 221.6 MiB, vms: 614.6 MiB
iteration 85: rss: 222.9 MiB, vms: 615.8 MiB
iteration 86: rss: 223.9 MiB, vms: 616.8 MiB
iteration 87: rss: 224.9 MiB, vms: 617.8 MiB
iteration 88: rss: 226.2 MiB, vms: 619.1 MiB
iteration 89: rss: 227.2 MiB, vms: 620.1 MiB
iteration 90: rss: 228.3 MiB, vms: 621.3 MiB
iteration 91: rss: 229.3 MiB, vms: 622.3 MiB
iteration 92: rss: 230.6 MiB, vms: 623.3 MiB
iteration 93: rss: 231.6 MiB, vms: 624.6 MiB
iteration 94: rss: 232.7 MiB, vms: 625.6 MiB
iteration 95: rss: 233.7 MiB, vms: 626.6 MiB
iteration 96: rss: 235.0 MiB, vms: 627.8 MiB
iteration 97: rss: 236.0 MiB, vms: 628.8 MiB
iteration 98: rss: 237.0 MiB, vms: 630.1 MiB
iteration 99: rss: 238.1 MiB, vms: 631.1 MiB
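Back-of-the-envelope check on the numbers above: about 1.1 MiB of RSS growth per iteration of 10,000 rows works out to 1.1 * 1048576 / 10000 ≈ 115 bytes per row, the right order of magnitude for one small leaked Python object per cell, which would be consistent with the reference-counting analysis from the comments below.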
From the comments (maintainer analysis)
The Py_INCREF(cell) call here looks suspicious, since it is encoded, not cell, that gets stored in pParam. It appears to be a copy-paste from the neighboring code path where cell itself is passed into pParam. Commenting out the Py_INCREF(cell); line does not completely stop the leak, but it does slow it down considerably.
Before patch:
After patch:
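For context, a hedged reconstruction of the pattern described above (illustrative only, not verbatim pyodbc source; the encoding arguments and the field name pObject are assumptions):

/* Illustrative sketch, not verbatim pyodbc source. 'encoded' (a new
 * reference) is what actually gets stored in the parameter slot, yet the
 * INCREF is applied to 'cell'. cell's refcount is bumped without a paired
 * DECREF for this slot, so row values accumulate across batches. */
PyObject *encoded = PyUnicode_AsEncodedString(cell, "utf-16-le", "strict");
Py_INCREF(cell);            /* suspicious: increments the wrong object */
pParam->pObject = encoded;  /* ownership of 'encoded' transfers here */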