
memory leak inserting with df.to_sql using nvarchar(max) and fastexecutemany

See original GitHub issue

Environment

  • Python: 3.8.5
  • pyodbc: 4.0.34; pandas: 1.4.2; SQLAlchemy: 1.4.32
  • OS: Ubuntu 20.04 (WSL2) and in a Docker container
  • DB: Azure SQL Database
  • Driver: ODBC Driver 18 for SQL Server (same behavior with 17)

Issue

Using nvarchar(max) with fast_executemany, a memory leak is observed. There was a similar leak in the past (issue #854, fixed by PR #832), but this looks to be a different issue.

Turning off fast_executemany, or using turbodbc instead, makes the leak disappear. Other column types were also tried and no leak was observed.
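For reference, disabling fast_executemany is just a flag on the engine; this sketch uses placeholder credentials and is guarded so it runs even without pyodbc installed:

```python
from urllib.parse import quote_plus

# Placeholder connection string; substitute your own server and credentials.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:example.database.windows.net,1433;Database=database;"
    "Uid=user;Pwd=secret;Encrypt=yes;TrustServerCertificate=no;"
)
url = "mssql+pyodbc:///?odbc_connect=%s" % quote_plus(conn_str)

try:
    import sqlalchemy as sa

    # fast_executemany=False (the default) avoids the leak, at the cost of
    # one round trip per row instead of a batched parameter array.
    engine = sa.create_engine(url, fast_executemany=False)
except ModuleNotFoundError:
    pass  # sqlalchemy/pyodbc not installed; the URL above is the point
```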

Note: whether the leak occurs depends on how the DataFrame is created before the call to to_sql. If the DataFrame is built from csv.DictReader() or pd.read_sql(), the leak occurs; a DataFrame created with pd.read_csv() does not leak.
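Per that note, only the DataFrame construction needs to change; a non-leaking variant of the loop body builds the frame with pd.read_csv (sketch, with the file contents inlined so it is self-contained):

```python
from io import StringIO

import pandas as pd

# Same content writedata() produces, inlined for the sketch (3 rows here).
data = "join_type\n" + "explicit\n" * 3

# Building the frame with pd.read_csv instead of csv.DictReader was
# observed not to leak.
df = pd.read_csv(StringIO(data))

# The insert call itself is unchanged:
# df.to_sql(table, engine, index=False, if_exists="append")
```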

Code to reproduce:

import os
import pandas as pd
import psutil
import sqlalchemy as sa
import string
from urllib.parse import quote_plus
from datetime import datetime, timedelta
from sqlalchemy import (
    create_engine,
    text,
    MetaData,
    Table,
    Column,
    Unicode,
)
import csv
from io import StringIO

def create_data_table_single(engine, table_name):
    metadata = MetaData(engine)
    mltable = Table(
        table_name,
        metadata,
        Column("join_type", Unicode, nullable=True),
    )
    try:
        metadata.create_all()
    except Exception as e:
        print(f"failed to create table: {e}")

def printmem(msg):
    rss = process.memory_info().rss / 1048576
    vms = process.memory_info().vms / 1048576
    print(f"{msg}: rss: {rss:0.1f} MiB, vms: {vms:0.1f} MiB")

def drop_table(conn, table):
    conn.execute("DROP TABLE IF EXISTS " + table)

def writedata(filename, num):
    # Use a context manager so the file is flushed and closed before reading.
    with open(filename, "w") as f:
        f.write("join_type\n")
        for i in range(num):
            f.write("explicit\n")

database_connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=tcp:xxx.database.windows.net,1433;Database=database;Uid=xxxx;Pwd=xxxx;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=45;"
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % quote_plus(database_connection_string), fast_executemany=True)
process = psutil.Process()

printmem("startup")
filename = "data.txt"
writedata(filename, 10000)
printmem("data written")
table="dftest1"
with engine.connect() as conn:
    drop_table(conn, table)
create_data_table_single(engine, table)

for iteration in range(100):
    with open(filename, "r") as f:
        batch = f.read()
    reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader]
    df = pd.DataFrame(rows)
    df.to_sql(table, engine, index=False, if_exists="append")
    printmem(f"iteration {iteration}")

Table creation DDL (the code above generates this):

CREATE TABLE [dbo].[dftest1](
	[join_type] [nvarchar](max) NULL
)
GO
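SQLAlchemy's Unicode with no length renders as nvarchar(max) on SQL Server, which is what triggers the leaking code path. If the data allows it, a length-limited column sidesteps that path; the limit of 255 below is a hypothetical choice for illustration. In the Table definition above this would be Column("join_type", Unicode(255)); when to_sql creates the table itself, the equivalent is the dtype= argument:

```python
import sqlalchemy as sa

# Assumption: 255 characters is enough for this column; pick a limit that
# fits your data. NVARCHAR with a length renders as nvarchar(255) rather
# than nvarchar(max).
dtype_map = {"join_type": sa.types.NVARCHAR(length=255)}

# Usage (same call as the repro, plus dtype=):
# df.to_sql(table, engine, index=False, if_exists="append", dtype=dtype_map)
```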

Results of execution:

(base) root@aadcfd1ba406:/opt/app# python testrepo.py
startup: rss: 114.1 MiB, vms: 501.8 MiB
iteration 0: rss: 125.5 MiB, vms: 518.4 MiB
iteration 1: rss: 127.5 MiB, vms: 520.5 MiB
iteration 2: rss: 129.5 MiB, vms: 522.3 MiB
iteration 3: rss: 130.9 MiB, vms: 523.8 MiB
iteration 4: rss: 132.8 MiB, vms: 525.7 MiB
iteration 5: rss: 134.2 MiB, vms: 527.0 MiB
iteration 6: rss: 135.8 MiB, vms: 528.7 MiB
iteration 7: rss: 137.0 MiB, vms: 530.0 MiB
iteration 8: rss: 138.0 MiB, vms: 531.0 MiB
iteration 9: rss: 139.0 MiB, vms: 532.2 MiB
iteration 10: rss: 140.3 MiB, vms: 533.2 MiB
iteration 11: rss: 141.4 MiB, vms: 534.5 MiB
iteration 12: rss: 142.4 MiB, vms: 535.5 MiB
iteration 13: rss: 143.7 MiB, vms: 536.5 MiB
iteration 14: rss: 144.7 MiB, vms: 537.7 MiB
iteration 15: rss: 145.8 MiB, vms: 538.7 MiB
iteration 16: rss: 146.8 MiB, vms: 539.7 MiB
iteration 17: rss: 147.8 MiB, vms: 541.0 MiB
iteration 18: rss: 149.1 MiB, vms: 542.0 MiB
iteration 19: rss: 150.1 MiB, vms: 543.2 MiB
iteration 20: rss: 151.4 MiB, vms: 544.2 MiB
iteration 21: rss: 152.5 MiB, vms: 545.2 MiB
iteration 22: rss: 153.5 MiB, vms: 546.5 MiB
iteration 23: rss: 154.5 MiB, vms: 547.5 MiB
iteration 24: rss: 155.8 MiB, vms: 548.5 MiB
iteration 25: rss: 156.8 MiB, vms: 549.7 MiB
iteration 26: rss: 157.9 MiB, vms: 550.7 MiB
iteration 27: rss: 158.9 MiB, vms: 552.0 MiB
iteration 28: rss: 160.2 MiB, vms: 553.0 MiB
iteration 29: rss: 161.2 MiB, vms: 554.0 MiB
iteration 30: rss: 162.3 MiB, vms: 555.2 MiB
iteration 31: rss: 163.5 MiB, vms: 556.4 MiB
iteration 32: rss: 164.6 MiB, vms: 557.7 MiB
iteration 33: rss: 165.6 MiB, vms: 558.7 MiB
iteration 34: rss: 166.6 MiB, vms: 559.7 MiB
iteration 35: rss: 167.9 MiB, vms: 560.9 MiB
iteration 36: rss: 169.0 MiB, vms: 561.9 MiB
iteration 37: rss: 170.0 MiB, vms: 562.9 MiB
iteration 38: rss: 171.0 MiB, vms: 564.2 MiB
iteration 39: rss: 172.3 MiB, vms: 565.2 MiB
iteration 40: rss: 173.3 MiB, vms: 566.4 MiB
iteration 41: rss: 174.4 MiB, vms: 567.4 MiB
iteration 42: rss: 175.4 MiB, vms: 568.4 MiB
iteration 43: rss: 176.7 MiB, vms: 569.7 MiB
iteration 44: rss: 177.7 MiB, vms: 570.7 MiB
iteration 45: rss: 178.8 MiB, vms: 571.7 MiB
iteration 46: rss: 180.0 MiB, vms: 572.9 MiB
iteration 47: rss: 181.1 MiB, vms: 573.9 MiB
iteration 48: rss: 182.1 MiB, vms: 575.2 MiB
iteration 49: rss: 183.1 MiB, vms: 576.2 MiB
iteration 50: rss: 184.4 MiB, vms: 577.2 MiB
iteration 51: rss: 185.5 MiB, vms: 578.4 MiB
iteration 52: rss: 186.5 MiB, vms: 579.4 MiB
iteration 53: rss: 187.7 MiB, vms: 580.4 MiB
iteration 54: rss: 188.8 MiB, vms: 581.7 MiB
iteration 55: rss: 189.8 MiB, vms: 582.7 MiB
iteration 56: rss: 190.8 MiB, vms: 583.9 MiB
iteration 57: rss: 192.1 MiB, vms: 584.9 MiB
iteration 58: rss: 193.1 MiB, vms: 585.9 MiB
iteration 59: rss: 194.2 MiB, vms: 587.2 MiB
iteration 60: rss: 195.2 MiB, vms: 588.2 MiB
iteration 61: rss: 196.5 MiB, vms: 589.4 MiB
iteration 62: rss: 197.5 MiB, vms: 590.4 MiB
iteration 63: rss: 198.6 MiB, vms: 591.4 MiB
iteration 64: rss: 199.6 MiB, vms: 592.7 MiB
iteration 65: rss: 200.9 MiB, vms: 593.7 MiB
iteration 66: rss: 201.9 MiB, vms: 594.7 MiB
iteration 67: rss: 202.9 MiB, vms: 595.9 MiB
iteration 68: rss: 204.0 MiB, vms: 596.9 MiB
iteration 69: rss: 205.3 MiB, vms: 598.2 MiB
iteration 70: rss: 206.3 MiB, vms: 599.2 MiB
iteration 71: rss: 207.3 MiB, vms: 600.2 MiB
iteration 72: rss: 208.4 MiB, vms: 601.4 MiB
iteration 73: rss: 209.4 MiB, vms: 602.4 MiB
iteration 74: rss: 210.7 MiB, vms: 603.4 MiB
iteration 75: rss: 211.8 MiB, vms: 604.7 MiB
iteration 76: rss: 212.8 MiB, vms: 605.7 MiB
iteration 77: rss: 213.8 MiB, vms: 606.9 MiB
iteration 78: rss: 215.1 MiB, vms: 608.1 MiB
iteration 79: rss: 216.2 MiB, vms: 609.1 MiB
iteration 80: rss: 217.2 MiB, vms: 610.3 MiB
iteration 81: rss: 218.5 MiB, vms: 611.3 MiB
iteration 82: rss: 219.5 MiB, vms: 612.6 MiB
iteration 83: rss: 220.5 MiB, vms: 613.6 MiB
iteration 84: rss: 221.6 MiB, vms: 614.6 MiB
iteration 85: rss: 222.9 MiB, vms: 615.8 MiB
iteration 86: rss: 223.9 MiB, vms: 616.8 MiB
iteration 87: rss: 224.9 MiB, vms: 617.8 MiB
iteration 88: rss: 226.2 MiB, vms: 619.1 MiB
iteration 89: rss: 227.2 MiB, vms: 620.1 MiB
iteration 90: rss: 228.3 MiB, vms: 621.3 MiB
iteration 91: rss: 229.3 MiB, vms: 622.3 MiB
iteration 92: rss: 230.6 MiB, vms: 623.3 MiB
iteration 93: rss: 231.6 MiB, vms: 624.6 MiB
iteration 94: rss: 232.7 MiB, vms: 625.6 MiB
iteration 95: rss: 233.7 MiB, vms: 626.6 MiB
iteration 96: rss: 235.0 MiB, vms: 627.8 MiB
iteration 97: rss: 236.0 MiB, vms: 628.8 MiB
iteration 98: rss: 237.0 MiB, vms: 630.1 MiB
iteration 99: rss: 238.1 MiB, vms: 631.1 MiB
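The log shows steady, linear growth; a quick calculation over the first and last lines (values copied from the output above) puts the leak at roughly 1.1 MiB of RSS per 10,000-row batch:

```python
# First and last printmem() lines, copied verbatim from the run above.
first = "iteration 0: rss: 125.5 MiB, vms: 518.4 MiB"
last = "iteration 99: rss: 238.1 MiB, vms: 631.1 MiB"

def rss_of(line):
    # Pull the rss value (in MiB) out of a printmem() line.
    return float(line.split("rss:")[1].split("MiB")[0])

per_iter = (rss_of(last) - rss_of(first)) / 99
print(f"~{per_iter:.2f} MiB per 10,000-row batch")
```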

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
nsavoire commented, Sep 21, 2022

The Py_INCREF(cell) here looks suspicious, since we are storing encoded, not cell, in pParam:

// DAE
DAEParam *pParam = (DAEParam*)*outbuf;
Py_INCREF(cell);
pParam->cell = encoded.Detach();

This looks like a copy-paste from here, where cell is what actually gets passed into pParam:

Py_INCREF(cell);
DAEParam *pParam = (DAEParam*)*outbuf;
pParam->cell = cell;
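The effect of such an unbalanced Py_INCREF can be demonstrated from pure Python via ctypes (CPython only): the extra reference is never released, so the object can never be freed.

```python
import ctypes
import sys

obj = object()
before = sys.getrefcount(obj)

# Simulate an INCREF with no matching DECREF, as the suspect line would
# do for every DAE parameter.
ctypes.pythonapi.Py_IncRef(ctypes.py_object(obj))
after = sys.getrefcount(obj)
print(after - before)  # the stray reference is now visible

# Balance the count so this demo itself does not leak.
ctypes.pythonapi.Py_DecRef(ctypes.py_object(obj))
```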
0 reactions
gordthompson commented, Sep 22, 2022

Commenting out the Py_INCREF(cell); line, i.e.,

@@ -344,11 +344,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
             len = PyBytes_GET_SIZE(encoded);
             if (!pi->ColumnSize)
             {
                 // DAE
                 DAEParam *pParam = (DAEParam*)*outbuf;
-                Py_INCREF(cell);
+                // Py_INCREF(cell);
                 pParam->cell = encoded.Detach();
                 pParam->maxlen = cur->cnxn->GetMaxLength(pi->ValueType);
                 *outbuf += sizeof(DAEParam);
                 ind = cur->cnxn->need_long_data_len ? SQL_LEN_DATA_AT_EXEC((SQLLEN)len) : SQL_DATA_AT_EXEC;
             }

does not completely stop the leak, but it does slow it down considerably.

Before patch:

iteration 0: rss: 81.0 MiB, vms: 275.7 MiB
iteration 1: rss: 84.0 MiB, vms: 278.5 MiB
iteration 2: rss: 85.7 MiB, vms: 280.4 MiB
iteration 3: rss: 87.5 MiB, vms: 282.2 MiB
iteration 4: rss: 89.1 MiB, vms: 283.7 MiB
iteration 5: rss: 90.7 MiB, vms: 285.4 MiB
iteration 6: rss: 91.7 MiB, vms: 286.6 MiB
iteration 7: rss: 93.0 MiB, vms: 287.8 MiB
iteration 8: rss: 94.0 MiB, vms: 288.8 MiB
iteration 9: rss: 95.1 MiB, vms: 289.8 MiB
iteration 10: rss: 96.1 MiB, vms: 291.1 MiB
iteration 11: rss: 97.4 MiB, vms: 292.1 MiB
iteration 12: rss: 98.4 MiB, vms: 293.1 MiB
iteration 13: rss: 99.5 MiB, vms: 294.3 MiB
iteration 14: rss: 100.5 MiB, vms: 295.3 MiB
iteration 15: rss: 101.8 MiB, vms: 296.6 MiB
iteration 16: rss: 102.8 MiB, vms: 297.6 MiB
iteration 17: rss: 103.8 MiB, vms: 298.6 MiB
iteration 18: rss: 104.9 MiB, vms: 299.8 MiB
iteration 19: rss: 106.2 MiB, vms: 300.8 MiB
…
iteration 80: rss: 173.0 MiB, vms: 367.8 MiB
iteration 81: rss: 174.3 MiB, vms: 369.0 MiB
iteration 82: rss: 175.3 MiB, vms: 370.0 MiB
iteration 83: rss: 176.3 MiB, vms: 371.0 MiB
iteration 84: rss: 177.4 MiB, vms: 372.3 MiB
iteration 85: rss: 178.9 MiB, vms: 373.6 MiB
iteration 86: rss: 179.9 MiB, vms: 374.9 MiB
iteration 87: rss: 181.0 MiB, vms: 375.9 MiB
iteration 88: rss: 182.3 MiB, vms: 376.9 MiB
iteration 89: rss: 183.3 MiB, vms: 378.1 MiB
iteration 90: rss: 184.3 MiB, vms: 379.1 MiB
iteration 91: rss: 185.6 MiB, vms: 380.1 MiB
iteration 92: rss: 186.7 MiB, vms: 381.4 MiB
iteration 93: rss: 187.8 MiB, vms: 382.4 MiB
iteration 94: rss: 188.8 MiB, vms: 383.6 MiB
iteration 95: rss: 189.8 MiB, vms: 384.6 MiB
iteration 96: rss: 191.1 MiB, vms: 385.6 MiB
iteration 97: rss: 192.1 MiB, vms: 386.9 MiB
iteration 98: rss: 193.2 MiB, vms: 387.9 MiB
iteration 99: rss: 194.4 MiB, vms: 388.9 MiB

After patch:

iteration 0: rss: 80.9 MiB, vms: 275.7 MiB
iteration 1: rss: 81.8 MiB, vms: 276.5 MiB
iteration 2: rss: 83.2 MiB, vms: 277.9 MiB
iteration 3: rss: 84.7 MiB, vms: 279.4 MiB
iteration 4: rss: 85.1 MiB, vms: 279.9 MiB
iteration 5: rss: 85.1 MiB, vms: 279.9 MiB
iteration 6: rss: 85.3 MiB, vms: 280.1 MiB
iteration 7: rss: 85.2 MiB, vms: 280.1 MiB
iteration 8: rss: 85.2 MiB, vms: 280.1 MiB
iteration 9: rss: 85.2 MiB, vms: 280.1 MiB
iteration 10: rss: 85.2 MiB, vms: 280.1 MiB
iteration 11: rss: 85.2 MiB, vms: 280.1 MiB
iteration 12: rss: 85.2 MiB, vms: 280.1 MiB
iteration 13: rss: 85.2 MiB, vms: 280.1 MiB
iteration 14: rss: 85.5 MiB, vms: 280.1 MiB
iteration 15: rss: 85.4 MiB, vms: 280.3 MiB
iteration 16: rss: 85.4 MiB, vms: 280.5 MiB
iteration 17: rss: 85.4 MiB, vms: 280.5 MiB
iteration 18: rss: 85.4 MiB, vms: 280.5 MiB
iteration 19: rss: 85.7 MiB, vms: 280.5 MiB
…
iteration 80: rss: 86.3 MiB, vms: 281.2 MiB
iteration 81: rss: 86.3 MiB, vms: 281.2 MiB
iteration 82: rss: 86.3 MiB, vms: 281.2 MiB
iteration 83: rss: 86.3 MiB, vms: 281.2 MiB
iteration 84: rss: 86.3 MiB, vms: 281.2 MiB
iteration 85: rss: 86.3 MiB, vms: 281.2 MiB
iteration 86: rss: 86.3 MiB, vms: 281.2 MiB
iteration 87: rss: 86.3 MiB, vms: 281.2 MiB
iteration 88: rss: 86.3 MiB, vms: 281.2 MiB
iteration 89: rss: 86.3 MiB, vms: 281.2 MiB
iteration 90: rss: 86.3 MiB, vms: 281.2 MiB
iteration 91: rss: 86.3 MiB, vms: 281.2 MiB
iteration 92: rss: 86.3 MiB, vms: 281.4 MiB
iteration 93: rss: 86.3 MiB, vms: 281.4 MiB
iteration 94: rss: 86.6 MiB, vms: 281.4 MiB
iteration 95: rss: 86.6 MiB, vms: 281.4 MiB
iteration 96: rss: 86.6 MiB, vms: 281.4 MiB
iteration 97: rss: 86.6 MiB, vms: 281.4 MiB
iteration 98: rss: 86.6 MiB, vms: 281.4 MiB
iteration 99: rss: 86.6 MiB, vms: 281.4 MiB