
Reading table with chunksize still pumps the memory

See original GitHub issue

I’m trying to migrate database tables from MySQL to SQL Server:

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen")
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

for table_name in ['topics', 'fiction', 'compact']:
    # Read the source table in chunks of 100,000 rows...
    for table in pd.read_sql_query('SELECT * FROM %s' % table_name,
                                   my_engine,
                                   chunksize=100000):
        # ...and append each chunk to the target table.
        table.to_sql(name=table_name, con=ms_engine, if_exists='append')

I thought that using chunksize would keep memory usage bounded, but it just keeps growing. I also tried the garbage collector, but it had no effect.

Maybe my expectations were wrong?

I’m using Python 3.5.1 with pandas 0.17.1 and all the latest packages, although I also tried Python 2.7 with pandas 0.16, with the same results.

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

27 reactions
alfonsomhc commented, Jun 28, 2017

I see that server-side cursors are supported in SQLAlchemy now (new in version 1.1.4): http://docs.sqlalchemy.org/en/latest/dialects/mysql.html#server-side-cursors

I have verified that

engine = create_engine('mysql+pymysql://user:password@domain/database', server_side_cursors=True)
result = engine.execute(sql_query)
result.fetchone()

returns a row immediately (i.e. the client doesn’t read the complete table into memory). This should be useful for letting read_sql read in chunks and avoid memory problems. Passing the chunk size to fetchmany, i.e. result.fetchmany(chunk), should do the trick?
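For reference, a minimal sketch of how that might fit together (the stream_table helper, connection string, and default chunk size are illustrative, not from the thread), using the SQLAlchemy 1.x engine.execute() style shown above:

import pandas as pd
from sqlalchemy import create_engine

# SQLAlchemy 1.x style, per the snippet above. server_side_cursors=True
# makes the MySQL driver stream rows from the server instead of
# buffering the whole result set client-side.
engine = create_engine('mysql+pymysql://user:password@domain/database',
                       server_side_cursors=True)

def stream_table(sql_query, chunk=100000):
    # Illustrative helper (not from the thread): yield DataFrames of at
    # most `chunk` rows each, so only one chunk is in memory at a time.
    result = engine.execute(sql_query)
    columns = list(result.keys())
    while True:
        rows = result.fetchmany(chunk)
        if not rows:
            break
        yield pd.DataFrame(rows, columns=columns)

Each yielded DataFrame could then be passed to to_sql, as in the original snippet.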

16 reactions
klonuo commented, Feb 14, 2016

Several days later, for reference…

Alembic was too complicated for my concentration. I tried the FME and Navicat apps: while the latter didn’t manage to migrate all tables through its “Data transfer” feature, the former migrated successfully, but although the MySQL tables were encoded in UTF-8, it didn’t use the nvarchar data type on SQL Server, so I got records with garbage characters. On top of that, no indexes were preserved.

So I used Python (^_^):

#!/usr/bin/env python3

import pandas as pd
from sqlalchemy import create_engine

my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen?charset=utf8")
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

chunksize = 10000
for table_name in ['topics', 'fiction', 'compact']:

    # Total row count, used to work out how many pages to fetch.
    row_count = int(pd.read_sql('SELECT COUNT(*) FROM {table_name}'.format(
        table_name=table_name), my_engine).values)

    # Page through the table with MySQL's LIMIT offset, count syntax,
    # so only one chunk is held in memory at a time.
    for i in range(int(row_count / chunksize) + 1):
        query = 'SELECT * FROM {table_name} LIMIT {offset}, {chunksize}'.format(
            table_name=table_name, offset=i * chunksize, chunksize=chunksize)

        pd.read_sql_query(query, con=my_engine).to_sql(
            name=table_name, con=ms_engine, if_exists='append', index=False)
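For what it’s worth, once the server-side cursor support mentioned in the comment above is available (SQLAlchemy 1.1.4+), the two approaches could presumably be combined so that read_sql_query’s chunksize actually streams, with no manual LIMIT paging. A sketch under that assumption, untested, reusing the connection strings from above:

import pandas as pd
from sqlalchemy import create_engine

# Assumption: with server_side_cursors=True, pymysql streams rows from
# the server, so read_sql_query's chunksize keeps memory bounded.
my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen?charset=utf8",
                          server_side_cursors=True)
ms_engine = create_engine('mssql+pyodbc://localhost/gen?driver=SQL Server')

for table_name in ['topics', 'fiction', 'compact']:
    for chunk in pd.read_sql_query('SELECT * FROM {}'.format(table_name),
                                   my_engine, chunksize=10000):
        chunk.to_sql(name=table_name, con=ms_engine,
                     if_exists='append', index=False)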