Bulk upsert - avoiding duplicates with the ORM on MSSQL
Hello,
I am currently reflecting an MSSQL table and doing a bulk insert into it. My problem is that I need to check for and avoid duplicates, and I don't know how to do that in SQLAlchemy syntax.
To give you some context: the SQL table holds time series data for a mortgage lender and contains 57 columns, including things such as "Loan ID", "Observation Date", "Month on Book", "Origination Date", "Interest Rate", etc. The key to identifying duplicates is "Month on Book", since there can only be one row per month after the "Origination Date" for each "Loan ID". Each Loan ID has its own "Origination Date" and "Observation Date", but the "Month on Book" encoding is the same for every Loan ID.
For example:
Current date: March 2018
Origination Date: 30 Jan 2018
Observation Date: 30 March 2018
Month on Book: 2 (February would be "1" and January would be "0")
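To be explicit about the encoding, it is just whole-month arithmetic; a minimal illustration (the function name is only for this example):

    from datetime import date

    def month_on_book(origination: date, observation: date) -> int:
        # Whole calendar months between origination and observation;
        # the origination month itself is month 0.
        return ((observation.year - origination.year) * 12
                + (observation.month - origination.month))

    # Matches the example above: Jan 2018 -> 0, Feb 2018 -> 1, Mar 2018 -> 2
    assert month_on_book(date(2018, 1, 30), date(2018, 3, 30)) == 2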
Since the table in MS SQL is already populated, I want to insert new rows while keeping only one record per "Month on Book" (the encoded version of "Observation Date") per Loan ID, so that if someone runs the insert code twice it doesn't duplicate all the data.
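In other words, the behaviour I'm after is something like the sketch below: fetch the (LoanID, MonthOnBook) pairs already in the table and drop matching rows from the DataFrame before the bulk insert. The column names here are stand-ins for the real ones, and I realise a unique index on that pair in SQL Server would still be the real guarantee (this read-then-write approach has a race window if two processes insert at once):

    from sqlalchemy import select

    def drop_existing_rows(engine, table, df):
        # Collect the natural-key pairs already present in the table.
        with engine.connect() as conn:
            existing = {(row[0], row[1]) for row in
                        conn.execute(select([table.c.LoanID, table.c.MonthOnBook]))}
        # Keep only rows whose key is not there yet, so re-running the
        # insert does not duplicate any (LoanID, MonthOnBook) pair.
        keep = df.apply(lambda r: (r["LoanID"], r["MonthOnBook"]) not in existing,
                        axis=1)
        return df[keep]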
This is how I am currently doing the whole insert process, up to the point where the duplicates become a problem (see sqlalchemy_orm_bulk_insert and sqlalchemy_core below). Also, I know that the class I created to standardise the connection/insertion process with SQLAlchemy could be a bit rough, so any suggestions would be more than welcome.
import time
import urllib.parse

import sqlalchemy as sa
from sqlalchemy import MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker


class SQLAlchemy_cashflow():
    def __init__(self, database, user, password, sql_schema, driver, server):
        self.sql_schema = sql_schema
        self.driver = driver
        self.server = server
        self.database = database
        self.user = user
        self.password = password
        # Build a raw ODBC connection string and pass it via odbc_connect.
        params = urllib.parse.quote_plus(
            "Driver={" + str(self.driver) + "};"
            "SERVER=" + str(self.server) + ";"
            "Database=" + str(self.database) + ";"
            "UID=" + str(self.user) + ";"
            "PWD=" + str(self.password) + ";"
            "Trusted_Connection=yes"
        )
        self.engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        # Reflect every table in the target schema onto a declarative Base.
        metadata = MetaData(schema=sql_schema)
        Base = declarative_base(metadata=metadata)
        Base.metadata.reflect(self.engine)
        Session = sessionmaker(bind=self.engine)
        self.metadata = metadata
        self.Base = Base
        self.session = Session()

    def reflection(self, sql_table):
        self.sql_table = sql_table
        # Tables reflected into a schema-bound MetaData are keyed "schema.name".
        key = "%s.%s" % (self.sql_schema, sql_table)

        class MyClass(self.Base):
            __table__ = self.Base.metadata.tables[key]

        table_reflected = Table(sql_table, self.metadata,
                                autoload=True, autoload_with=self.engine)
        return MyClass, table_reflected

    def sqlalchemy_orm_bulk_insert(self, class_object, df):
        t0 = time.time()
        # orient="records" (plural) gives one dict per DataFrame row.
        self.session.bulk_insert_mappings(class_object,
                                          df.to_dict(orient="records"))
        self.session.commit()
        print("SQLAlchemy ORM bulk_insert_mappings(): Total time for " +
              str(len(df)) + " records " + str(time.time() - t0) + " secs")

    def sqlalchemy_core(self, class_object, df):
        t0 = time.time()
        self.engine.execute(class_object.__table__.insert(),
                            df.to_dict(orient="records"))
        print("SQLAlchemy Core: Total time for " + str(len(df)) +
              " records " + str(time.time() - t0) + " secs")
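For reference, this is roughly how the class gets called (the connection details and table name are placeholders, and df is the pandas DataFrame to load):

    cf = SQLAlchemy_cashflow(database="LoanDB", user="me", password="secret",
                             sql_schema="dbo",
                             driver="ODBC Driver 17 for SQL Server",
                             server="MYSERVER")
    LoanRow, loan_table = cf.reflection("cashflows")
    cf.sqlalchemy_orm_bulk_insert(LoanRow, df)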
I’ve done similar bulk actions/migrations a few times, and basically use a variation of Michael’s approach, but I stay away from the bulk operations API and just use multiple processes.
First:
I write a script to iterate over all the source rows in batches of n (1000?) using the “windowed range query” technique (https://github.com/sqlalchemy/sqlalchemy/wiki/WindowedRangeQuery).
I analyze each item to determine whether it is an insert, an update, or a more involved migration, then run the SQL for that item.
Then:
I adapt the script to use Redis as a coordinator that determines the start value for each windowed query. This allows me to spin up multiple processes of the same migration script (rough sketch below). I forget who first suggested this technique; it may have been Simon King.
I usually get an initial 6 processes going before I start testing whether additional processes increase or decrease throughput.
This multi-process approach has turned many (very large) bulk operations from overnight jobs into jobs that finish in under an hour.
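Not my actual code, but the shape of it looks roughly like this. The wiki recipe linked above is the canonical windowed-query implementation; the Redis key, the WINDOW size, and process_item are all illustrative:

    import redis
    from sqlalchemy import func

    WINDOW = 1000
    r = redis.Redis()  # coordinator shared by every worker process

    def window_ranges(session, column):
        # As in the WindowedRangeQuery wiki recipe: number the rows over the
        # ordered key and keep every WINDOW-th value as a range boundary.
        rownum = func.row_number().over(order_by=column).label("rownum")
        subq = session.query(column.label("id"), rownum).subquery()
        bounds = [row.id for row in
                  session.query(subq.c.id).filter(subq.c.rownum % WINDOW == 1)]
        bounds.append(None)  # final range is open-ended
        return list(zip(bounds[:-1], bounds[1:]))

    def migrate(session, Source):
        ranges = window_ranges(session, Source.id)
        while True:
            # Redis INCR atomically hands each process the next unclaimed
            # window, so N copies of this script can run side by side.
            idx = r.incr("migration:cursor") - 1
            if idx >= len(ranges):
                break
            start, end = ranges[idx]
            q = session.query(Source).filter(Source.id >= start)
            if end is not None:
                q = q.filter(Source.id < end)
            for item in q:
                process_item(session, item)  # insert / update / migrate (placeholder)
            session.commit()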
Hello @jvanasco,
Thank you so much for taking the time to write this example, it was really helpful! Much appreciated.