
Bulk upsert - duplicates ORM mssql

See original GitHub issue

Hello,

I am currently reflecting an MSSQL table and doing a bulk insert into it. My problem is that I need to check for and avoid duplicates, and I don't know how to do that in SQLAlchemy syntax.

To give you some context, the SQL table is a time series for a mortgage lender and contains 57 columns, including things such as "Loan ID", "Observation Date", "Month on Book", "Origination Date", "Interest Rate", etc. The key to identifying duplicates is "Month on Book", since there can only be one row for each month after the "Origination Date" per "Loan ID". Each Loan ID has different "Origination Date" and "Observation Date" values, but the "Month on Book" encoding works the same way for every Loan ID.

For example:

Current date: March 2018

Origination Date: 30 Jan 2018

Observation Date: 30 March 2018

Month on Book: 2 (February would be “1” and Jan would be “0”)
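
In other words, "Month on Book" appears to be the whole-month difference between the Observation Date and the Origination Date. A minimal sketch of that encoding, assuming this simple month-difference rule holds:

from datetime import date

def month_on_book(origination_date, observation_date):
    # Whole-month difference between the two dates (assumed encoding):
    # for a 30 Jan 2018 origination, Jan -> 0, Feb -> 1, Mar -> 2.
    return ((observation_date.year - origination_date.year) * 12
            + (observation_date.month - origination_date.month))

print(month_on_book(date(2018, 1, 30), date(2018, 3, 30)))  # -> 2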

Since the table in MS SQL is already populated, I want to insert new rows but keep only one record per "Month on Book" (the encoded version of "Observation Date") per Loan ID, so that if someone runs the insert code twice it doesn't duplicate all the data.

This is how I am currently doing the whole insert process, up to the point where duplicates become a problem (see "sqlalchemy_orm_bulk_insert" or "sqlalchemy_core"). Also, I know that the class I created to standardise the connection and insertion process with SQLAlchemy could be a bit rough, so any suggestions are more than welcome.

import time
import urllib.parse

import sqlalchemy as sa
from sqlalchemy import MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker


class SQLAlchemy_cashflow():
    def __init__(self, database, user, password, sql_schema, driver, server):
        self.sql_schema = sql_schema
        self.driver = driver
        self.server = server
        self.database = database
        self.user = user
        self.password = password

        # Build the raw ODBC connection string and URL-encode it for pyodbc.
        params = urllib.parse.quote_plus(
            "Driver={" + str(self.driver) + "};"
            "SERVER=" + str(self.server) + ";"
            "Database=" + str(self.database) + ";"
            "UID=" + str(self.user) + ";"
            "PWD=" + str(self.password) + ";"
            "Trusted_Connection=yes"
        )
        engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        self.engine = engine

        # Reflect the existing schema and set up an ORM session.
        metadata = MetaData(schema=sql_schema)
        Base = declarative_base(metadata=metadata)
        Base.metadata.reflect(engine)
        Session = sessionmaker(bind=engine)
        session = Session()

        self.metadata = metadata
        self.Base = Base
        self.session = session

    def reflection(self, sql_table):
        self.sql_table = sql_table

        # Map a declarative class onto the already-reflected table.
        class MyClass(self.Base):
            __table__ = self.Base.metadata.tables[self.sql_table]

        table_reflected = Table(sql_table, self.metadata,
                                autoload=True, autoload_with=self.engine)
        return MyClass, table_reflected

    def sqlalchemy_orm_bulk_insert(self, class_object, df):
        t0 = time.time()
        self.session.bulk_insert_mappings(class_object,
                                          df.to_dict(orient="records"))
        self.session.commit()
        print("SQLAlchemy ORM bulk_insert_mappings(): Total time for " + str(len(df)) +
              " records " + str(time.time() - t0) + " secs")

    def sqlalchemy_core(self, class_object, df):
        t0 = time.time()
        self.engine.execute(class_object.__table__.insert(),
                            df.to_dict(orient="records"))
        print("SQLAlchemy Core: Total time for " + str(len(df)) + " records " +
              str(time.time() - t0) + " secs")
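
For illustration only (this is not part of the original class, and "LoanID" / "MonthOnBook" are placeholder column names), one way to avoid the duplicates would be to fetch the existing (Loan ID, Month on Book) pairs and drop matching rows from the DataFrame before the bulk insert:

def drop_existing_rows(session, table_reflected, df):
    # Hypothetical helper: collect the key pairs already stored in the table
    # and keep only DataFrame rows whose key pair is not present yet.
    existing = {
        (loan_id, mob)
        for loan_id, mob in session.query(table_reflected.c.LoanID,
                                          table_reflected.c.MonthOnBook)
    }
    mask = df.apply(
        lambda row: (row["LoanID"], row["MonthOnBook"]) not in existing,
        axis=1,
    )
    return df[mask]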


Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
jvanasco commented, Jan 8, 2020

I’ve done similar bulk actions/migrations a few times, and I basically use a variation of Michael’s approach - but I stay away from the bulk operations API and just use multiple processes.

First:

I write a script to iterate over all the “source” rows in batches of n (1000?) using the ‘windowed range query’ technique (https://github.com/sqlalchemy/sqlalchemy/wiki/WindowedRangeQuery)

I analyze each item to determine if it is an insert, an update, or an advanced migration… then run the SQL for that item.
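
For illustration, a stripped-down sketch of that per-window loop (my own simplification, not the actual script; Source, is_new, build_row and apply_update are hypothetical placeholders, and the window boundaries would come from the windowed range query recipe linked above):

def migrate_window(session, Source, start, end):
    # Process one window [start, end) of source rows, deciding per item
    # whether it needs a plain insert, an update, or something custom.
    q = (session.query(Source)
                .filter(Source.id >= start, Source.id < end)
                .order_by(Source.id))
    for item in q:
        if is_new(item):                      # hypothetical check
            session.add(build_row(item))      # plain insert
        else:
            apply_update(session, item)       # update / advanced migration
    session.commit()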

Then:

I adapt the script to use Redis as a coordinator to determine the start value for the windowed query’s offset. This allows me to spin up multiple processes of the same migration script. I forget who first suggested this technique, it may have been Simon King.
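
A sketch of that coordination step, assuming a shared Redis counter hands out window offsets (the key name, window size and redis-py client are assumptions, not from the comment):

import redis

WINDOW = 1000  # rows per window; made-up value

def claim_window(r):
    # INCRBY is atomic, so each process that calls this gets a distinct,
    # non-overlapping [start, start + WINDOW) range to migrate.
    end = r.incrby("migration:offset", WINDOW)
    return end - WINDOW

r = redis.Redis()
start = claim_window(r)   # 0, 1000, 2000, ... across all processes
# ... run the windowed migration loop above over [start, start + WINDOW)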

I usually get an initial 6 processes going before I start testing to see if additional processes increase or decrease efficiency.

This multi-process approach has let many (very large) bulk operations finish in under an hour instead of overnight.

0 reactions
felipe0216 commented, Jan 17, 2020

Hello @jvanasco,

Thank you so much for taking the time to write this example, it was really helpful! Much appreciated.

Read more comments on GitHub >

Top Results From Across the Web

  • Dynamic Bulk Insert with duplicate record check - Stack Overflow
    Does the bulk insert stop pushing in the remainder of records once a duplicate is found? OR does it keep going and finish...
  • How to Avoid Inserting Duplicate Records in SQL INSERT ...
    This article discusses inserting records from another table or tables using an INSERT INTO SELECT statement without duplicate key errors.
  • Lever T-SQL to handle duplicate rows in SQL Server database ...
    Duplicate rows in a SQL Server database table can become a problem. This article shows how to find and handle those duplicate rows....
  • checking of duplicate entry with BULK INSERT
    I want to do the bulk insert, it first should check the duplicate data in the file and then only one data...
  • Best strategy for bulk insert and ignoring duplicate key ...
    I'm using Telerik Data Access and have a use case where I need to bulk insert thousands of rows (as part of XML...
