Bulk upsert - avoiding duplicates with the ORM on MSSQL
Hello,
I am currently reflecting an MSSQL table and doing a bulk insert into it. My problem is that I need to check for and avoid duplicates, and I don't know how to do that in SQLAlchemy syntax.
To give you some context: the SQL table holds time series data for a mortgage lender and contains 57 columns, including things such as "Loan ID", "Observation Date", "Month on Book", "Origination Date", "Interest Rate", etc. The key to identifying duplicates is "Month on Book", since there can only be one row per month after the "Origination Date" for each "Loan ID". Each Loan ID has its own "Origination Date" and "Observation Date", but the "Month on Book" encoding is the same for every Loan ID.
For example:
Current date: March 2018
Origination Date: 30 Jan 2018
Observation Date: 30 March 2018
Month on Book: 2 (February would be "1" and January would be "0")
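To be explicit about the encoding, it is just whole-month arithmetic; a minimal illustration (the function name is only for this example):

    from datetime import date

    def month_on_book(origination: date, observation: date) -> int:
        # Whole calendar months between origination and observation;
        # the origination month itself is month 0.
        return ((observation.year - origination.year) * 12
                + (observation.month - origination.month))

    # Matches the example above: Jan 2018 -> 0, Feb 2018 -> 1, Mar 2018 -> 2
    assert month_on_book(date(2018, 1, 30), date(2018, 3, 30)) == 2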
Since the table in MS SQL is already populated, I want to insert new rows while keeping only one record per "Month on Book" (the encoded version of "Observation Date") per Loan ID, so that if someone runs the insert code twice it doesn't duplicate all the data.
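In other words, the behaviour I'm after is something like the sketch below: fetch the (LoanID, MonthOnBook) pairs already in the table and drop matching rows from the DataFrame before the bulk insert. The column names here are stand-ins for the real ones, and I realise a unique index on that pair in SQL Server would still be the real guarantee (this read-then-write approach has a race window if two processes insert at once):

    from sqlalchemy import select

    def drop_existing_rows(engine, table, df):
        # Collect the natural-key pairs already present in the table.
        with engine.connect() as conn:
            existing = {(row[0], row[1]) for row in
                        conn.execute(select([table.c.LoanID, table.c.MonthOnBook]))}
        # Keep only rows whose key is not there yet, so re-running the
        # insert does not duplicate any (LoanID, MonthOnBook) pair.
        keep = df.apply(lambda r: (r["LoanID"], r["MonthOnBook"]) not in existing,
                        axis=1)
        return df[keep]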
This is how I am currently doing the whole insert process, up to the point where the duplicates become a problem (see sqlalchemy_orm_bulk_insert and sqlalchemy_core below). Also, I know that the class I created to standardise the connection/insertion process with SQLAlchemy could be a bit rough, so any suggestions would be more than welcome.
import time
import urllib.parse

import sqlalchemy as sa
from sqlalchemy import MetaData, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker


class SQLAlchemy_cashflow():
    def __init__(self, database, user, password, sql_schema, driver, server):
        self.sql_schema = sql_schema
        self.driver = driver
        self.server = server
        self.database = database
        self.user = user
        self.password = password
        # Build a raw ODBC connection string and pass it via odbc_connect.
        params = urllib.parse.quote_plus(
            "Driver={" + str(self.driver) + "};"
            "SERVER=" + str(self.server) + ";"
            "Database=" + str(self.database) + ";"
            "UID=" + str(self.user) + ";"
            "PWD=" + str(self.password) + ";"
            "Trusted_Connection=yes"
        )
        self.engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        # Reflect every table in the target schema onto a declarative Base.
        metadata = MetaData(schema=sql_schema)
        Base = declarative_base(metadata=metadata)
        Base.metadata.reflect(self.engine)
        Session = sessionmaker(bind=self.engine)
        self.metadata = metadata
        self.Base = Base
        self.session = Session()

    def reflection(self, sql_table):
        self.sql_table = sql_table
        # Tables reflected into a schema-bound MetaData are keyed "schema.name".
        key = "%s.%s" % (self.sql_schema, sql_table)

        class MyClass(self.Base):
            __table__ = self.Base.metadata.tables[key]

        table_reflected = Table(sql_table, self.metadata,
                                autoload=True, autoload_with=self.engine)
        return MyClass, table_reflected

    def sqlalchemy_orm_bulk_insert(self, class_object, df):
        t0 = time.time()
        # orient="records" (plural) gives one dict per DataFrame row.
        self.session.bulk_insert_mappings(class_object,
                                          df.to_dict(orient="records"))
        self.session.commit()
        print("SQLAlchemy ORM bulk_insert_mappings(): Total time for " +
              str(len(df)) + " records " + str(time.time() - t0) + " secs")

    def sqlalchemy_core(self, class_object, df):
        t0 = time.time()
        self.engine.execute(class_object.__table__.insert(),
                            df.to_dict(orient="records"))
        print("SQLAlchemy Core: Total time for " + str(len(df)) +
              " records " + str(time.time() - t0) + " secs")
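For reference, this is roughly how the class gets called (the connection details and table name are placeholders, and df is the pandas DataFrame to load):

    cf = SQLAlchemy_cashflow(database="LoanDB", user="me", password="secret",
                             sql_schema="dbo",
                             driver="ODBC Driver 17 for SQL Server",
                             server="MYSERVER")
    LoanRow, loan_table = cf.reflection("cashflows")
    cf.sqlalchemy_orm_bulk_insert(LoanRow, df)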
I’ve done similar bulk actions/migrations a few times, and basically use a variation of Michael’s approach, but I stay away from the bulk operations API and just use multiple processes.
First:
I write a script to iterate over all the source rows in batches of n (1000?) using the “windowed range query” technique (https://github.com/sqlalchemy/sqlalchemy/wiki/WindowedRangeQuery).
I analyze each item to determine whether it is an insert, an update, or a more involved migration, then run the SQL for that item.
Then:
I adapt the script to use Redis as a coordinator that determines the start value for each windowed query. This allows me to spin up multiple processes of the same migration script (rough sketch below). I forget who first suggested this technique; it may have been Simon King.
I usually get an initial 6 processes going before I start testing whether additional processes increase or decrease throughput.
This multi-process approach has turned many (very large) bulk operations from overnight jobs into jobs that finish in under an hour.
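Not my actual code, but the shape of it looks roughly like this. The wiki recipe linked above is the canonical windowed-query implementation; the Redis key, the WINDOW size, and process_item are all illustrative:

    import redis
    from sqlalchemy import func

    WINDOW = 1000
    r = redis.Redis()  # coordinator shared by every worker process

    def window_ranges(session, column):
        # As in the WindowedRangeQuery wiki recipe: number the rows over the
        # ordered key and keep every WINDOW-th value as a range boundary.
        rownum = func.row_number().over(order_by=column).label("rownum")
        subq = session.query(column.label("id"), rownum).subquery()
        bounds = [row.id for row in
                  session.query(subq.c.id).filter(subq.c.rownum % WINDOW == 1)]
        bounds.append(None)  # final range is open-ended
        return list(zip(bounds[:-1], bounds[1:]))

    def migrate(session, Source):
        ranges = window_ranges(session, Source.id)
        while True:
            # Redis INCR atomically hands each process the next unclaimed
            # window, so N copies of this script can run side by side.
            idx = r.incr("migration:cursor") - 1
            if idx >= len(ranges):
                break
            start, end = ranges[idx]
            q = session.query(Source).filter(Source.id >= start)
            if end is not None:
                q = q.filter(Source.id < end)
            for item in q:
                process_item(session, item)  # insert / update / migrate (placeholder)
            session.commit()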
Hello @jvanasco,
Thank you so much for taking the time to write this example, it was really helpful! Much appreciated.