
When using to_sql(), continue if duplicate primary keys are detected?


Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame duplicate rows already in the SQL table; some do not. But to_sql() stops executing entirely, raising an IntegrityError, if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to warn the user which rows had duplicate keys and continue adding the new rows, rather than stopping entirely. Large datasets will often contain duplicates that you simply want to ignore.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
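
Until something like append_skipdupes exists, one workaround is the method parameter of to_sql() (added in pandas 0.24, so newer than the 0.19.2 shown below): it accepts a callable that controls how each chunk of rows is inserted. The following is only a sketch, assuming PostgreSQL and SQLAlchemy; insert_skip_duplicates is a hypothetical helper name, not a pandas API:

from sqlalchemy.dialects.postgresql import insert

# Hypothetical to_sql() `method` hook: skip rows whose primary key
# already exists instead of aborting the whole insert.
def insert_skip_duplicates(pd_table, conn, keys, data_iter):
    rows = [dict(zip(keys, row)) for row in data_iter]
    # pd_table.table is the underlying SQLAlchemy Table object.
    stmt = insert(pd_table.table).values(rows).on_conflict_do_nothing()
    conn.execute(stmt)

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900,
          index=False, method=insert_skip_duplicates)

ON CONFLICT is PostgreSQL syntax; on MySQL the same idea works with INSERT IGNORE (for example, sqlalchemy.dialects.mysql.insert(...).prefix_with('IGNORE')).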

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 417
  • Comments: 36 (5 by maintainers)

Top GitHub Comments

102 reactions
rockg commented, Apr 13, 2017

This should also support an “on duplicate update” mode, as sketched below.
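
That "on duplicate update" behavior can likewise be approximated with a custom method callable that builds MySQL's INSERT ... ON DUPLICATE KEY UPDATE. Again only a sketch, assuming MySQL and SQLAlchemy; upsert_on_duplicate is a hypothetical name:

from sqlalchemy.dialects.mysql import insert

# Hypothetical upsert hook: when the primary key collides, update the
# existing row instead of skipping it or failing.
def upsert_on_duplicate(pd_table, conn, keys, data_iter):
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(pd_table.table).values(rows)
    # stmt.inserted refers to the values that would have been inserted
    # (MySQL's VALUES() construct), so each column is refreshed on conflict.
    stmt = stmt.on_duplicate_key_update({k: stmt.inserted[k] for k in keys})
    conn.execute(stmt)

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900,
          index=False, method=upsert_on_duplicate)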

38 reactions
cgi1 commented, May 14, 2019

append_skipdupes would be the perfect way to handle this.


Top Results From Across the Web

Pandas to_sql fails on duplicate primary key - Stack Overflow
to_sql() function. I set if_exists='append', but my table has primary keys. I'd like to do the equivalent of insert ignore ...
Skip inserting duplicate records when using df.to_sql() - Reddit
Is there a way to have df.to_sql() skip trying to insert duplicates? ... if you violate a unique constraint / primary key...
Why am I getting a primary/unique key violation? - SQL Studies
The statement has been terminated. Note that the name of the unique index that is violated is listed along with the duplicate key...
Batch Load of SQL Database Table with Primary Key violations ...
Continue Batch Load of SQL Database Table with Primary Key violations ... records after duplicates are found and send rejects/duplicates to a ...
MySQL INSERT ON DUPLICATE KEY UPDATE
This tutorial shows you how to use MySQL INSERT ON DUPLICATE KEY UPDATE statement effectively by practical examples.
