When using to_sql(), continue if duplicate primary keys are detected?
Code Sample, a copy-pastable example if possible
# given an existing DataFrame `df` and a SQLAlchemy `engine`
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
Problem description
I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.
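For reference, the failure is easy to reproduce. A minimal sketch using an in-memory SQLite database (the table name `t` and its columns are made up for illustration):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite://')  # in-memory database for illustration
with engine.begin() as conn:
    conn.execute(text('CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)'))

pd.DataFrame({'id': [1], 'val': ['a']}).to_sql(
    't', engine, if_exists='append', index=False)

# This chunk contains one duplicate (id=1) and one new row (id=2); it raises
# sqlalchemy.exc.IntegrityError and neither row is inserted.
pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']}).to_sql(
    't', engine, if_exists='append', index=False)
```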
It would make sense for to_sql(if_exists='append') to warn the user which rows had duplicate keys and continue adding the new rows, rather than stop executing entirely. For large datasets, you will likely have duplicates but want to ignore them. Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
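In the meantime, newer pandas versions (0.24+) accept a `method` callable in to_sql, which can emulate the proposed behaviour. A minimal sketch, assuming PostgreSQL and SQLAlchemy 1.1+ for on_conflict_do_nothing; the function name `insert_skip_duplicates`, the table name, and `engine` are placeholders:

```python
from sqlalchemy.dialects.postgresql import insert

def insert_skip_duplicates(table, conn, keys, data_iter):
    """Insert rows, silently skipping any that hit a unique constraint."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    result = conn.execute(stmt)
    return result.rowcount  # rows actually inserted

df.to_sql('TableNameHere', engine, if_exists='append',
          chunksize=900, index=False, method=insert_skip_duplicates)
```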
Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None
Issue Analytics
- Created: 6 years ago
- Reactions: 417
- Comments: 36 (5 by maintainers)
Top GitHub Comments
This should also support an “on duplicate update” mode.
append_skipdupes would be the perfect way to handle this.
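For the “on duplicate update” case, the same `method` hook can build an upsert. A sketch assuming MySQL, pandas 0.24+, and SQLAlchemy 1.2+ for on_duplicate_key_update; `upsert_on_duplicate` is a made-up name, and the table name and `engine` are placeholders:

```python
from sqlalchemy.dialects.mysql import insert

def upsert_on_duplicate(table, conn, keys, data_iter):
    """INSERT ... ON DUPLICATE KEY UPDATE: update existing rows instead of failing."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows)
    # On a key collision, overwrite each column with the incoming value.
    stmt = stmt.on_duplicate_key_update({k: stmt.inserted[k] for k in keys})
    result = conn.execute(stmt)
    return result.rowcount

df.to_sql('TableNameHere', engine, if_exists='append', index=False,
          method=upsert_on_duplicate)
```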