It is actually compatible with unique constraint

Hi, ThibTrip. The usage instructions say that the pangres.upsert function works on the primary key, which must be set as the index of the DataFrame, and the examples include the comment ‘we don’t want autoincremented PK’. However, I found that it is actually compatible with a unique constraint combined with an auto-incremented PK (at least in MySQL 5.7).

Please take a look.

-- Here the `row_id` is the auto-incremented primary key
-- `order_id` and `product_id` together make up the unique constraint
-- let's say a single order can have more than one kind of product

CREATE TABLE `order_info` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'auto_incremented_ID',
  `order_id` varchar(5) NOT NULL DEFAULT '-9999' COMMENT 'order_id',
  `product_id` varchar(5) NOT NULL DEFAULT '-9999' COMMENT 'product_id',
  `qty` int(11) DEFAULT NULL COMMENT 'purchase_quantity',
  `refund_qty` int(11) DEFAULT NULL COMMENT 'refund_quantity',
  `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last_update_time',
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'first_create_time',
  PRIMARY KEY (`row_id`),
  UNIQUE KEY `main` (`order_id`,`product_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Order Info'

Insert the original values first.

order_id	product_id	qty	refund_qty
A0001	PD100	10	0
A0002	PD200	20	0
A0002	PD201	22	0

import pandas as pd
import pangres

old_data = {'order_id': ['A0001', 'A0002', 'A0002'],
            'product_id': ['PD100', 'PD200', 'PD201'],
            'qty': [10, 20, 22],
            'refund_qty': [0, 0, 0]}
old_df = pd.DataFrame(old_data)
old_df = old_df.set_index(['order_id', 'product_id'])

# let's assume `engine` is an existing SQLAlchemy engine connected to the MySQL database

pangres.upsert(engine=engine,
               df=old_df,
               table_name='order_info',
               if_row_exists='update')

Then we get

row_id	order_id	product_id	qty	refund_qty	update_time	create_time
1	A0001	PD100	10	0	2020-06-28 20:31:04	2020-06-28 20:31:04
2	A0002	PD200	20	0	2020-06-28 20:31:04	2020-06-28 20:31:04
3	A0002	PD201	22	0	2020-06-28 20:31:04	2020-06-28 20:31:04

Next, upsert the new df as below:

order_id	product_id	qty	refund_qty
A0001	PD100	10	0
A0002	PD200	20	0
A0002	PD201	22	2
A0003	PD300	30	0

new_data = {'order_id': ['A0001', 'A0002', 'A0002', 'A0003'],
            'product_id': ['PD100', 'PD200', 'PD201', 'PD300'],
            'qty': [10, 20, 22, 30],
            'refund_qty': [0, 0, 2, 0]}
new_df = pd.DataFrame(new_data)
new_df = new_df.set_index(['order_id', 'product_id'])

pangres.upsert(engine=engine,
               df=new_df,
               table_name='order_info',
               if_row_exists='update')

The result is completely as expected!

row_id	order_id	product_id	qty	refund_qty	update_time	create_time
1	A0001	PD100	10	0	2020-06-28 20:31:04	2020-06-28 20:31:04
2	A0002	PD200	20	0	2020-06-28 20:31:04	2020-06-28 20:31:04
3	A0002	PD201	22	2	2020-06-28 20:37:13	2020-06-28 20:31:04
4	A0003	PD300	30	0	2020-06-28 20:37:13	2020-06-28 20:37:13

The update_time field changed only for the last two records (the row whose refund_qty was updated and the newly inserted row), while the first two rows kept their original timestamps, as they should.

I would suggest adding a description of this feature to the README (nothing has to change in the code). I was so excited to find your repo, which solves the pandas upsert issue so nicely, but then I was disappointed when I read that it only supports a primary key without auto increment. Only after I took a closer look at the code and carefully ran a test did I find that it actually works with auto increment and a unique constraint (at least in MySQL).
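For readers who want to try the idea without a MySQL server, here is a minimal sketch using only Python's built-in sqlite3 module. It is not pangres code and not the MySQL statement pangres issues: it swaps in SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` (assumes SQLite 3.24 or newer) and a simplified table without the timestamp columns, to show that an upsert keyed on a unique constraint can coexist with an automatically assigned integer primary key. The `upsert_rows` helper is a hypothetical name used only for this illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table (assumption:
# SQLite >= 3.24, which introduced ON CONFLICT ... DO UPDATE).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_info (
        row_id     INTEGER PRIMARY KEY,  -- assigned automatically by SQLite
        order_id   TEXT NOT NULL,
        product_id TEXT NOT NULL,
        qty        INTEGER,
        refund_qty INTEGER,
        UNIQUE (order_id, product_id)    -- the key the upsert relies on
    )
""")

UPSERT_SQL = """
    INSERT INTO order_info (order_id, product_id, qty, refund_qty)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (order_id, product_id)
    DO UPDATE SET qty = excluded.qty, refund_qty = excluded.refund_qty
"""


def upsert_rows(rows):
    """Hypothetical helper: insert each row, or update it if the unique key exists."""
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT_SQL, rows)


old_rows = [("A0001", "PD100", 10, 0),
            ("A0002", "PD200", 20, 0),
            ("A0002", "PD201", 22, 0)]
new_rows = [("A0001", "PD100", 10, 0),
            ("A0002", "PD200", 20, 0),
            ("A0002", "PD201", 22, 2),   # existing key, refund_qty changes
            ("A0003", "PD300", 30, 0)]   # new key, gets a new row_id

upsert_rows(old_rows)
upsert_rows(new_rows)

result = conn.execute(
    "SELECT row_id, order_id, product_id, qty, refund_qty "
    "FROM order_info ORDER BY row_id"
).fetchall()
for row in result:
    print(row)
```

The three existing rows keep their row_id values and only the new (order_id, product_id) pair creates a fourth row, which mirrors what the MySQL test above shows.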

Top GitHub Comments

LawrentChen commented, Jul 1, 2020

Thank you for taking my opinion into consideration. I’ve looked at both PRs and I believe they are clear enough for new users. No need to feel sorry for not asking me before merging; I am already quite satisfied with being involved 😝.

Docker and pytest are already on my learning roadmap, and I am also considering Kubernetes. Happy to have the right targets. I will try npdoc_to_md in the future. It seems it can be used together with a documentation generator like Sphinx/Jupinx.

And after you finish, I believe this issue can be closed anytime you wish. Have a nice day! 👍

ThibTrip commented, Jun 30, 2020

I can’t really remember what the issue with the auto-incrementing key was, but the comment you were pointing at is in a module filled with examples, which is used for tests, for the docs, and for users who want to try pangres quickly. I recall that for some reason testing with an auto-incrementing primary key was complicated, which is why I ended up using a VARCHAR instead. I removed this comment anyway in my new pull request. This PR changes the documentation to indicate that unique keys can be used. Actually it does a little more than that, as I kind of got carried away (e.g. I changed the script for generating the documentation and fixed the yml file for code coverage). Maybe you want to give me your opinion on the documentation changes in the PR before I merge it?

I also added tests with unique keys in a previous PR. I forgot to ask you before merging, sorry. I hope the test is what you had in mind. I just removed the timestamp columns because they were unnecessarily complicated for testing. And yes, I did see what you mentioned about triggers. Fortunately I have never needed such a use case in my work 🙈, or I would just do datetime.datetime.now().astimezone(datetime.timezone.utc) and call it a day (it did not matter much for me, but doing that server side would be more accurate and most likely better for performance).
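As a small illustration of the client-side timestamp mentioned above, the following sketch stamps rows with a timezone-aware UTC datetime before they would be written. The `utc_now` helper and the `rows` list are hypothetical names used only for this example; they are not part of pangres.

```python
import datetime


def utc_now():
    """Return the current client-side time as a timezone-aware UTC datetime."""
    # now() yields naive local time; astimezone() converts it to UTC
    return datetime.datetime.now().astimezone(datetime.timezone.utc)


# Hypothetical rows about to be upserted; all receive the same timestamp
rows = [{"order_id": "A0002", "product_id": "PD201", "refund_qty": 2}]
stamp = utc_now()
for row in rows:
    row["update_time"] = stamp
```

`datetime.datetime.now(datetime.timezone.utc)` gives an equivalent result in a single call. Either way the value comes from the client's clock, which is why a server-side default such as the `ON UPDATE CURRENT_TIMESTAMP` column in the table above tends to be more accurate.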

For JupyterHub, I suppose using Docker should make the task much easier. Obviously you’ll have to learn Docker, but it should be worth it (plus Docker is used everywhere 😐). See the Docker page on the JupyterHub website (they provide the link to the Docker image on Docker Hub with detailed instructions). As for testing, I can highly recommend pytest. It’s very convenient and flexible. I am not sure I understand everything about parametrizing and generating tests, though. I have used parameters in my other public library, npdoc_to_md, and in pangres I just generate tests for each database type. I learnt a few things about pytest by looking at the tests in the pandas repo. And keep writing issues 👍. I don’t think everyone does that when they notice something’s not right.