
Managing Identity Columns as Primary Keys in DMS. Feasible ?


Introduction

This issue is an explanation of why auto increment primary keys are not handled by DMS yet. Solutions exist to ensure a correct sync between several systems when some tables use identity primary keys. Theoretically, these solutions work, but in practice (you know… real life) some really complex challenges can emerge from them.

This post is NOT a solution, but more an open discussion on “how it could be done, if possible”.

The challenge with auto increment keys is basically: how can we avoid a conflict when 2 records from 2 systems use the same auto incremented primary key for a new record? Deletes and updates are not a problem here; the main source of concern is insertion.

  • Client inserts a new record : Primary Key generated : 42
  • Server inserts a new record : Primary Key generated : 42

Is it an update conflict? Is it two insertions? If so, should we change the Client primary key to 43? (And what will happen on the next insertion?)

Solutions

Basically, we have two solutions to avoid insertion conflicts with auto increment primary keys:

1) Assigning ranges

The idea is to work with identity columns as primary keys and ranges of ids. For instance, you can define something like this:

  • Server IDS range: 1 - 99 999 999
  • Client 1 IDS range: 100 000 000 - 100 999 999
  • Client 2 IDS range: 101 000 000 - 101 999 999
  • Client 3 IDS range: 102 000 000 - 102 999 999

You can store these values in any configuration table.
When one client (or the server) reaches the limit of its range, just assign it a new one.

  • PRO : It works with SQL Server, MySQL and SQLite
  • CONS :
    • We are not thread safe (or more accurately, we are not SQL Sessions Safe)
    • We need to manipulate the seeding of identity columns on each sync
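The range assignment above can be sketched as follows. This is a minimal illustration only; the helper name, the node numbering, and the fixed one-million range size are assumptions for the example, not part of DMS:

```python
# Sketch of per-node identity range assignment (illustrative, not DMS API).
RANGE_SIZE = 1_000_000

def assign_range(node_index: int) -> range:
    """Return the id range for a node; node 0 is the server."""
    if node_index == 0:
        return range(1, 100_000_000)              # 1 - 99 999 999
    # Client 1 starts at 100 000 000, Client 2 at 101 000 000, etc.
    start = 100_000_000 + (node_index - 1) * RANGE_SIZE
    return range(start, start + RANGE_SIZE)
```

With this scheme, checking whether a generated id belongs to a client is a simple `value in assign_range(n)` test, which is also what a reallocation routine would use when handing out the next free block.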

2) Assigning Seed / Step

Following the same concept, another technique is to use the IDENTITY (m, n) formula as follows:

  • m = Serial number of a node (Server = 1, First Client = 2, Second Client = 3, etc.)
  • n = Number of replication nodes (Server + All Clients)

For example, we have 4 nodes (n = 4) to sync (1 Server + 3 Clients): the serial number m will be 1 for the Server and 2, 3, 4 for Client – A, Client – B and Client – C respectively.

According to the formula, the data will be populated as follows:

  • Server : IDENTITY (1, 4) - the column data sequence will be 1, 5, 9, 13
  • Client – A: IDENTITY (2, 4) - the column data sequence will be 2, 6, 10, 14
  • Client – B: IDENTITY (3, 4) - the column data sequence will be 3, 7, 11, 15
  • Client – C: IDENTITY (4, 4) - the column data sequence will be 4, 8, 12, 16

You can store the formula values / increments in any configuration table.

  • PROS : Easy to implement if you already know how many clients you will have
  • CONS :
    • The downside of this method is that you need to know the MAX number of clients you will have in your organization.
    • It does not work with SQLite
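The IDENTITY (m, n) sequences above are easy to verify with a short sketch. The helper is illustrative only; it just generates the arithmetic progression the formula describes:

```python
# Node with seed m and step n generates m, m + n, m + 2n, ... (illustrative).
def identity_sequence(seed: int, step: int, count: int) -> list[int]:
    return [seed + i * step for i in range(count)]

# 4 nodes: the Server (seed 1) and Clients A, B, C (seeds 2, 3, 4)
server = identity_sequence(1, 4, 4)    # [1, 5, 9, 13]
client_a = identity_sequence(2, 4, 4)  # [2, 6, 10, 14]
```

Each node's values fall in a distinct residue class modulo the step, which is why no two nodes can ever generate the same id.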

Solution we will discuss here

Since the most common client database used with DMS is SQLite, and since SQL Server merge replication also uses this technique, we will discuss the implementation of Solution 1: Assigning Ranges.

DMS possible Implementation

I’ve already worked on the auto increment primary keys implementation in DMS, even if it hasn’t been published so far. I’m going to explain what I’ve done and which problems I have not resolved yet.

Playing with Identity Columns On SQL Server

Firstly, let’s see what we have, in terms of queries / language / hints, on SQL Server when we are working with identity columns (obviously, there are many other SQL statements for IDENTITY columns that we don’t need here):

  • SET IDENTITY_INSERT : Allows the insertion of a specific value into an identity column
  • DBCC CHECKIDENT : Manages the seed and step of an identity column for a specific table.
  • IDENT_CURRENT : Returns the last identity value generated for a specific table.

Macro implementation in DMS

For the implementation, we will stick to SQL Server only, for now.

scope_info_identity table

Firstly, we need to save and handle the ranges for each client and the server. Obviously, we can use a simple table (like scope_info) to store these values. Nothing really complicated here. Here is a draft of the scope_info_identity table, stored and managed on the server side:

| sync_scope_id    | scope_table_name | scope_identity_min | scope_identity_max | scope_threshold |
| ---------------- | ---------------- | ------------------ | ------------------ | --------------- |
| Client_ID_GUID_1 | Customer         | 1 000 000          | 1 999 999          | 0.7             |
| Client_ID_GUID_2 | Customer         | 2 000 000          | 2 999 999          | 0.7             |
| Client_ID_GUID_1 | Product          | 1 000 000          | 1 999 999          | 0.7             |

Using this kind of table, we can imagine a sync system where:

  • The Client sends its last identity value for each table
  • The Server checks whether the value is above the threshold
  • If the threshold is exceeded, the Server sends back a new range of ids for this specific table, for this specific Client
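The server-side threshold check can be sketched like this. The column names come from the draft table above; the function itself is an assumption for illustration, not part of DMS:

```python
# Hedged sketch of the threshold check a server could run on each sync.
from dataclasses import dataclass

@dataclass
class ScopeIdentity:
    scope_identity_min: int
    scope_identity_max: int
    scope_threshold: float   # e.g. 0.7 => reallocate at 70% consumption

def needs_new_range(scope: ScopeIdentity, last_identity: int) -> bool:
    """True when the client has consumed enough of its range to need a new one."""
    size = scope.scope_identity_max - scope.scope_identity_min + 1
    used = last_identity - scope.scope_identity_min + 1
    return used / size >= scope.scope_threshold

scope = ScopeIdentity(1_000_000, 1_999_999, 0.7)
# needs_new_range(scope, 1_400_000) -> False
# needs_new_range(scope, 1_800_000) -> True
```

The threshold matters because the client keeps inserting between syncs; allocating the next block early (here, at 70% consumption) avoids the client ever running out of ids while offline.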

Merging rows

Now come the problems. Here is a macro version of the actual Insert / Update stored procedure (on SQL Server, we use a MERGE statement) used when inserting new rows on any client or server, when dealing with identity primary keys:


SET IDENTITY_INSERT [Product] ON;

MERGE [Product] AS [base]
USING [changes] ON [changes].[ProductID] = [base].[ProductID]
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;

SET IDENTITY_INSERT [Product] OFF;

Something important to notice about SET IDENTITY_INSERT: it allows us to specify a value for our identity column, and if this value is higher than the last generated value, the internal identity counter is advanced as well.

A small sample:

Select IDENT_CURRENT('Product')
-- Result : 1

Insert into Product (Name) Values ('Bikes')
Select IDENT_CURRENT('Product')
-- Result : 2

SET IDENTITY_INSERT [Product] ON;
Insert into Product (ProductID, Name) Values (1000000, 'Bikes')
SET IDENTITY_INSERT [Product] OFF;
Select IDENT_CURRENT('Product')
-- Result : 1 000 000

Since we need to be sure each client stays within its range, we need to handle this behavior.

Client 1: Range 1 000 000 to 1 999 999

Let’s see what happens for Client 1 with range 1 000 000 to 1 999 999. We need to ensure that all values inserted from Client 1 always stay in that range.

But what happens if we have a sync on Client 1 with records coming from Client 3, whose range is 3 000 000 to 3 999 999? Well, for sure, the internal identity value for that column will be reseeded to something between 3 000 000 and 3 999 999. And we need to avoid that.

Here is a macro version showing how we can avoid this situation:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE [Product] AS [base]
USING [changes] ON [changes].[ProductID] = [base].[ProductID]
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

Using this technique, we RESEED the internal increment value back to the value it had before the merge (which may have inserted higher values into the identity column).

This solution leads to several problems …

Multi SQL Sessions insertions

What happens if another session inserts a record between saving the identity value (SELECT @identity_current = IDENT_CURRENT('Product');) and the RESEED?

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

-- ANOTHER SESSION TRY TO INSERTS A RECORD HERE IN THE CLIENT 1 DATABASE

MERGE ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

The other session will insert a new record, and our temporarily saved value will no longer be correct. The RESEED will work, but will restore a stale value. Since the reseeding is not correct, an error will be raised on the next INSERT in the Client 1 database.

On the other hand, what will happen if another session tries to insert something just after the MERGE?

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE ...

-- ANOTHER SESSION TRY TO INSERTS A RECORD HERE IN THE CLIENT 1 DATABASE

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

Here, we may end up with a new record in the Client 1 database that is outside the authorized range of values…
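The race above can be modeled with a toy counter. This is purely illustrative; the class and values below are assumptions for the example and say nothing about SQL Server internals:

```python
# Toy model of the race: a shared identity counter, a sync session that saves
# and later reseeds it, and another session inserting in between.
class IdentityCounter:
    def __init__(self, current: int):
        self.current = current

    def next_value(self) -> int:           # what a plain INSERT does
        self.current += 1
        return self.current

    def reseed(self, value: int) -> None:  # DBCC CHECKIDENT (..., RESEED, value)
        self.current = value

counter = IdentityCounter(current=1_000_005)  # Client 1, inside its own range

saved = counter.current        # sync session saves IDENT_CURRENT -> 1 000 005
counter.reseed(3_000_042)      # MERGE inserts a Client 3 row, bumping the counter
leaked = counter.next_value()  # another session inserts -> 3 000 043, out of range!
counter.reseed(saved)          # sync reseeds back to 1 000 005
```

After this interleaving, the out-of-range row 3 000 043 exists on Client 1 and the counter has "forgotten" about it, which is exactly the failure mode described above.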

Using a serializable transaction

A serializable transaction is the most protective isolation level. Can we use it here?

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE ...

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

The serializable transaction won’t save us here: it does not lock the entire table, and it still allows any other session to get a new identity value and insert a new record into the table, exactly as in our last 2 samples…

TABLOCK hint

The TABLOCK hint is more restrictive and will lock the entire table during our MERGE statement:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE Product WITH (TABLOCK)

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

It will not save us either, since another session can still insert before or after the MERGE statement…

Solutions on both sides

So far, we have seen that the problem comes from multiple SQL sessions hitting the table being updated during a sync MERGE statement…

Freeze the UI and prevent multiple SQL sessions

Can we say something like: “Hey, when you are doing a sync on this table, DO NOT TRY TO MAKE ANY INSERTS IN IT”?

That could potentially work for client databases. You can easily freeze the UI and prevent this situation from happening.

But what happens on the Server side? The Server needs to handle multiple syncs at the same time, and we can’t prevent two syncs from happening concurrently…

Disabling the internal counter increment

An ideal solution would be to disable the internal increment of the current identity value during the MERGE statement:

SET IDENTITY_INSERT [Product] ON;
DECLARE @identity_current numeric(18, 0);

-- DISABLE INTERNAL IDENTITY CURRENT INCREMENT HERE, BUT ONLY FOR THIS SESSION
SELECT @identity_current = IDENT_CURRENT('Product');

MERGE Product WITH (TABLOCK)

SET IDENTITY_INSERT [Product] OFF;
DBCC CHECKIDENT ('Product', RESEED, @identity_current)

I guess this is what happens when we use the NOT FOR REPLICATION option with SQL Server merge replication.

The “Not For Replication” setting for identity columns allows replication to pass the ID value from the publisher to the subscriber without the subscriber creating a new ID. Since identity columns will always generate a new number when records are inserted, this setting allows you to ensure the values on both your publisher and subscribers stay in synch.

Source : https://www.mssqltips.com/sqlservertip/1274/change-not-for-replication-value-for-sql-server-identity-columns/

Conclusion

I’m not sure I’ve put here all the research and tests I’ve made so far on the identity columns sync problem, but at least this thread could be a good starting point for discussing it.

If you have any ideas, do not hesitate to share your thoughts here.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

gentledepp commented, Aug 22, 2021 (2 reactions)

To be honest - I do consider using sequential numeric values as primary keys in sync scenarios a fundamental design flaw. No one should ever do this. Rather, use sequential guids or the like to improve index clustering. So I am going to stay out of this discussion 😉

bearyung commented, Apr 19, 2023 (0 reactions)

No, that is the main issue: we can’t change the schema at all, as our SQL user doesn’t have those rights. We were using SQLite-sync in the past, but for different reasons I am looking for something different. SQLite-sync uses an identity range per client, which is actively updated when the user syncs, and then local triggers to ensure the ID is in the range of the user. I have been thinking about implementing this here also, but time…

For the identity-range-per-client approach, the sync server needs to estimate the growth rate of each client, how frequently the client will connect to the server to update the “used range” and take the next batch of the available range, and, if the identity range uses up all its ids, whether it can overflow (the overflow could potentially conflict with another client’s assigned range).

Another approach is to build a middle DB (either local or in the cloud), use DMS to perform a 1:1 sync of the SQLite data to the middle DB, then fill in the shop ID in the middle DB, and perform another sync from the middle DB to the centralized DB (merged with other shops’ data).
