Improve LockService functionality and flexibility
Description
Because concurrent update/rollback/etc. operations against a database will interfere with each other, Liquibase has a LockService that coordinates access so that only one instance can run against a database at a time.
Examples of when you can run into concurrency trouble:
- Liquibase is embedded on startup of an application server, and you have a cluster of machines that may be starting at once
- Liquibase runs as part of a build job, and concurrent builds can happen at once
The StandardLockService is the normal implementation, which uses a databasechangeloglock table. Liquibase inserts a record and commits it at startup time, and if other instances see that record they wait until it is cleared.
The big problem with this standard implementation is that if the instance doing the update is killed before it commits the clearing of the record, that value stays stuck until liquibase releaseLocks is run.
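A minimal sketch of the stuck-lock problem, using Python's built-in sqlite3 module as a stand-in database. The simplified column layout and the helper names (create_lock_table, acquire, release) are assumptions made for illustration, not Liquibase's actual schema or API.

```python
import sqlite3


def create_lock_table(conn):
    # Simplified stand-in for the databasechangeloglock table (assumed layout).
    conn.execute(
        "CREATE TABLE databasechangeloglock ("
        "id INTEGER PRIMARY KEY, locked INTEGER NOT NULL, lockedby TEXT)"
    )
    conn.execute("INSERT INTO databasechangeloglock VALUES (1, 0, NULL)")
    conn.commit()


def acquire(conn, owner):
    # Take the lock only if nobody holds it; the row count tells us who won.
    cur = conn.execute(
        "UPDATE databasechangeloglock SET locked = 1, lockedby = ? "
        "WHERE id = 1 AND locked = 0",
        (owner,),
    )
    conn.commit()
    return cur.rowcount == 1


def release(conn):
    # The step that never happens when the holder is killed.
    conn.execute(
        "UPDATE databasechangeloglock SET locked = 0, lockedby = NULL WHERE id = 1"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_lock_table(conn)
print(acquire(conn, "node-a"))  # node-a takes the lock
# node-a is killed here, so release() is never called
print(acquire(conn, "node-b"))  # node-b stays locked out until someone releases
```

Because the committed flag outlives the process that set it, nothing in the table tells node-b that node-a is gone.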
This problem has been around forever in Liquibase, but has been exacerbated in recent years by systems like AWS that auto-kill processes that seem slow to start up. We wrap everything in try/catch blocks, but process killing happens too far out-of-band for us to deal with.
Related issues/PRs
Descriptions and proposed fixes:
Moving the overall discussion and design to this ticket.
Requirements (New Configuration Options)
- Add a liquibase.lockservice.enabled global configuration value that can be used to disable the lock service completely. This will allow easier “BYO Locking” (see the “Options Considered” section below).
- Add a liquibase.lockservice.heartbeat.enabled global configuration value that can be used to enable the heartbeat functionality. For the first release, this defaults to “false”. A future release will change the default to “true”.
- Add a liquibase.lockservice.heartbeat.rate global configuration value that controls how often the heartbeat column is updated. The value is in seconds, with a default of “10”.
- Add a liquibase.lockservice.heartbeat.timeout global configuration value that controls how long a separate process waits before it can take the lock. The value is in seconds, with a default of “30”.
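To make the relationship between the four options concrete, here is a rough sketch of how they might be read and validated. The LockServiceSettings class and load_settings function are hypothetical names; the default of “true” for liquibase.lockservice.enabled and the check that the timeout exceeds the rate are assumptions, not part of the requirements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockServiceSettings:
    enabled: bool = True             # assumed default; the issue does not state one
    heartbeat_enabled: bool = False  # "false" for the first release
    heartbeat_rate: int = 10         # seconds between heartbeat updates
    heartbeat_timeout: int = 30      # seconds before a waiter may take the lock


def load_settings(props):
    """Build settings from a dict of property names to string values."""

    def flag(key, default):
        return str(props.get(key, default)).strip().lower() == "true"

    settings = LockServiceSettings(
        enabled=flag("liquibase.lockservice.enabled", True),
        heartbeat_enabled=flag("liquibase.lockservice.heartbeat.enabled", False),
        heartbeat_rate=int(props.get("liquibase.lockservice.heartbeat.rate", 10)),
        heartbeat_timeout=int(props.get("liquibase.lockservice.heartbeat.timeout", 30)),
    )
    if settings.heartbeat_rate <= 0:
        raise ValueError("heartbeat.rate must be positive")
    # Assumption: a timeout no longer than the rate would let a waiter steal
    # the lock from a healthy holder between two heartbeats.
    if settings.heartbeat_timeout <= settings.heartbeat_rate:
        raise ValueError("heartbeat.timeout should exceed heartbeat.rate")
    return settings


print(load_settings({"liquibase.lockservice.heartbeat.enabled": "true"}))
```

With the defaults, a holder can miss two consecutive heartbeats before a waiting process is allowed to take over.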
Requirements (Add Heartbeat support)
Update StandardLockService to use a heartbeat thread to update the heartbeat column. (See “options” section below)
- Existing databasechangeloglock tables should be auto-updated to have an additional, nullable “heartbeat” varchar(255) column.
- Heartbeat thread must gracefully handle errors, and if the thread stops unexpectedly it should cause the overall liquibase operation to stop.
- The heartbeat thread will update the column to a random number/string at the configured rate. The waiting process will watch for changes to this value rather than comparing dates in the heartbeat column, which avoids date arithmetic and/or system time issues.
- The SQL to take the lock should take the last heartbeat value the waiting process saw into account and check how many rows are updated. This will handle the case where multiple waiting processes try to force an unlock concurrently.
- The waiting process cannot take the lock if the heartbeat column is null. This supports the “heartbeat.enabled=false” configuration as well as older versions of Liquibase.
Requirements (General refactoring)
While we are adding this feature, we will also do our 4.x work of reviewing the LockService interface in general to make it more extensible and manageable. This makes it easier to have a single release note of “We changed the lock service to support heartbeats and made it easier to work with”. The changes will be API-breaking for extensions that wrote their own LockService, but we will include documentation on how to update their code.
- Refactor LockService API to be in line with new 4.x standards (use Scope, extend PluginService, etc.)
- Refactor/centralize the databasechangeloglock table management into the StandardLockService rather than being scattered across special SqlGenerators etc.
Test Criteria
- Running liquibase concurrently with a running update/rollback will wait until the other process is done.
- If the original liquibase process is killed halfway through, the second process will take the lock and do the update
- Lock logic works with liquibase.lockservice.enabled=false on both processes
- Lock logic works with liquibase.lockservice.enabled=false on only one process
- Running update on concurrent processes with liquibase.lockservice.enabled=false will not cause one to wait for the other
Not doing
- Not doing anything database specific. If we find that there are cases not handled with the heartbeat logic, we can consider database-specific syntax as a future enhancement.
Options Considered
BYO Locking
One option is to leave the locking to the users. There are many situations where Liquibase is run within a system where the user knows only one instance can be run at a time. For example, cloud providers have an “init” phase where the platform ensures just one container runs, and Liquibase can be moved there. Or, updates are done via a managed deployment script or build process which ensures only one version of an app is being deployed at once.
In these cases, the lock service is just getting in the way. Ideally, the default lock service should work just fine, and whether it runs unnecessarily or not isn’t worth the hassle of turning it off. But it’s a fallback option to consider. We do have the nodatabaselock extension, which could be folded into the main Liquibase code with an easy flag to switch to that implementation.
Database Session Based Locking
Databases have ways to lock tables for the duration of a connection that automatically unlocks them when the connection closes.
By using that functionality, we can rely on the database to clean up the lock. We don’t even necessarily need the databasechangeloglock table anymore, because we can lock the databasechangelog table itself.
However, this approach is going to be specific to each database since the syntax is different for each database. The semantics of what a “lock” means and the visibility of it also varies by database, so it will require extensive testing to ensure it works as expected everywhere.
A native lock system may break the “lock has been held by X since Y” tracking we can do with the databasechangeloglock table, since that native lock ownership may not be visible to other connections. But we could still preserve the databasechangeloglock table for reporting purposes even if we don’t use it for locking purposes anymore.
Examples of lock syntax:
- PostgreSQL: pg_try_advisory_lock https://www.postgresql.org/docs/9.1/functions-admin.html
- MySQL: lock/unlock tables https://dev.mysql.com/doc/refman/8.0/en/lock-tables.html
- Oracle: lock/unlock tables https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqlj40506.html
- MSSQL: select with tablock https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-table?view=sql-server-ver15
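As a runnable analogue of connection-scoped locking, the sketch below uses SQLite's BEGIN EXCLUSIVE through Python's sqlite3 module. SQLite is not one of the databases listed above and is used here only because it needs no server; try_session_lock is a hypothetical helper name. The point is that the lock disappears when the holding connection closes, with no cleanup row to commit.

```python
import os
import sqlite3
import tempfile


def try_session_lock(path):
    """Return a connection holding an exclusive lock, or None if it is taken."""
    # timeout=0 makes a busy database fail immediately instead of waiting;
    # isolation_level=None lets us issue BEGIN ourselves.
    conn = sqlite3.connect(path, timeout=0, isolation_level=None)
    try:
        conn.execute("BEGIN EXCLUSIVE")
    except sqlite3.OperationalError:
        conn.close()
        return None
    return conn


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    holder = try_session_lock(path)   # first instance gets the lock
    blocked = try_session_lock(path)  # second instance is refused
    holder.close()                    # closing the connection drops the lock
    retry = try_session_lock(path)    # now the second instance succeeds
    print(holder is not None, blocked is None, retry is not None)
    retry.close()
```

A killed process loses its connection, so the database releases the lock on its behalf, which is the property the table-based lock lacks.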
Heartbeat
Rather than relying on a native database approach, we could have a thread within the LockService update the databasechangeloglock table every 30 seconds. If a separate process sees that the heartbeat is over 30 seconds old, it can take the lock.
This avoids the database-specific logic, but it also requires threads, which some application servers do not like creating. But maybe that is still OK?
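The holder side of the heartbeat could look roughly like the following. The Heartbeat class and its write_heartbeat callback are hypothetical; in a real implementation the callback would update the heartbeat column, and how the main operation polls check() is left open.

```python
import threading
import uuid


class Heartbeat:
    """Background thread that writes a fresh random value at a fixed rate."""

    def __init__(self, write_heartbeat, rate_seconds):
        self._write = write_heartbeat  # e.g. a function that runs the UPDATE
        self._rate = rate_seconds
        self._stop = threading.Event()
        self.failed = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        try:
            while not self._stop.is_set():
                self._write(uuid.uuid4().hex)
                # wait() returns early when stop() is called
                self._stop.wait(self._rate)
        except Exception:
            # Record the failure so the main operation can abort cleanly.
            self.failed.set()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def check(self):
        # Call between changesets: without a heartbeat another process may
        # take the lock, so continuing would be unsafe.
        if self.failed.is_set():
            raise RuntimeError("heartbeat stopped; aborting the operation")
```

The thread is marked as a daemon so it cannot keep the process alive on its own, and a failed write surfaces through check() rather than being swallowed.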
Max Lock Time
There have been some suggestions to have a setting where the lock attempt succeeds if the old lock is older than X seconds. I tend to not like this because sometimes updates (especially the first run, or when creating indexes on a large table) can just take a long time. To avoid taking the lock from a process that is still working, the “take lock” setting has to be long. But a long enough take-lock setting doesn’t help the use case of cloud providers auto-killing processes and wanting to restart a new one right away. For example, a 30-minute setting may sometimes not be long enough, so the lock is incorrectly taken, while at the same time being way too long for people to wait for their auto-restart logic to take the lock.
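The two failure modes can be shown with a few lines of arithmetic. The may_take_lock function is a hypothetical stand-in for an age-based rule, and the 31-minute and 2-minute figures are made-up illustrations.

```python
def may_take_lock(lock_age_seconds, max_lock_seconds):
    # Age-based rule: the lock can be taken once it is older than the limit.
    return lock_age_seconds > max_lock_seconds


THIRTY_MINUTES = 30 * 60

# A healthy index build that has been running for 31 minutes:
# its lock is taken even though the holder is still working.
print(may_take_lock(31 * 60, THIRTY_MINUTES))

# A holder that was killed 2 minutes ago:
# the restarted instance is still refused and has to keep waiting.
print(may_take_lock(2 * 60, THIRTY_MINUTES))
```

No single value of the limit fixes both cases, because lock age says nothing about whether the holder is alive.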
Other options?
Any other ideas on how we could implement the logic?
Top GitHub Comments
The plugin https://github.com/blagerweij/liquibase-sessionlock (released mid 2020) now implements your option “Database Session Based Locking”, and does it very well. It works for MySQL, Postgresql and Oracle. Personally, I think it is a much better solution than a timeout based approach, and the price of writing database specific adapters for it is worth paying. (In a way, isn’t one of the primary goals of Liquibase to write database specific adapters for common operations?). In my opinion it would be nice to bring that into core as it is strictly better than the standard lock.
If you are looking at locking issues, I think double checked locking is also worth pursuing. See my comments on #829 and #2105
Thanks for Liquibase, I think it’s a great project.
@stdmitry check out https://github.com/liquibase/liquibase/pull/2190; people are actively working on a resolution that should address your use case (and sorry your application has to get restarted over and over again… sounds a bit painful)