tools to pause chain/kernel/vats before security upgrades
What is the Problem Being Solved?
Imagine we (Agoric) have just received disclosure of a significant security bug in some component of the running chain. How can we safely deploy a fix, without giving attackers time to exploit the problem?
The vulnerability might already be known to the attackers, who have just been waiting for it to become worth exploiting (e.g. waiting for a liquidity pool to grow to a juicy size), so they may execute their attack as soon as they see or suspect a fix coming. Or they don’t already know the problem, but can reverse-engineer it from the fix, and then perform the attack before the fix is fully deployed.
The core issue is that both the defender’s sequence (learning about a problem, fixing it, deploying the fix) and the attacker’s sequence (learning about a problem, developing an exploit, executing the exploit) take non-zero time, so the two sides are in a race. This problem exists in distributed systems of all shapes and sizes, but it’s particularly exciting for decentralized systems, where there is no one party with the authority to make a change. The fix may involve changing some parameter within a contract, or upgrading the contract, or upgrading the entire chain. Some fixes require transactions sent into the chain; these must make their way through various public queues before execution, which gives attackers an opportunity to front-run them and exposes them to MEV threats from observant validators. Deeper fixes require coordinating the validator community to upgrade their software. And both kinds of fixes might be telegraphed by commits to an open-source code repository before they are ready to be deployed. Each of these reveals significant information to the attackers, who may then be able to act before the fix is fully implemented.
A powerful tool to address this is the “snooze button”. A small group can have the power to pause some or all of the chain’s activity, giving a larger group time to develop and deploy a fix. Then, after the fix is deployed, the chain is resumed. The pause event can reveal the existence of a problem, but not the details, reducing the attacker’s advantage. Only the attacker who already knew about a problem and was ready to execute their attack (and can race ahead) can react to the pause event.
Once paused, the defenders can work on the fix in public, or at least they can safely involve a larger group to test the fix and coordinate deployment. This reveals the details to the attackers, but by that point it is too late for them to exploit.
Users of our system care about liveness: knowing that their transactions can’t be blocked forever (at least not without the approval of some larger governance committee). They care that this “snooze button” has a limited duration, perhaps a few days or a few weeks. But we can imagine various “sizes” of snooze buttons, with larger governance requirements over the longer-duration delays.
Categories of Attack, Categories of Fixes
We’re imagining problems that affect components at various scales:
- a single contract has a problem, which could be addressed by changing some parameter
  - Pause: pause the contract vat, causing all inbound messages to be queued off to the side
  - Fix: allow a high-priority non-paused message to change the parameter
  - Resume: resume delivery from the side queue, then allow main-queue messages to arrive
- a single contract has a problem, which requires a complete vat/contract upgrade
  - Pause: pause the contract vat, queue all inbound messages off to the side
  - Fix: perform an upgrade of the vat (#3272)
  - Resume: resume delivery from the side queue, then allow main-queue messages to arrive
- a collection of contracts have a problem
  - Pause: the kernel stops servicing the low-priority queues (#3465), but allows high-priority messages so e.g. liquidation continues but new vault creation is paused
  - Fix: vat upgrade, parameter change
  - Resume: the kernel resumes servicing the low-priority queues
- the entire swingset kernel has a problem
  - Pause: the kernel stops servicing all queues
  - Fix: the kernel is upgraded
  - Resume: the kernel resumes servicing all queues
- one or more Cosmos-SDK modules have a problem
  - Pause: a governance/emergency-pause module tells those modules to reject all txns
  - Fix: a governance module modifies some parameter, or the validation software is upgraded
  - Resume: the governance/emergency-pause module tells those modules to start accepting txns again
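The single-vat pause/fix/resume cycle described above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyKernel`, `pauseVat`, `resumeVat`, and the `priority` field are hypothetical names invented here, not the real SwingSet kernel API.

```javascript
// Toy model of a per-vat pause; NOT the real SwingSet kernel API.
class ToyKernel {
  constructor() {
    this.mainQueue = []; // normal run-queue: { vatID, msg, priority }
    this.sideQueues = new Map(); // vatID -> messages parked while paused
    this.delivered = []; // log of deliveries, in order
  }

  pauseVat(vatID) {
    if (!this.sideQueues.has(vatID)) this.sideQueues.set(vatID, []);
  }

  send(vatID, msg, priority = 'normal') {
    this.mainQueue.push({ vatID, msg, priority });
  }

  // Process everything currently on the main queue.
  run() {
    while (this.mainQueue.length > 0) {
      const item = this.mainQueue.shift();
      const side = this.sideQueues.get(item.vatID);
      if (side && item.priority !== 'high') {
        side.push(item); // vat is paused: park the message off to the side
      } else {
        this.delivered.push(`${item.vatID}:${item.msg}`);
      }
    }
  }

  // Deliver parked messages first, so they stay ahead of anything
  // still waiting on the main queue.
  resumeVat(vatID) {
    const side = this.sideQueues.get(vatID) || [];
    this.sideQueues.delete(vatID);
    for (const item of side) {
      this.delivered.push(`${item.vatID}:${item.msg}`);
    }
  }
}

const kernel = new ToyKernel();
kernel.pauseVat('v1');
kernel.send('v1', 'swap'); // parked
kernel.send('v2', 'deposit'); // unaffected vat, delivered
kernel.send('v1', 'setParam', 'high'); // the fix, delivered despite the pause
kernel.run();
kernel.send('v1', 'withdraw'); // arrives on the main queue before resume
kernel.resumeVat('v1'); // side queue drains first
kernel.run(); // then the main queue
```

In this sketch the parked `swap` is delivered before the later `withdraw`, which is the ordering the "Resume" steps ask for.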
We also imagine fine-grained contract pauses, in which the contract consults a table of what activity should and should not be allowed at any given moment. The contract might reject method invocations when paused, or it might queue them internally. We can imagine contracts registering to hear about updates to the “emergency pause table”, via high-priority update messages. In this approach:
- Pause: use the bridge-device mechanism to send an update, wait for it to be delivered to the contract vat
- Fix: send a message to the contract to change a parameter, or perhaps upgrade the vat entirely
- Resume: update the table, wait for the vat to hear about the update
A similar “pause table” could be used at the Cosmos-SDK level, between Go modules, without using the bridge device.
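A contract-level pause table might look roughly like the following. This is a minimal sketch, assuming a table that maps activity names to either `'reject'` or `'queue'`; `makePausableContract`, `updatePauseTable`, and the activity names are hypothetical and do not correspond to an existing Zoe or governance API.

```javascript
// Hypothetical pause table: activity name -> 'reject' | 'queue'.
// An activity with no entry is allowed.
function makePausableContract() {
  const pauseTable = new Map();
  const pending = []; // invocations held while their activity is paused
  const log = [];

  const handlers = {
    openVault: (who) => log.push(`openVault:${who}`),
    liquidate: (who) => log.push(`liquidate:${who}`),
  };

  // Would be driven by a high-priority "pause table" update message.
  // A value of null removes the entry (unpauses the activity).
  function updatePauseTable(updates) {
    for (const [activity, mode] of Object.entries(updates)) {
      if (mode === null) pauseTable.delete(activity);
      else pauseTable.set(activity, mode);
    }
    // Replay held invocations whose activity is allowed again;
    // anything still listed in the table stays held.
    const held = pending.splice(0);
    for (const [activity, arg] of held) {
      if (pauseTable.has(activity)) pending.push([activity, arg]);
      else handlers[activity](arg);
    }
  }

  function invoke(activity, arg) {
    const mode = pauseTable.get(activity);
    if (mode === 'reject') throw new Error(`${activity} is paused`);
    if (mode === 'queue') {
      pending.push([activity, arg]);
      return 'queued';
    }
    handlers[activity](arg);
    return 'done';
  }

  return { log, updatePauseTable, invoke };
}

const contract = makePausableContract();
contract.updatePauseTable({ openVault: 'queue' }); // Pause
contract.invoke('liquidate', 'alice'); // still allowed
contract.invoke('openVault', 'bob'); // held internally
contract.updatePauseTable({ openVault: null }); // Resume: replays bob's request
```

Whether a paused activity should reject or queue is a per-contract design choice; queueing preserves user intent but means the contract must bound how much it holds.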
Most of these pauses would be initiated by a Cosmos-SDK module, which reacts to a quorum of signed transactions from a small “security committee”. This module would then change parameters, send bridge-device updates, and tell the Swingset module how/whether to interact with the kernel. For example, the Swingset module currently calls the swingset `controller.run(runPolicy)` method during END_BLOCK to perform a bounded amount of work (pulling from all queues in priority order). If the pause type was “stop servicing low-priority queues”, this module would be instructed to instead call `controller.run(runPolicy, { onlyServiceQueue: 'high' })` or similar. Timer and mailbox events would still be pushed onto the run-queue, but the low-priority consequences would not happen until the setting was changed.
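To make the END_BLOCK behavior concrete, here is a toy controller with two queues. It is a sketch under assumptions: `onlyServiceQueue` is the hypothetical option floated above, `makeToyController` and the `pauseState` values are invented for illustration, and `maxCranks` is a simplified stand-in for a real run policy.

```javascript
// Toy controller; `onlyServiceQueue` is a proposed option, not an
// existing SwingSet parameter.
function makeToyController() {
  const queues = { high: [], low: [] };
  const processed = [];
  return {
    processed,
    push(priority, msg) {
      queues[priority].push(msg);
    },
    // runPolicy.maxCranks bounds the work done per block.
    run(runPolicy, options = {}) {
      const { onlyServiceQueue } = options;
      const order = onlyServiceQueue ? [onlyServiceQueue] : ['high', 'low'];
      let cranks = 0;
      for (const name of order) {
        while (queues[name].length > 0 && cranks < runPolicy.maxCranks) {
          processed.push(queues[name].shift());
          cranks += 1;
        }
      }
      return cranks;
    },
  };
}

// What the Swingset module might do at END_BLOCK, given the pause
// state set by the (hypothetical) emergency-pause module.
function endBlock(controller, pauseState) {
  const runPolicy = { maxCranks: 10 };
  if (pauseState === 'all') return 0; // kernel fully paused
  if (pauseState === 'low-priority') {
    return controller.run(runPolicy, { onlyServiceQueue: 'high' });
  }
  return controller.run(runPolicy);
}

const controller = makeToyController();
controller.push('low', 'openVault');
controller.push('high', 'liquidate');
endBlock(controller, 'low-priority'); // only 'liquidate' is processed
endBlock(controller, 'none'); // 'openVault' is processed after unpause
```

Note that messages can still be pushed onto the low-priority queue while it is paused; they simply wait until a block runs without the restriction.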
To maintain liveness, each of these pauses needs to be clearly time-bounded. The Cosmos-SDK module that receives the security committee txn needs to watch the block height and unpause everything when the pause expires. Additional votes (with a larger quorum requirement) might extend the pause if more time is necessary to develop/test/deploy the fix.
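The automatic expiry could be sketched as a per-block check. This is an illustrative model only: `makePauseKeeper` and its methods are hypothetical names, and a real Cosmos-SDK module would be written in Go and persist this state in its store.

```javascript
// Hypothetical keeper of active pauses, checked once per block.
function makePauseKeeper() {
  const active = new Map(); // pauseID -> { target, expiresAtHeight }
  return {
    pause(pauseID, target, currentHeight, durationBlocks) {
      active.set(pauseID, {
        target,
        expiresAtHeight: currentHeight + durationBlocks,
      });
    },
    // Returns the targets whose pause expired at this height.
    endBlock(currentHeight) {
      const expired = [];
      for (const [pauseID, { target, expiresAtHeight }] of active) {
        if (currentHeight >= expiresAtHeight) {
          active.delete(pauseID);
          expired.push(target);
        }
      }
      return expired;
    },
    isPaused(target) {
      return [...active.values()].some((p) => p.target === target);
    },
  };
}

const keeper = makePauseKeeper();
keeper.pause('pause-1', 'vat:v1', 100, 50); // pause for 50 blocks
keeper.endBlock(149); // nothing expires yet
keeper.endBlock(150); // 'vat:v1' is unpaused without any committee action
```

Expressing the bound as a block height rather than wall-clock time keeps the check deterministic across validators.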
Disclosure Timeline
We imagine a sequence like the following:
- security researcher notifies a member of the security team about a potential problem
- security team quietly investigates, concludes the problem is severe enough to warrant the snooze button
- security committee is quietly informed, convinced to snooze, signs the txn, submits the txn
  - prepared attacker learns about the upcoming pause, might try to race ahead and deploy attack
  - all attackers become aware of the service that is vulnerable, but not the nature of the vuln
- pause txn gets accepted into a block, activity is now paused
  - prepared attacker’s race window ends
- security team develops the fix
  - might reveal the details by involving more people
  - might reveal the details by publishing a fix to version control
- security team tests the fix
- security team publishes the fix
  - definitely reveals the details
- for fixes that replace validator software:
  - validators examine/consider/test the fix
  - somebody submits a governance vote to implement the fix
  - vote passes
  - validators upgrade software, restart
  - activation block height arrives, fix deployed
- for fixes that don’t:
  - governance/upgrade committee submits the fix txn to the chain
  - txn gets accepted into block, executed
  - fix deployed
- security committee decides fix is deployed, creates/signs the unpause txn, submits txn
- unpause txn is accepted into a block, executed
- activity resumes
If it looks like the pause window won’t be enough, a larger security committee might have the authority to extend it. We’ll need the pause events to have IDs so the txn that extends it can be easily matched to what is being extended.
The pause event should probably include a CVE or URL to a place where details can be found. The details should be withheld until the fix is deployed.
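One way to picture pause events with IDs, a reference field, and quorum-gated extensions is the sketch below. Everything in it is an assumption made for illustration: the quorum sizes, the `pause-N` ID format, the field names, and the example URL are all hypothetical.

```javascript
// Quorum sizes are illustrative assumptions, not proposed values.
const PAUSE_QUORUM = 2;
const EXTEND_QUORUM = 4; // extending requires a larger committee

function makePauseRegistry() {
  const pauses = new Map(); // id -> { target, expiresAtHeight, reference, extensions }
  let nextID = 1;
  return {
    // `reference` is a CVE ID or URL where details will later be published.
    createPause(target, expiresAtHeight, reference, signers) {
      if (new Set(signers).size < PAUSE_QUORUM) {
        throw new Error('pause quorum not met');
      }
      const id = `pause-${nextID}`;
      nextID += 1;
      pauses.set(id, { target, expiresAtHeight, reference, extensions: 0 });
      return id;
    },
    // The extension txn names the pause it extends by ID.
    extendPause(id, extraBlocks, signers) {
      const pause = pauses.get(id);
      if (!pause) throw new Error(`unknown pause ${id}`);
      if (new Set(signers).size < EXTEND_QUORUM) {
        throw new Error('extension quorum not met');
      }
      pause.expiresAtHeight += extraBlocks;
      pause.extensions += 1;
      return pause.expiresAtHeight;
    },
    get(id) {
      return pauses.get(id);
    },
  };
}

const registry = makePauseRegistry();
const pauseID = registry.createPause(
  'vat:v1',
  150,
  'https://example.com/advisory', // placeholder URL
  ['alice', 'bob'],
);
registry.extendPause(pauseID, 100, ['alice', 'bob', 'carol', 'dave']);
```

Counting distinct signers with a `Set` prevents one committee member from satisfying the quorum by signing twice.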
Subcomponents
- swingset `controller.run("but only the high-priority queue")`
- swingset `controller.pauseVats(vatIDs)`, `unpause`
- a pattern for contracts to register for pause events, like they do with governance
- a Cosmos-SDK module to receive the security committee txns and execute pause/unpause
- a pattern for Cosmos-SDK modules to check the pause table and reject txns when disabled
Related
@warner @jessysaurusrex I made this an epic. Can the two of you please coordinate on creating the appropriate sub-issues?
The Zoe feature allows the contract to block exercise of a subset of invitations, identified by their description strings. It doesn’t block delivery of arbitrary messages to the contract.