tools to pause chain/kernel/vats before security upgrades

What is the Problem Being Solved?

Imagine we (Agoric) have just received disclosure of a significant security bug in some component of the running chain. How can we safely deploy a fix, without giving attackers time to exploit the problem?

The vulnerability might already be known to the attackers, and they’ve just been waiting for it to become worth exploiting (e.g. waiting for a liquidity pool to grow to a juicy size), so they may execute their attack as soon as they see/suspect a fix coming. Or they don’t already know the problem, but can reverse-engineer it from the fix, and then perform the attack before the fix is fully deployed.

The core issue is the non-zero time taken by each step of the defender’s sequence (learning about a problem, fixing it, deploying the fix) relative to the attacker’s sequence (learning about a problem, developing an exploit, executing the exploit). This problem exists in distributed systems of all shapes and sizes, but it’s particularly exciting for decentralized systems, where there is no one party with the authority to make a change. The fix may involve changing some parameter within a contract, or upgrading the contract, or upgrading the entire chain. Some fixes require transactions sent into the chain (which must make their way through various public queues before execution, giving attackers an opportunity to front-run them, and exposing them to MEV threats from observant validators). Deeper fixes require coordinating the validator community to upgrade their software. And both kinds of fixes might be telegraphed by commits to an open-source code repository before they are ready to be deployed. Both the transactions and the commits reveal significant information to the attackers, who may then be able to act before the fix is fully implemented.

A powerful tool to address this is the “snooze button”. A small group can have the power to pause some or all of the chain’s activity, giving a larger group time to develop and deploy a fix. Then, after the fix is deployed, the chain is resumed. The pause event can reveal the existence of a problem, but not the details, reducing the attacker’s advantage. Only the attacker who already knew about a problem and was ready to execute their attack (and can race ahead) can react to the pause event.

Once paused, the defenders can work on the fix in public, or at least they can safely involve a larger group to test the fix and coordinate deployment. This reveals the details to the attackers, but by that point it is too late for them to exploit.

Users of our system care about liveness: knowing that their transactions can’t be blocked forever (at least not without the approval of some larger governance committee). They care that this “snooze button” has a limited duration, perhaps a few days or a few weeks. But we can imagine various “sizes” of snooze buttons, with larger governance requirements over the longer-duration delays.

Categories of Attack, Categories of Fixes

We’re imagining problems that affect components at various scales:

a single contract has a problem, which could be addressed by changing some parameter
- Pause: pause the contract vat, causing all inbound messages to be queued off to the side
- Fix: allow a high-priority non-paused message to change the parameter
- Resume: resume delivery from the side queue, then allow main-queue messages to arrive
a single contract has a problem, which requires a complete vat/contract upgrade
- Pause: pause the contract vat, queue all inbound messages off to the side
- Fix: perform an upgrade of the vat (#3272)
- Resume: resume delivery from the side queue, then allow main-queue messages to arrive
a collection of contracts have a problem
- Pause: the kernel stops servicing the low-priority queues (#3465), but allows high-priority messages so e.g. liquidation continues but new vault creation is paused
- Fix: vat upgrade, parameter change
- Resume: the kernel resumes servicing the low-priority queues
the entire swingset kernel has a problem
- Pause: the kernel stops servicing all queues
- Fix: the kernel is upgraded
- Resume: the kernel resumes servicing all queues
one or more Cosmos-SDK modules have a problem
- Pause: a governance/emergency-pause module tells those modules to reject all txns
- Fix: a governance module modifies some parameter, or the validation software is upgraded
- Resume: the governance/emergency-pause module tells those modules to start accepting txns again
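The single-vat pause/fix/resume cycle above can be sketched with a toy model. This is only an illustration under stated assumptions: `VatInbox`, `pause`, `deliver`, and `resume` are hypothetical names, not part of the Swingset kernel, and a real implementation would have to persist the side queue in kernel state.

```javascript
// Toy model of a pausable vat inbox (hypothetical; not the real Swingset API).
class VatInbox {
  constructor(handler) {
    this.handler = handler; // function invoked for each delivered message
    this.paused = false;
    this.sideQueue = []; // messages held "off to the side" while paused
  }

  pause() {
    this.paused = true;
  }

  // Ordinary messages are queued while the vat is paused.
  // High-priority messages (e.g. the parameter fix) bypass the pause.
  deliver(message, { highPriority = false } = {}) {
    if (this.paused && !highPriority) {
      this.sideQueue.push(message);
      return;
    }
    this.handler(message);
  }

  // Drain the side queue in arrival order, then accept main-queue traffic again.
  resume() {
    while (this.sideQueue.length > 0) {
      this.handler(this.sideQueue.shift());
    }
    this.paused = false;
  }
}

const log = [];
const inbox = new VatInbox((m) => log.push(m));
inbox.deliver('trade-1');
inbox.pause();
inbox.deliver('trade-2'); // held in the side queue
inbox.deliver('set-fee-param', { highPriority: true }); // the fix gets through
inbox.resume(); // trade-2 is delivered now
inbox.deliver('trade-3');
console.log(log); // [ 'trade-1', 'set-fee-param', 'trade-2', 'trade-3' ]
```

Draining the side queue before clearing the `paused` flag keeps older messages ahead of anything that arrives during the drain, which matches the "resume delivery from the side queue, then allow main-queue messages" ordering.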

We also imagine fine-grained contract pauses, in which the contract consults a table of what activity should and should not be allowed at any given moment. The contract might reject method invocations when paused, or it might queue them internally. We can imagine contracts registering to hear about updates to the “emergency pause table”, via high-priority update messages. In this approach:

Pause: use the bridge-device mechanism to send an update, wait for it to be delivered to the contract vat
Fix: send a message to the contract to change a parameter, or perhaps upgrade the vat entirely
Resume: update the table, wait for the vat to hear about the update

A similar “pause table” could be used at the Cosmos-SDK level, between Go modules, without using the bridge device.
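A contract-side pause table might look something like the sketch below. All names here (`makePausableContract`, `updatePauseTable`, `invoke`, the `openVault`/`liquidate` methods) are hypothetical and chosen for illustration; the `whenPaused` option models the two behaviors mentioned above, rejecting invocations or queueing them internally.

```javascript
// Hypothetical sketch of a contract that consults an "emergency pause table".
function makePausableContract(whenPaused = 'reject') {
  const paused = new Set(); // local copy of the pause table (method names)
  const deferred = []; // invocations queued while their method is paused
  let vaults = 0;

  const methods = {
    openVault: () => {
      vaults += 1;
      return vaults;
    },
    liquidate: (id) => `liquidated vault ${id}`,
  };

  const invoke = (name, ...args) => {
    if (paused.has(name)) {
      if (whenPaused === 'queue') {
        deferred.push([name, args]);
        return undefined;
      }
      throw new Error(`${name} is paused`);
    }
    return methods[name](...args);
  };

  // Delivered as a high-priority update message whenever the table changes.
  const updatePauseTable = (names) => {
    paused.clear();
    names.forEach((n) => paused.add(n));
    // Replay queued calls; any that are still paused are queued again.
    const pending = deferred.splice(0);
    for (const [name, args] of pending) invoke(name, ...args);
  };

  return { invoke, updatePauseTable, vaultCount: () => vaults };
}

const contract = makePausableContract('queue');
contract.updatePauseTable(['openVault']); // Pause
contract.invoke('openVault'); // queued, not executed
const liquidation = contract.invoke('liquidate', 7); // still allowed
const countWhilePaused = contract.vaultCount(); // 0
contract.updatePauseTable([]); // Resume: the queued openVault runs
const countAfterResume = contract.vaultCount(); // 1
```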

Most of these pauses would be initiated by a Cosmos-SDK module, which reacts to a quorum of signed transactions from a small “security committee”. This module would then change parameters, send bridge-device updates, and tell the Swingset module how/whether to interact with the kernel. For example, the Swingset module currently calls the swingset controller.run(runPolicy) method during END_BLOCK to perform a bounded amount of work (pulling from all queues in priority order). If the pause type was “stop servicing low-priority queues”, this module would be instructed to instead call controller.run(runPolicy, { onlyServiceQueue: 'high'}) or similar. Timer and mailbox events would still be pushed onto the run-queue, but the low-priority consequences would not happen until the setting was changed.
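To make the proposed `onlyServiceQueue` option concrete, here is a toy controller with two queues. It is not the real Swingset controller: `makeToyController`, `push`, and `pending` are invented for this sketch, `onlyServiceQueue` is only the option name floated above, and the run policy is reduced to a hypothetical `maxCranks` count where the real one is richer.

```javascript
// Toy controller with two priority queues (hypothetical; not the Swingset API).
function makeToyController() {
  const queues = { high: [], low: [] };
  return {
    push(priority, task) {
      queues[priority].push(task);
    },
    // Performs at most runPolicy.maxCranks units of work, in priority order.
    // With onlyServiceQueue set, every other queue is left untouched.
    run(runPolicy, { onlyServiceQueue } = {}) {
      const order = onlyServiceQueue ? [onlyServiceQueue] : ['high', 'low'];
      let cranks = 0;
      for (const name of order) {
        while (queues[name].length > 0 && cranks < runPolicy.maxCranks) {
          const task = queues[name].shift();
          task();
          cranks += 1;
        }
      }
      return cranks;
    },
    pending: (name) => queues[name].length,
  };
}

const done = [];
const controller = makeToyController();
controller.push('low', () => done.push('open-vault'));
controller.push('high', () => done.push('liquidate'));

// While paused: only the high-priority queue is serviced.
controller.run({ maxCranks: 10 }, { onlyServiceQueue: 'high' });
const duringPause = [...done]; // [ 'liquidate' ]
const lowBacklog = controller.pending('low'); // 1

// After the pause is lifted: all queues are serviced in priority order.
controller.run({ maxCranks: 10 });
console.log(done); // [ 'liquidate', 'open-vault' ]
```

Low-priority work is still enqueued during the pause; it simply accumulates until a run without the restriction drains it.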

To maintain liveness, each of these pauses needs to be clearly time-bounded. The Cosmos-SDK module that receives the security committee txn needs to watch the block height and unpause everything when the pause expires. Additional votes (with a larger quorum requirement) might extend the pause if more time is necessary to develop/test/deploy the fix.
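The block-height expiry could be tracked along these lines. The real module would be Go code inside the Cosmos-SDK; this JavaScript sketch only illustrates the bookkeeping, and `makePauseKeeper`, `pause`, `endBlock`, and `isPaused` are hypothetical names.

```javascript
// Hypothetical bookkeeping for time-bounded pauses, keyed by pause target.
function makePauseKeeper() {
  const pauses = new Map(); // target -> block height at which the pause expires
  return {
    pause(target, currentHeight, durationBlocks) {
      pauses.set(target, currentHeight + durationBlocks);
    },
    // Called once per block: lifts every expired pause and reports which
    // targets were unpaused so the caller can resume them.
    endBlock(currentHeight) {
      const expired = [];
      for (const [target, expiry] of pauses) {
        if (currentHeight >= expiry) {
          pauses.delete(target);
          expired.push(target);
        }
      }
      return expired;
    },
    isPaused: (target) => pauses.has(target),
  };
}

const keeper = makePauseKeeper();
keeper.pause('vat-amm', 100, 50); // expires at block 150
const at149 = keeper.endBlock(149); // []
const stillPaused = keeper.isPaused('vat-amm'); // true
const at150 = keeper.endBlock(150); // [ 'vat-amm' ]
const pausedAfter = keeper.isPaused('vat-amm'); // false
```

Because the expiry check runs on every block, the pause ends even if the security committee never submits an unpause transaction, which is the liveness guarantee described above.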

Disclosure Timeline

We imagine a sequence like the following:

security researcher notifies a member of the security team about a potential problem
security team quietly investigates, concludes the problem is severe enough to warrant the snooze button
security committee is quietly informed, convinced to snooze, signs the txn, submits the txn
- prepared attacker learns about the upcoming pause, might try to race ahead and deploy attack
- all attackers become aware of the service that is vulnerable, but not the nature of the vuln
pause txn gets accepted into a block, activity is now paused
- prepared attacker’s race window ends
security team develops the fix
- might reveal the details by involving more people
- might reveal the details by publishing a fix to version control
security team tests the fix
security team publishes the fix
- definitely reveals the details
for fixes that replace validator software:
- validators examine/consider/test the fix
- somebody submits a governance vote to implement the fix
- vote passes
- validators upgrade software, restart
- activation block height arrives, fix deployed
for fixes that don’t replace validator software:
- governance/upgrade committee submits the fix txn to the chain
- txn gets accepted into block, executed
- fix deployed
security committee decides fix is deployed, creates/signs the unpause txn, submits txn
unpause txn is accepted into a block, executed
activity resumes

If it looks like the pause window won’t be enough, a larger security committee might have the authority to extend it. We’ll need the pause events to have IDs so the txn that extends it can be easily matched to what is being extended.

The pause event should probably include a CVE or URL to a place where details can be found. The details should be withheld until the fix is deployed.
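One way to match extensions to pause events is to give each pause an ID and require a larger quorum to extend it. The sketch below makes several assumptions: `makePauseRegistry`, the quorum sizes, the `reference` field, and the signer names are all hypothetical, and signers are plain strings with no real signature verification.

```javascript
// Hypothetical registry of pause events with ID-matched extensions.
function makePauseRegistry({ pauseQuorum, extendQuorum }) {
  const events = new Map(); // pause ID -> { target, expiresAt, reference }
  let nextId = 1;
  const countUnique = (signers) => new Set(signers).size;

  return {
    pause({ target, height, duration, reference, signers }) {
      if (countUnique(signers) < pauseQuorum) {
        throw new Error('pause quorum not met');
      }
      const id = `pause-${nextId}`;
      nextId += 1;
      // `reference` stands in for a CVE or URL where details appear later.
      events.set(id, { target, expiresAt: height + duration, reference });
      return id;
    },
    // Extending requires the pause ID and a larger quorum than pausing did.
    extend({ id, extraBlocks, signers }) {
      const event = events.get(id);
      if (!event) throw new Error(`unknown pause ${id}`);
      if (countUnique(signers) < extendQuorum) {
        throw new Error('extension quorum not met');
      }
      event.expiresAt += extraBlocks;
      return event.expiresAt;
    },
    get: (id) => events.get(id),
  };
}

const registry = makePauseRegistry({ pauseQuorum: 2, extendQuorum: 4 });
const pauseId = registry.pause({
  target: 'low-priority-queues',
  height: 1000,
  duration: 500,
  reference: 'advisory-placeholder',
  signers: ['alice', 'bob'],
});

let smallGroupRejected = false;
try {
  registry.extend({ id: pauseId, extraBlocks: 500, signers: ['alice', 'bob'] });
} catch (e) {
  smallGroupRejected = true; // two signers can pause, but not extend
}

const newExpiry = registry.extend({
  id: pauseId,
  extraBlocks: 500,
  signers: ['alice', 'bob', 'carol', 'dave'],
});
console.log(pauseId, newExpiry); // pause-1 2000
```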

Subcomponents

swingset controller.run(“but only the high-priority queue”)
swingset controller.pauseVats(vatIDs), unpause
a pattern for contracts to register for pause events, like they do with governance
a Cosmos-SDK module to receive the security committee txns and execute pause/unpause
a pattern for Cosmos-SDK modules to check the pause table and reject txns when disabled

#3528 (pause vats on meter underflow)
#4516 (pause kernel but let cosmic-swingset continue)

Issue Analytics

State:
Created 2 years ago
Comments:6 (6 by maintainers)

Top GitHub Comments

1 reaction

Tartuffo commented, Feb 23, 2022

@warner @jessysaurusrex I made this an epic. Can the two of you please coordinate on creating the appropriate sub-issues?

0 reactions

Chris-Hibbert commented, Jul 21, 2022

The Zoe feature allows the contract to block exercise of a subset of invitations, identified by their description strings. It doesn’t block delivery of arbitrary messages to the contract.
