Add a mechanism to protect the system from running the experiment
As part of a recent discussion, it has been requested that we add a new element to the experiment language.
An experiment file declares the blocks that define the experiment protocol: hypothesis, method and rollbacks/remediation. However, as experiments get played in production, some operational concerns come to light. One of them is the ability to interrupt an execution based on system decisions made outside the experiment flow itself. As suggested in the other thread, this could be the case if your production system is under attack or is experiencing issues of any sort.
Right now, Chaos Toolkit has no built-in mechanism to ask whether it should continue running. It’s up to the operator to get that information out of band and interrupt the chaos process from the outside through a signal.
During the discussion, I was initially not in favor of adding such a construct to the language itself when there are alternatives, such as the one we just mentioned (signals) or controls, which are meant for these orthogonal operational concerns.
But I finally appreciated the context here. To go into a production system, an experiment has to prove it will be a good citizen. To achieve that, it is indeed a good idea to expose, in the experiment, a first-class block that says “this is how I ensure I play nice with the system”.
To that end, I suggest something similar to the original thread: a new element called safeguards. This element shall be a sequence of probes with a tolerance. The point made by Alexander about reusing known bricks is spot on. These probes would ask the system “can I carry on?” and, if the system says “no”, the experiment should interrupt itself as soon as technically possible.
This is therefore the specification proposal:
Safeguards
The Safeguards element is OPTIONAL. It describes when the experiment MUST be interrupted as soon as possible.
The Safeguards element is a JSON array of Probe elements.
Each Probe MUST define a tolerance property that acts as a gate mechanism for the experiment to carry on or terminate as soon as possible. Any Probe that does not fall into the tolerance zone MUST interrupt the experiment.
Safeguards MAY declare controls.
Safeguards Probes MUST be executed at least once during the experiment.
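To make the proposal above more concrete, here is a rough sketch of what an experiment declaring safeguards could look like, written as a Python dict that mirrors the JSON format, together with a small check of the rules listed above. The probe name, the URL and the `validate_safeguards` helper are assumptions made up for illustration; they are not part of Chaos Toolkit.

```python
# Hypothetical experiment, expressed as a Python dict mirroring the JSON file.
experiment = {
    "title": "Losing one node does not impact users",
    "description": "Illustrative only",
    "safeguards": [
        {
            "type": "probe",
            "name": "system-allows-chaos",  # made-up probe name
            "tolerance": 200,  # the gate: anything else interrupts the run
            "provider": {
                "type": "http",
                "url": "https://example.com/can-i-carry-on",  # made-up URL
            },
        }
    ],
    "method": [],
}


def validate_safeguards(experiment):
    """Return a list of problems found in the optional safeguards element."""
    safeguards = experiment.get("safeguards")
    if safeguards is None:
        return []  # the element is OPTIONAL
    if not isinstance(safeguards, list):
        return ["safeguards must be an array of probes"]
    errors = []
    for index, probe in enumerate(safeguards):
        if not isinstance(probe, dict) or probe.get("type") != "probe":
            errors.append(f"safeguards[{index}] must be a probe")
        elif "tolerance" not in probe:
            errors.append(f"safeguards[{index}] must define a tolerance")
    return errors
```

Calling `validate_safeguards(experiment)` on the dict above returns an empty list, while a safeguard probe without a tolerance is reported as an error.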
In addition, the Chaos Toolkit must accommodate this new element. I suggest the probes run in the background during the experiment at a specified frequency. Because the safeguards element can declare controls, safeguards can be manipulated at runtime the same way other elements can. This is mostly useful to disable or change the behavior of a safeguard probe at runtime.
Finally, the chaos run command will grow new flags to define the runtime strategy of the probes: at what frequency they run, and whether all of them must fail to interrupt the experiment or whether a single failing probe is enough.
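The two strategies mentioned here could be decided along these lines. This is only a sketch: the `should_interrupt` helper and the strategy names "any" and "all" are made up for illustration and do not correspond to existing flags.

```python
def should_interrupt(results, strategy="any"):
    """Decide whether the experiment must stop.

    `results` holds one boolean per safeguard probe, True when the probe
    fell within its tolerance. The strategy names are hypothetical.
    """
    failures = [not within_tolerance for within_tolerance in results]
    if not failures:
        return False  # no safeguards declared, nothing can interrupt
    if strategy == "any":
        return any(failures)  # a single failing probe is enough
    if strategy == "all":
        return all(failures)  # every probe must fail
    raise ValueError(f"unknown strategy: {strategy}")
```

With one failing probe out of two, `should_interrupt([True, False], "any")` is `True` while `should_interrupt([True, False], "all")` is `False`.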
Issue Analytics
- Created 3 years ago
- Comments: 13 (7 by maintainers)
Top GitHub Comments
I assume the safeguards element will be located before the SSH (steady-state hypothesis) element. I think it is better to let users decide, for each probe, whether it runs the regular way or concurrently. In addition, the method element works as I described above (I’m not 100% sure), so this adds consistency too.
Another thing my colleague pointed out is that there is no need to put "background":true,"frequency":60 in the same probe, because using frequency already means that the probe must run in the background.
The same idea applies to tolerance: if the user has tolerance in this element, CTK will handle the termination; if there is no tolerance, the user can terminate using exit_gracefully or decide to do nothing. I don’t think you need to force the behavior as in SSH probes, though maybe I am wrong here and there are other things to think about.
Another thing that came to my mind is that termination, whether initiated by CTK or through exit_gracefully, must be aware of the currently executed element: if you are still before the method, you don’t need to roll back; otherwise you do.
According to @Lawouach this is delivered, closing the issue.