Add a mechanism to protect the system from running the experiment

As part of a recent discussion, it has been requested we add a new element to the experiment language.

An experiment file declares the blocks that define the experiment protocol: hypothesis, method and rollbacks/remediation. However, as experiments get played into production, some operational concerns come to light. One of them is the ability to interrupt an execution based on system decisions made outside the experiment flow itself. As suggested in the other thread, this could be the case if your production system is under attack or is experiencing issues of any sort.

Right now, Chaos Toolkit has no built-in mechanism to ask whether it should continue running. It’s up to the operator to get that information out of band and interrupt the chaos process from the outside by sending it a signal.

Throughout the discussion, I was initially not in favor of adding such a construct to the language itself when there are alternatives, such as the one just mentioned (signals) or controls, which are meant for these orthogonal operational concerns.

But I finally appreciated the context here. To go into a production system, an experiment has to prove it will be a good citizen. To achieve that, it is indeed a good idea to expose, in the experiment, a first-class block that says “this is how I ensure I play nice with the system”.

To that end, I suggest something similar to the original thread: a new element called safeguards. This element shall be a sequence of probes, each with a tolerance. The point made by Alexander about reusing known bricks is spot on. These probes would ask the system “can I carry on?” and, if the system says “no”, the experiment should interrupt itself as soon as technically possible.

This is therefore the specification proposal:

Safeguards

The Safeguards element is OPTIONAL. It describes when the experiment MUST be interrupted as soon as possible. 

The Safeguards element is a JSON array of Probe elements.

Each Probe MUST define a tolerance property that acts as a gate mechanism for the experiment to carry on or terminate as soon as possible. Any Probe that does not fall into the tolerance zone MUST interrupt the experiment.

Safeguards MAY declare controls.

Safeguards Probes MUST be executed at least once during the experiment.
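
To make the proposed rules concrete, here is a minimal sketch of what a safeguards element could look like and how its constraints might be checked. It illustrates the proposal only, not the format Chaos Toolkit actually shipped. The "safeguards" key, the probe name, and the mycompany.ops module and can_carry_on function are hypothetical, as is the validate_safeguards helper.

```python
from typing import Any, Dict, List

# Hypothetical experiment fragment following the proposal above.
# The module and function named in the provider are made up for illustration.
experiment: Dict[str, Any] = {
    "title": "Example experiment",
    "safeguards": [
        {
            "type": "probe",
            "name": "system-allows-chaos",
            "tolerance": True,
            "provider": {
                "type": "python",
                "module": "mycompany.ops",  # hypothetical module
                "func": "can_carry_on",     # hypothetical function
            },
        }
    ],
}


def validate_safeguards(experiment: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the element is valid."""
    safeguards = experiment.get("safeguards")
    if safeguards is None:
        return []  # the element is OPTIONAL
    if not isinstance(safeguards, list):
        return ["safeguards must be an array of probes"]

    errors: List[str] = []
    for index, probe in enumerate(safeguards):
        if not isinstance(probe, dict):
            errors.append(f"safeguard #{index} must be an object")
            continue
        name = probe.get("name", f"#{index}")
        if probe.get("type") != "probe":
            errors.append(f"safeguard {name} must be a probe")
        if "tolerance" not in probe:
            errors.append(f"safeguard {name} must define a tolerance")
    return errors
```

Calling validate_safeguards(experiment) on the fragment above returns an empty list. Removing the tolerance property from the probe yields one error for that probe.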

In addition, the Chaos Toolkit must accommodate this new element. It is suggested that the probes run in the background during the experiment at a specified frequency. Because the safeguards element can declare controls, they can be manipulated at runtime the same way other elements can. This is mostly useful to disable or change the behavior of a safeguard probe at runtime.
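
The background execution suggested here could be sketched as follows. This is not how Chaos Toolkit implements it. The SafeguardRunner class and its attributes are hypothetical, and the tolerance check is simplified to plain equality. The probe is always called once before the stop flag is consulted, which matches the "at least once" requirement.

```python
import threading
from typing import Any, Callable


class SafeguardRunner:
    """Run one safeguard probe in a background thread (hypothetical helper)."""

    def __init__(self, probe: Callable[[], Any], tolerance: Any, frequency: float):
        self.probe = probe
        self.tolerance = tolerance
        self.frequency = frequency  # seconds between two probe calls
        # Set when the probe falls outside its tolerance.
        self.interrupted = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self) -> None:
        while True:
            # The probe runs before the stop flag is checked, so it is
            # executed at least once even if stop() is called right away.
            if self.probe() != self.tolerance:
                self.interrupted.set()
                return
            # Sleep for one period; wake up early if stop() was called.
            if self._stop.wait(self.frequency):
                return

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

The experiment loop would then check runner.interrupted.is_set() between activities and bail out as soon as it returns True, calling runner.stop() once the experiment ends.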

Finally, the chaos run command will grow new flags to define the runtime strategy of the probes: at what frequency should they run, and must they all fail to interrupt the experiment, or should a single failure have that power?
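
The two strategies raised in that question can be sketched as a small decision function. The strategy names "any" and "all" are assumptions made for illustration; the proposal does not name the flags or their values. Note that "any" matches the specification text above, where a single probe outside its tolerance interrupts the experiment.

```python
from typing import Iterable


def should_interrupt(within_tolerance: Iterable[bool], strategy: str = "any") -> bool:
    """Decide whether to interrupt, given one boolean per safeguard probe.

    Each boolean is True when the probe was within its tolerance.
    "any": a single failing probe interrupts the experiment.
    "all": every probe must fail before the experiment is interrupted.
    """
    failures = [not ok for ok in within_tolerance]
    if not failures:
        return False  # no safeguards declared, nothing can interrupt
    if strategy == "any":
        return any(failures)
    if strategy == "all":
        return all(failures)
    raise ValueError(f"unknown strategy: {strategy}")
```

With one passing and one failing probe, should_interrupt([True, False]) returns True, whereas should_interrupt([True, False], "all") returns False.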

Issue Analytics

State:
Created 3 years ago
Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions

alexander-gorelik commented, Oct 22, 2020

I assume the safeguards element will be located before the SSH (steady-state hypothesis) element. I think it is better to let users decide, for each probe, whether it runs the regular way or concurrently. In addition, the method element works this way (I’m not 100% sure), as I described above, so this adds consistency too.

Another thing my colleague pointed out is that there is no need to put "background":true,"frequency":60 in the same probe, because using frequency already means that the probe must run in the background.

The same idea applies to tolerance: if the user sets a tolerance in this element, CTK will handle the termination; if there is no tolerance, the user can terminate using exit_gracefully or decide to do nothing. I don’t think you need to force the behavior as in SSH probes, but maybe I’m wrong here and there are other things to think about.

Another thing that came to my mind is that termination, whether initiated by CTK or through exit_gracefully, must be aware of the currently executed element: if you are still before the method, you don’t need to roll back; otherwise you do.

0 reactions

ciaranevans commented, Aug 6, 2021

According to @Lawouach this has been delivered, so I’m closing the issue.
