[Question] What if the returned mean & sem of status_quo are different in each iteration?
In our online A/B tests we use BatchTrial mode and set a status_quo (control group). Because of fluctuations in the online metrics, the mean and sem of status_quo usually differ from one iteration to the next.
Suppose we need to optimize a range parameter $a \in [0.0, 1.0]$. The objective metric is video playback time per capita, which can only be obtained through online experiments. We started an online experiment with three groups, where the control group’s parameter is $a = 0.0$. The experiment went through two iterations, one from Monday to Wednesday and the other from Thursday to Saturday, and we obtained the following data frame:
arm_name | trial_index | a | mean(s) | sem | time |
---|---|---|---|---|---|
0_0 | 0 | 0.3 | 1600 | 0 | Mon. - Wed. |
0_1 | 0 | 0.5 | 1700 | 0 | Mon. - Wed. |
Control | 0 | 0.0 | 1650 | 0 | Mon. - Wed. |
1_0 | 1 | 1.0 | 2500 | 0 | Thu. - Sat. |
1_1 | 1 | 0.8 | 2600 | 0 | Thu. - Sat. |
Control | 1 | 0.0 | 2300 | 0 | Thu. - Sat. |
(For convenience, assume the sem of each group is 0.)
If I directly use `Models.BOTORCH(experiment=exp, data=exp.eval())` to generate a model, it prints a warning:

> [WARNING 03-16 14:37:02] ModelBridge: Status quo status_quo found in data with multiple features. Use status_quo_features to specify which to use.

But as the results show, the means from different iterations cannot be directly compared, so I can’t specify which one to use.
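For reference, specifying a reference observation would look roughly like the following (a sketch only; it assumes the `status_quo_features` keyword is forwarded by `Models.BOTORCH` to the underlying `ModelBridge`):

```python
from ax.core.observation import ObservationFeatures
from ax.modelbridge.registry import Models

# Pick one of the two status_quo observations as the reference, e.g. the one
# from trial 1 (assumes the keyword is forwarded to the underlying ModelBridge).
model = Models.BOTORCH(
    experiment=exp,
    data=exp.fetch_data(),
    status_quo_features=ObservationFeatures(
        parameters={"a": 0.0},
        trial_index=1,
    ),
)
```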
My workaround is to first compute, for each arm, the relative difference with respect to the control group of the corresponding iteration. After this processing I have the following data:
arm_name | trial_index | a | relative mean | sem | time |
---|---|---|---|---|---|
0_0 | 0 | 0.3 | (1600 - 1650) / 1650 = -0.030 | 0 | Mon. - Wed. |
0_1 | 0 | 0.5 | (1700 - 1650) / 1650 = 0.030 | 0 | Mon. - Wed. |
Control | 0 | 0.0 | 0 | 0 | Mon. - Wed. |
1_0 | 1 | 1.0 | (2500 - 2300) / 2300 = 0.087 | 0 | Thu. - Sat. |
1_1 | 1 | 0.8 | (2600 - 2300) / 2300 = 0.130 | 0 | Thu. - Sat. |
Control | 1 | 0.0 | 0 | 0 | Thu. - Sat. |
Then I pass this relativized data to `Models.BOTORCH`.
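For concreteness, the relativization above can be computed with plain pandas along these lines (a sketch; the data frame mirrors the first table, and with sem = 0 no error propagation is needed):

```python
import pandas as pd

# Raw per-arm results, mirroring the first table above.
df = pd.DataFrame({
    "arm_name":    ["0_0", "0_1", "Control", "1_0", "1_1", "Control"],
    "trial_index": [0, 0, 0, 1, 1, 1],
    "a":           [0.3, 0.5, 0.0, 1.0, 0.8, 0.0],
    "mean":        [1600, 1700, 1650, 2500, 2600, 2300],
    "sem":         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

# Control mean per iteration (trial_index -> control mean).
control_mean = (
    df[df["arm_name"] == "Control"]
    .set_index("trial_index")["mean"]
    .rename("control_mean")
)

# Relativize each arm against the control of its own iteration.
df = df.join(control_mean, on="trial_index")
df["rel_mean"] = (df["mean"] - df["control_mean"]) / df["control_mean"]
# With sem = 0 nothing needs to be propagated; with nonzero sems the
# relativized sem would have to be derived (e.g. via the delta method).
```

The `rel_mean` column reproduces the relativized table above; to hand it back to Ax it would need to be repackaged as an Ax `Data` object with the usual `arm_name`, `metric_name`, `mean`, `sem`, and `trial_index` columns.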
My questions are:
- How does Ax handle this scenario?
- Is my method correct? If it is, would Ax consider supporting it internally?
Top GitHub Comments
A general comment:
It’s pretty common to see shifts in the metrics like that from one time period to the other, which does introduce a challenge when trying to do multiple iterations of BO using A/B tests.
There are two approaches that we’ve used to handle this. The first and easiest is to try relativizing the data in each batch with respect to its control, and see whether things are stable across time that way.

For example, suppose in my first batch I run (Control, 0_0, 0_1, 0_2, 0_3). In the second batch we will often repeat a few of those arms and add in a few new ones, so the second batch might be (Control, 0_0, 0_2, 1_1, 1_2), where 1_1 and 1_2 are the new arms and 0_0 and 0_2 are repeated. (We tend to do large batches of 20-50 arms, in which case we keep around 5 repeated from the initial batch.) Then we compare “0_0 / control” in the first batch to “0_0 / control” in the second batch. Even though both control and 0_0 have had a mean shift, we find that the % change from control to 0_0 is most often relatively stable. If it is, then it is safe to compute the % change for all of the arms and fit a model directly to the % changes across all batches.
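A manual version of that check is just a pivot of the relativized values for the repeated arms; here is a small sketch with hypothetical data (columns `arm_name`, `trial_index`, `rel_mean`, for an experiment that actually repeats some arms across batches):

```python
import pandas as pd

# Hypothetical relativized results where arms 0_0 and 0_2 were repeated in
# batch 1 (rel_mean = (mean - control_mean) / control_mean, per batch).
rel = pd.DataFrame({
    "arm_name":    ["0_0", "0_1", "0_2", "0_0", "0_2", "1_1"],
    "trial_index": [0, 0, 0, 1, 1, 1],
    "rel_mean":    [-0.03, 0.03, 0.05, -0.02, 0.06, 0.10],
})

# One column per batch; arms present in both batches get values in both.
wide = rel.pivot(index="arm_name", columns="trial_index", values="rel_mean")

# Repeated arms: if the effects are stable over time, the two columns should
# be close to each other for these rows.
repeated = wide.dropna(subset=[0, 1])
print(repeated)
```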
There is a diagnostic for checking stability across batches. If you run this command:
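A likely form of this call, assuming Ax’s `interact_batch_comparison` diagnostic in `ax.plot.diagnostic` (the exact module path and argument order here are an assumption):

```python
from ax.core.observation import observations_from_data
from ax.plot.diagnostic import interact_batch_comparison
from ax.utils.notebook.plotting import render

# Compare repeated arms between batch 0 and batch 1
# (assumed signature: observations, experiment, batch_x, batch_y).
observations = observations_from_data(exp, exp.fetch_data())
render(interact_batch_comparison(observations, exp, batch_x=0, batch_y=1))
```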
it will generate a plot that compares the raw values for any repeated arms in batch 0 and batch 1 (as specified by batch_x and batch_y). For instance it might look like this if you have two arms that are present in both batches:
In this case it is good, because we see that the values of the arms were consistent from batch 0 to batch 1 (in this particular experiment, this is after relativizing each batch with respect to its control). If the points do not line up on the diagonal, then you will likely not get a good model fit when combining across batches. It is really important to have a small set of arms (ideally 4 or 5) that are repeated in every batch so you can verify this stability.
The second approach is to use the multi-task model. With this approach you don’t have to relativize the data in each batch, since the model effectively learns the adjustment across batches for you. The modeling becomes a little more complicated since, as you discovered, you now need to specify which batch to generate points (or plots) for, which is done with the `fixed_features` input. Another complication is that if you have many batches, this model will get really slow because it models a separate task for each batch; you’ll want to specify a reduced rank. Also note that when using the multi-task model it is very important (even more so than when relativizing) to have a few arms (e.g. 4 or 5) that are repeated in every batch; this gives the model a fixed reference for learning the adjustment across batches.

Our most typical practice for combining batches in online experiments is the relativization approach, along with repeated arms and verifying with the plot that things are stable after relativization. If things are not stable after relativization, or for other settings like offline-online, that is when we use the multi-task model.
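A rough sketch of that workflow, assuming the `get_MTGP` factory for the multi-task GP model bridge and that `fixed_features` is used to pin the target trial:

```python
from ax.core.observation import ObservationFeatures
from ax.modelbridge.factory import get_MTGP

# Fit a multi-task GP that treats each batch (trial) as its own task, so the
# batch-to-batch shift is learned instead of being removed by relativization.
m = get_MTGP(experiment=exp, data=exp.fetch_data())

# Candidate generation (and plotting) must be pinned to one batch/task; here
# we condition on the most recent trial via fixed_features.
gr = m.gen(
    n=2,
    fixed_features=ObservationFeatures(parameters={}, trial_index=1),
)
```

The exact factory name and keyword set may differ across Ax versions; the key point is that generation is conditioned on a single task via `fixed_features`, and with many batches a reduced-rank setting (as mentioned above) keeps the model tractable.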
This is now in OSS as per above commit, so going to close out. Let us know if you have any other questions!