[Question] What if the returned mean & sem of status_quo differ for each iteration?

In an online A/B test, we use BatchTrial mode and set a status_quo (control group). Because of fluctuations in the online metrics, we usually get a different mean and sem for the status_quo in each iteration.

Suppose we need to optimize a range parameter $a \in [0.0, 1.0]$. The objective metric is video playback time per capita, which can only be obtained through online experiments. We started an online experiment with three groups, where the control group's parameter is $a = 0.0$. The experiment went through two iterations, one from Monday to Wednesday and the other from Thursday to Saturday. We get the following data frame:

arm_name   trial_index   a     mean (s)   sem   time
0_0        0             0.3   1600       0     Mon. - Wed.
0_1        0             0.5   1700       0     Mon. - Wed.
Control    0             0.0   1650       0     Mon. - Wed.
1_0        1             1.0   2500       0     Thu. - Sat.
1_1        1             0.8   2600       0     Thu. - Sat.
Control    1             0.0   2300       0     Thu. - Sat.

(For convenience, assume that the sem of each group is 0.)
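
For concreteness, here is a minimal sketch of how data like this can be packaged for Ax, assuming Ax's standard Data columns; the metric name playback_time is my placeholder, not something from the issue:

import pandas as pd
from ax.core.data import Data

# Raw per-arm results from the two iterations; "playback_time" is a
# hypothetical metric name, and sem is set to 0 as assumed above.
df = pd.DataFrame({
    "arm_name": ["0_0", "0_1", "status_quo", "1_0", "1_1", "status_quo"],
    "trial_index": [0, 0, 0, 1, 1, 1],
    "metric_name": ["playback_time"] * 6,
    "mean": [1600.0, 1700.0, 1650.0, 2500.0, 2600.0, 2300.0],
    "sem": [0.0] * 6,
})
data = Data(df=df)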

If I directly use Models.BOTORCH(experiment=exp, data=exp.eval()) to generate a model, it prompts the warning: [WARNING 03-16 14:37:02] ModelBridge: Status quo status_quo found in data with multiple features. Use status_quo_features to specify which to use. But as the results show, the means from different iterations cannot be directly compared, so I can't specify which one to use.
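
For reference, specifying the status quo would look roughly like this; a sketch assuming the status_quo_features keyword is passed through to the underlying ModelBridge, with exp and data as above:

from ax.core.observation import ObservationFeatures
from ax.modelbridge.registry import Models

# Pin the status quo to its observation in trial 1. This silences the
# warning, but does not resolve the comparability problem described above.
m = Models.BOTORCH(
    experiment=exp,
    data=data,
    status_quo_features=ObservationFeatures(parameters={"a": 0.0}, trial_index=1),
)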

My method is to first calculate the relative difference of each arm with respect to the control group of the corresponding iteration. After processing, I have the following data:

arm_name   trial_index   a     mean (relative)                  sem   time
0_0        0             0.3   (1600 - 1650) / 1650 = -0.030    0     Mon. - Wed.
0_1        0             0.5   (1700 - 1650) / 1650 =  0.030    0     Mon. - Wed.
Control    0             0.0   0                                0     Mon. - Wed.
1_0        1             1.0   (2500 - 2300) / 2300 =  0.087    0     Thu. - Sat.
1_1        1             0.8   (2600 - 2300) / 2300 =  0.130    0     Thu. - Sat.
Control    1             0.0   0                                0     Thu. - Sat.
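
A minimal sketch of this relativization step in pandas (my illustration of the computation above, not code from the issue):

import pandas as pd

df = pd.DataFrame({
    "arm_name": ["0_0", "0_1", "Control", "1_0", "1_1", "Control"],
    "trial_index": [0, 0, 0, 1, 1, 1],
    "mean": [1600.0, 1700.0, 1650.0, 2500.0, 2600.0, 2300.0],
})

# Look up each trial's control mean, then relativize every arm against it:
# rel_mean = (mean - control_mean) / control_mean
control_mean = df[df["arm_name"] == "Control"].set_index("trial_index")["mean"]
df["mean"] = (df["mean"] - df["trial_index"].map(control_mean)) / df["trial_index"].map(control_mean)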

Then I pass this data to Models.BOTORCH. My questions are:

  1. How does Ax handle this scenario?
  2. Is my method correct? If so, does Ax consider supporting this approach internally?

Top GitHub Comments

bletham commented, Mar 26, 2020

A general comment:

It's pretty common to see shifts in the metrics like that from one time period to another, which introduces a challenge when trying to do multiple iterations of BO using A/B tests.

There are two approaches that we've used to handle this. The first, and the easiest, is to relativize the data for each batch with respect to its control, and see if things are stable across time that way. For example, suppose in my first batch I run (Control, 0_0, 0_1, 0_2, 0_3). Then in the second batch, we will often repeat a few of those arms and add in a few new ones. So the second batch might be (Control, 0_0, 0_2, 1_1, 1_2), where 1_1 and 1_2 are the new arms and 0_0 and 0_2 are repeated. (We tend to do large batches, of 20-50 arms, in which case we leave around 5 to be repeated from the initial batch.) Then we compare "0_0 / control" in the first batch to "0_0 / control" in the second batch. Even though the means of both control and 0_0 have shifted, we find that most often the % change from control to 0_0 is relatively stable. If it is, then it is safe to compute the % change for all of the arms and fit a model directly to the % changes across all batches.

There is a diagnostic for checking stability across batches. If you run this command:

from ax.utils.notebook.plotting import render
from ax.plot.diagnostic import interact_batch_comparison

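# Assumes m is a fitted ModelBridge (e.g. from Models.BOTORCH) and exp is the
# Experiment; batch_x and batch_y are the trial indices to compare.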
render(interact_batch_comparison(observations=m.get_training_data(), experiment=exp, batch_x=0, batch_y=1))

it will generate a plot that compares the raw values for any repeated arms in batch 0 and batch 1 (as specified by batch_x and batch_y). For instance it might look like this if you have two arms that are present in both batches:

[Screenshot: scatter plot of arm values in batch 0 vs. batch 1, with the repeated arms lying on the diagonal]

In this case it is good, because we see that the values of the arms were consistent from batch 0 to batch 1 (in this particular experiment, this is after relativizing each batch with respect to its control). If the points do not line up on the diagonal, then you will likely not get a good model fit when combining across batches. It is really important to have a small set (ideally 4 or 5) of arms that are repeated in every batch, to be able to verify this stability.

The second approach is to use the multi-task model. With this approach you don't have to relativize the data in each batch, since the model will effectively learn the adjustment across batches for you. The modeling aspects become a little more complicated since, as you discovered, you will now need to specify which batch to generate points or plots for, which is done with the fixed_features input. Another complication is that if you have many batches, this model will get really slow, because it is modeling a separate task for each batch; you'll want to specify a reduced rank. Another thing to note is that when using the multi-task model it is very important (even more so than when relativizing) to have a few arms (e.g. 4 or 5) that are repeated in every batch. This gives the model a fixed reference for learning the adjustment across batches.
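
A sketch of what that could look like, assuming the ST_MTGP entry in Ax's model registry and reusing exp and data from above:

from ax.core.observation import ObservationFeatures
from ax.modelbridge.registry import Models

# Fit a multi-task GP that models each batch (trial) as its own task.
m = Models.ST_MTGP(experiment=exp, data=data)

# Generate candidates for a particular batch by fixing the trial index.
gr = m.gen(
    n=2,
    fixed_features=ObservationFeatures(parameters={}, trial_index=1),
)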

Our most typical practice for combining batches in online experiments is the relativization approach, along with repeated arms and verifying with the plot that things are stable after relativization. If things are not stable after relativization, or for other settings like offline-online optimization, that is when we use the multi-task model.

ldworkin commented, May 12, 2020

This is now in OSS as per the above commit, so I'm going to close this out. Let us know if you have any other questions!
