Three Approaches to MultiVariable Testing Online
                          Roger Longbotham, December 11, 2007

This paper looks at MultiVariable testing (MVT) from a very high level. First, I will give my perspective on the benefits and limitations of MVT versus A/B, or one-factor-at-a-time, testing. It is helpful to keep these in mind, since the three approaches should be evaluated by how well they take advantage of the potential benefits and mitigate the limitations.

I see two primary benefits of a single MVT versus multiple A/B tests to test the same factors:
  1. You can test many factors in a short period of time, accelerating improvement
  2. You can estimate interactions between factors.


Three common limitations are:

  1. Some combinations of factors may give a poor user experience. For example, two factors being tested for an online retailer may be enlarging a product image and providing additional product detail. Both may improve sales when tested individually, but when both are done the “buy box” is pushed below the fold and sales decrease. If this is caught in the planning phase, these two factors would not be tested at the same time. If it’s not caught, the MVT would detect a (potentially large) negative interaction. It is certainly possible in a case such as this that the interaction is so large that the two main effects are negative as well, even though either factor, if tested alone, would be positive.
  2. Analysis and interpretation are more difficult. For a single factor test you typically have many metrics for the treatment-control comparison. For an MVT you have the same metrics for many treatment-control comparisons (at least one for each factor being tested) plus the analysis and interpretation of the interactions between the factors. Certainly, the information set is much richer but it can make the task of assessing which treatments to roll out more complex.
  3. It can take longer to begin the test. If you have seven factors you want to test and plan to test them one at a time you can start with any of those that are ready to be tested and test the others later. With an MVT you must have all seven ready for testing at the beginning of the test. If only one is delayed, this would delay the start of the test.
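
The first limitation can be made concrete with a small worked example. This is a minimal sketch with invented conversion rates (the numbers and the function name `factorial_effects` are assumptions for illustration, not data from any real test). The interaction is defined here as the difference between a factor's effect when the other factor is on and its effect when the other factor is off.

```python
# Hypothetical conversion rates (percent) for each cell of a 2x2 test.
# Keys are (large_image, extra_detail); the numbers are invented for illustration.
cell_means = {
    (0, 0): 5.0,  # control
    (1, 0): 5.4,  # larger image only
    (0, 1): 5.3,  # extra detail only
    (1, 1): 3.5,  # both changes: buy box pushed below the fold
}

def factorial_effects(m):
    """Return (main effect A, main effect B, interaction AB) from 2x2 cell means."""
    effect_a_alone = m[(1, 0)] - m[(0, 0)]   # A's effect when B is off
    effect_a_with_b = m[(1, 1)] - m[(0, 1)]  # A's effect when B is on
    effect_b_alone = m[(0, 1)] - m[(0, 0)]   # B's effect when A is off
    effect_b_with_a = m[(1, 1)] - m[(1, 0)]  # B's effect when A is on
    # In a factorial, a main effect averages over the levels of the other factor.
    main_a = (effect_a_alone + effect_a_with_b) / 2
    main_b = (effect_b_alone + effect_b_with_a) / 2
    interaction = effect_a_with_b - effect_a_alone
    return main_a, main_b, interaction

main_a, main_b, interaction = factorial_effects(cell_means)
print(round(main_a, 2), round(main_b, 2), round(interaction, 2))
```

With these invented numbers each change helps on its own (+0.4 and +0.3 points), yet both factorial main effects come out negative (about -0.7 and -0.8 points) because the interaction (about -2.2 points) dominates.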


I don’t believe any of the limitations are serious ones in most cases, but they should be recognized before conducting an MVT. Generally, I believe the first test one does should be an A/B test mainly due to the complexity of testing more than one factor in the same test.


In this paper I will discuss three overarching philosophies for conducting MVTs on online properties.

 

Traditional MVT


This approach borrows designs used in manufacturing and other offline applications. These designs are sometimes known as Taguchi designs even though they were in use long before Taguchi began using them. They are almost always fractional factorial (Davies and Hay, 1950) or Plackett-Burman (Plackett and Burman, 1946) designs, which are specific subsets of full factorial designs (all combinations of factor levels). The user must be careful to choose a design with sufficient resolution to estimate the main effects and interactions of interest.


A commonly used design is one where seven two-level factors are tested with eight groups of users (treatment combinations). A full factorial for these factors would have 128 treatment combinations, so the 8 treatment combination design is a 1/16th fraction of a full factorial. This design can estimate all seven main effects but cannot estimate any interactions (Box, 2005 pp. 235-305). For many experimenters, one of the primary reasons for running an MVT is to estimate the interactions among the factors being tested. It is literally impossible to estimate any interactions with this design, since all two-factor interactions are totally confounded with the main effects. (In fact, if any of the factors are interacting, the main effects are questionable because of the confounding.) No amount of effort at analysis or data mining will allow you to estimate the interactions in addition to the main effects. The information needed to estimate them simply isn’t there. If you want to estimate all two-factor interactions with seven factors, you will need a fractional factorial design with 64 treatment combinations; anything less than that will have two-factor interactions confounded with each other and/or with main effects. Therefore, with seven factors, unless you run a large design with half as many treatment combinations as the full factorial, you cannot estimate all of the interactions, and you lose one of the benefits of running an MVT.
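
The confounding in the eight-run design can be checked directly. The sketch below builds one standard version of the design, assuming the generators D=AB, E=AC, F=BC, G=ABC (a common textbook choice, not necessarily the one used by any particular tool; the factor letters and the helper `alias_of` are illustrative), and then looks up which main-effect column each two-factor interaction column coincides with.

```python
from itertools import combinations, product

# Three base factors in a full 2^3 factorial (8 runs), coded -1/+1.
runs = list(product((-1, 1), repeat=3))

# Assumed generators for the remaining four factors: D=AB, E=AC, F=BC, G=ABC.
design = {
    "A": [a for a, b, c in runs],
    "B": [b for a, b, c in runs],
    "C": [c for a, b, c in runs],
    "D": [a * b for a, b, c in runs],
    "E": [a * c for a, b, c in runs],
    "F": [b * c for a, b, c in runs],
    "G": [a * b * c for a, b, c in runs],
}

def alias_of(f1, f2):
    """Return the factor whose column is identical to the f1*f2 interaction column."""
    inter = [x * y for x, y in zip(design[f1], design[f2])]
    for name, col in design.items():
        if col == inter:
            return name
    return None  # no main effect shares this column

# Map every two-factor interaction to the main effect it is confounded with.
aliases = {f1 + f2: alias_of(f1, f2) for f1, f2 in combinations(design, 2)}
for pair, main in sorted(aliases.items()):
    print(f"{pair} is confounded with main effect {main}")
```

Under these assumed generators, every one of the 21 two-factor interactions shares its column with some main effect, so the data cannot distinguish, for example, an A-by-B interaction from the main effect of D.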


Instead I am recommending two alternatives that I believe are better than the traditional MVT approach. The one you prefer will depend on how highly you value estimating interactions.


MVT by running concurrent tests


Fractions of the full factorial are used in offline testing because there is usually a cost to using more treatment combinations even when the number of experimental units does not increase. This does not have to be the case with tests conducted on internet sites. If we set up each factor to run as a one-factor experiment, we can simplify our efforts and get a full factorial in the end. In this mode we start and stop all of these one-factor tests at the same time, on the same set of users, with users being independently and randomly assigned to each experiment. The end result is a full factorial in all the factors you are testing. Of course, with a full factorial you will be able to estimate any interaction you want. A side benefit of this approach is that you can turn off any factor at any time (for example, if a factor treatment is bombing) without affecting the experiment on the remaining factors.
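
One way to get independent assignment per experiment is to hash the user ID together with the experiment name. The sketch below is an illustration under that assumption, not a description of any particular platform; the experiment names, the `assign` function, and the use of SHA-256 are all hypothetical choices.

```python
import hashlib
from collections import Counter

def assign(user_id, experiment, treatment_share=0.5):
    """Return 1 (treatment) or 0 (control) for this user in this experiment.

    Hashing the experiment name together with the user ID gives each
    experiment its own split, effectively independent of the others.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return 1 if bucket < treatment_share else 0

# Hypothetical factor names; each runs as its own one-factor experiment.
experiments = ["image_size", "product_detail", "button_color"]

# Count how many users land in each combination of assignments.
cells = Counter(
    tuple(assign(uid, exp) for exp in experiments) for uid in range(8000)
)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

With three two-level factors, all 2^3 = 8 combinations are populated in roughly equal numbers, which is the full factorial. Because a user's assignment in one experiment depends only on that experiment's name, dropping an entry from `experiments` leaves every other assignment unchanged.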


One misconception I have heard repeated a number of times is that the power of the experiment decreases with the number of treatment combinations (cells). This is false. If your sample size (e.g. number of users) is fixed, it doesn’t matter if you are testing a single factor or many or whether you are conducting an eight run MVT or a full factorial. The power to detect a difference for any main effect is the same. You can find the theory to support this in the statistical literature on design of experiments (e.g. Box, 2005). There are two things that will decrease your power, though. One is increasing the number of levels (variants) for a factor. This will effectively decrease the sample size for any comparison you want to make. The other is to assign less than 50% of the test population to the treatment (if there are two levels).
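
The reasoning can be sketched numerically. Assuming independent users, a common per-user variance `sigma2`, and a balanced design (all assumptions of this illustration; the function names are hypothetical), the variance of a difference between two group means depends only on how many users are in each of the two groups being compared, not on how many cells the other factors create.

```python
def main_effect_variance(n_users, sigma2=1.0, treatment_share=0.5):
    """Variance of (treatment mean - control mean) for one two-level factor.

    The number of other factors (and hence cells) does not appear: in a
    balanced design each level of this factor still gets its full share
    of the n_users, averaged over the levels of every other factor.
    """
    n_treat = n_users * treatment_share
    n_ctrl = n_users * (1 - treatment_share)
    return sigma2 / n_treat + sigma2 / n_ctrl

def pairwise_variance(n_users, sigma2=1.0, levels=2):
    """Variance of a comparison between two variants when the users are
    split evenly across `levels` variants of the same factor."""
    per_variant = n_users / levels
    return 2 * sigma2 / per_variant

print(main_effect_variance(10000))                       # 50/50 split
print(main_effect_variance(10000, treatment_share=0.1))  # 10/90 split
print(pairwise_variance(10000, levels=3))                # three variants
```

A 50/50 split gives a variance of 4 * sigma2 / N. Moving to a 10/90 split or to three variants raises the variance of the comparison, which is the loss of power described above.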


Overlapping Experiments


This approach is to simply test a factor as a one-factor experiment when the factor is ready to be tested with each test being independently randomized. These tests can be going on simultaneously if there is no obvious user experience issue with the combinations that could be shown to any visitor. This is the approach you should take if you want to maximize the speed with which ideas are tested and you are not interested in or concerned with interactions. Large interactions between factors are actually rarer than most people believe. This is a much better alternative than the traditional approach mentioned first. With the traditional approach you have the limitation that you can’t test until all the factors are ready to be tested. In addition, when you’re done (with most test designs that are recommended) you won’t be able to estimate interactions either. With Overlapping Experiments you test the factors more quickly and, if there is sufficient overlap in any two factors, you can estimate the interaction between those factors. If you are especially interested in the interaction between two factors you can plan to test those factors at the same time.
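
Estimating an interaction from the overlap can be sketched as follows. This is a minimal illustration that assumes you can collect, for each user active while both tests were running, the two assignments and a metric value; the record layout and the function name `interaction_estimate` are hypothetical. The interaction is taken as a difference of differences between the four cell means.

```python
from collections import defaultdict

def interaction_estimate(records):
    """Estimate the A-by-B interaction from overlapping one-factor tests.

    records: iterable of (in_a_treatment, in_b_treatment, metric) for users
    who were exposed to both experiments during the overlap period.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, y in records:
        sums[(a, b)] += y
        counts[(a, b)] += 1
    needed = [(0, 0), (1, 0), (0, 1), (1, 1)]
    missing = [cell for cell in needed if cell not in counts]
    if missing:
        raise ValueError(f"no overlapping users in cells: {missing}")
    mean = {cell: sums[cell] / counts[cell] for cell in needed}
    # Effect of A when B is on, minus effect of A when B is off.
    return (mean[(1, 1)] - mean[(0, 1)]) - (mean[(1, 0)] - mean[(0, 0)])

# Tiny invented data set: (A assignment, B assignment, metric).
overlap = [(0, 0, 1.0), (0, 0, 3.0), (1, 0, 4.0), (0, 1, 3.0), (1, 1, 8.0)]
print(interaction_estimate(overlap))
```

The estimate is only as precise as the smallest of the four cells allows, which is why sufficient overlap between the two tests matters.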


Summary


I believe the two alternatives presented above are better than the traditional MVT approach. The one you use will depend on your priorities. If you want to test ideas as quickly as possible and aren’t concerned about interactions, use the Overlapping Experiments approach. If you think it is important to estimate interactions and want the maximum ability to estimate them, run the experiments concurrently with users being independently randomized into each test.


References:

 

  1. Davies, O. L., and W. A. Hay (1950). “Construction and uses of fractional factorial designs in industrial research”, Biometrics, 6, 233.
  2. Box, George E. P., William G. Hunter, and J. Stuart Hunter (2005). Statistics for Experimenters, Wiley and Sons, New York.
  3. Plackett, R. L., and J. P. Burman (1946). “The Design of Optimum Multifactorial Experiments”, Biometrika, 33, pp. 305-25.