Tools for Online Experimentation
Power Calculator
The power calculator is most useful when you are planning an experiment and need to determine how large a sample you need. It is much less useful after the experiment has run. The required sample size is not the only factor in deciding how long to run your experiment. Other considerations:
• Daily and weekly trends. Since almost any online metric shows strong daily and weekly trends, an experiment should run for complete days and complete weeks if possible. For example, if your test ran for 9 full days starting on a Friday afternoon, you would have 5 weekdays and 4 weekend days in your experiment. This is not representative of your users over the long term, because user behavior on the weekend may differ from behavior during the week. This experiment would give too much weight to weekends.
• Suspected trend in the effect. If you suspect the effect of the Treatment may be better or worse initially than in the long term, you should run your experiment for a minimum of four weeks to get an indication of whether there is a trend. If there does appear to be a trend, you may need to extend the test to get a better estimate of the long-term trend and of the asymptote of the effect.
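The two considerations above can be folded into a simple run-length rule. The following is only a rough sketch: the function name, its parameters, and the assumption that new users arrive at a steady daily rate and are each counted once are illustrative and are not part of the power calculator. In practice, returning visitors make the cumulative number of unique users grow more slowly than this assumes.

```python
from math import ceil


def run_length_days(total_sample_needed, users_per_day, suspect_trend=False):
    """Round an experiment's run length up to complete weeks.

    total_sample_needed: users required across all groups (from a power calculation).
    users_per_day: assumed steady number of new users entering the experiment per day.
    suspect_trend: if True, enforce the four-week minimum mentioned above.
    """
    # Days needed just to reach the required sample size.
    days_for_sample = ceil(total_sample_needed / users_per_day)
    # Round up to whole weeks so every day of the week gets equal weight.
    weeks = max(1, ceil(days_for_sample / 7))
    if suspect_trend:
        weeks = max(weeks, 4)
    return weeks * 7


if __name__ == "__main__":
    # Hypothetical traffic figures, for illustration only.
    print(run_length_days(90_000, 10_000))
    print(run_length_days(90_000, 10_000, suspect_trend=True))
```

With these hypothetical figures, a sample that could be collected in 9 days is extended to two full weeks, which avoids the weekend over-weighting described in the first bullet.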
What is power? The power of an experiment is its ability to detect a change of a given size in the average of the primary metric due to the Treatment. You can also think of it as the sensitivity of the test. It is usually expressed as the probability that the experiment will produce a statistically significant result when the Treatment has an effect of that size.
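To make the definition concrete, here is a minimal sketch of a power calculation for a conversion-rate metric. It assumes a two-sided z-test on two proportions, equal group sizes, and a 5% significance level; these choices, and the function name, are assumptions for illustration and are not taken from the power calculator itself.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, delta, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    p_control: expected conversion rate in the Control.
    delta: absolute difference (Treatment minus Control) to detect.
    n_per_group: number of users in each of Control and Treatment.
    """
    std_normal = NormalDist()
    p_treatment = p_control + delta
    # Standard error of the difference between the two observed rates.
    se = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treatment * (1 - p_treatment) / n_per_group
    )
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Power to detect a 0.004 change from a 0.08 baseline at several sample sizes.
    for n in (10_000, 40_000, 74_000):
        print(n, round(power_two_proportions(0.08, 0.004, n), 3))
```

Power rises with sample size: under these assumptions, detecting a change of 0.004 from a baseline of 0.08 with roughly 80% probability takes on the order of 74,000 users per group.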
In order to use the power calculator you need to input a few pieces of information. You need to determine how sensitive you want or need your experiment to be and you need to know something about the statistical characteristics of your data. For each of these there are two options.
1. Specify the size of change (also known as delta) you need to detect
• In the scale of the primary metric, or OEC. For example, suppose your OEC (primary metric) is conversion rate (percent of users that convert) and your average conversion rate in the past has been 0.08 (or 8%). You may need to be able to detect a change of +/- 0.004. That is, if the Treatment is 0.004 less than or greater than the Control, you want the experiment to have sufficient power to detect it.
• As a percentage of the OEC. For conversion rate, you may want to detect a change that is 5% of the current conversion rate. This is equivalent to the 0.004 above (5% of 0.08 is 0.004); it is just a different way of specifying it. The power calculator allows you to specify your delta in either way.

2. Determine the type of metric for the OEC. We need to distinguish between two types of metric, binary and non-binary, because the power calculation takes different inputs for each. A separate worksheet is set up for each type.

• Binary. A metric is binary if it can take only one of two values for each measurement opportunity. For example, in the calculation of conversion rate, a measurement opportunity is a user/visitor. Either the visitor converts or not, so conversion rate is a binary metric. The information you will input in this case is the proportion of 1's (or 0's) expected in the Control. In most cases this will just be the current conversion rate.
• Non-binary. A metric is non-binary if it can take more than two values for each measurement opportunity. Examples include number of units purchased per user, dollar amount spent per user, and clicks per user. The information you will input for this type of metric is the standard deviation for a recent week. If you are inputting the delta as a percent of the Control, you will also need to input the average for a recent week.
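The inputs described in steps 1 and 2 map onto a standard sample-size approximation, sketched below. This is not the calculator's own formula, which is not given here: the two-sided z-test, equal group sizes, 5% significance level, 80% power, and all function and parameter names are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the critical values for a two-sided test and the target power."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


def n_per_group_binary(p_control, delta, relative=False, alpha=0.05, power=0.8):
    """Users needed per group for a binary metric such as conversion rate.

    delta is an absolute change, or a fraction of p_control if relative=True.
    """
    d = p_control * delta if relative else delta
    p_treatment = p_control + d
    # A binary metric's variance follows from its proportion: p * (1 - p).
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil(_z_sum(alpha, power) ** 2 * variance_sum / d ** 2)


def n_per_group_non_binary(std_dev, delta, mean=None, alpha=0.05, power=0.8):
    """Users needed per group for a non-binary metric.

    std_dev: standard deviation of the metric for a recent week.
    delta: absolute change, or a fraction of `mean` when `mean` is supplied.
    """
    d = mean * delta if mean is not None else delta
    # Both groups are assumed to share the same standard deviation.
    return ceil(2 * (_z_sum(alpha, power) * std_dev / d) ** 2)


if __name__ == "__main__":
    # The same delta from the text, specified both ways.
    print(n_per_group_binary(0.08, 0.004))
    print(n_per_group_binary(0.08, 0.05, relative=True))
    # Hypothetical non-binary metric: mean 20, standard deviation 30, 5% delta.
    print(n_per_group_non_binary(30.0, 0.05, mean=20.0))
```

Note how the binary case needs only the Control proportion, while the non-binary case needs the standard deviation, plus the average only when the delta is given as a percentage. That matches the inputs listed above for each worksheet.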
Tool | Description | Requirements
PowerCalc2 | New simplified calculator; handles binary and non-binary metrics and provides standard deviation range estimates for cases where the s.d. is not known | Excel 2007 or 2010
POWER CALCULATOR | Power Calculator for Excel 2007 | Excel 2007