How to Set Up and Run an A/B Test

Testing is one of the best ways to optimize your email campaign, but it can also be daunting to figure out. Not only do you need to understand how to set up and run a test, you also need to know what to test, how long to run it, and what it means when a test is statistically significant (or what "statistically significant" even means). And even once all of that is done, you have only reached the tip of the testing iceberg.

There are many different types of tests you can do to optimize your email program, but in this post we will cover the steps needed to run the most common one: A/B testing (sometimes called split testing).

Before you begin testing, it's critical to start with a hypothesis: know what you are testing and what the desired outcome is. A good hypothesis can make your testing a breeze, because the analysis is quicker and easier when the outcome is clearly defined. On the other hand, a bad hypothesis can make testing a nightmare. The hypothesis helps decide how large your test and control groups need to be, how long you need to run the test, and whether a test has conclusive results or not.

The possibilities for testing are endless, but one test you can run starting today is a subject line test. If you don’t have subject lines to test, use Return Path’s Subject Line Optimizer to help create new subject lines for testing. The basic steps for a subject line test would look like this:

1. Create a hypothesis
For example: The new (test) subject line will increase read rate by five percentage points. This hypothesis states that you are judging the success of the test on read rate, and that the increase needed for the test to succeed is five percentage points. Stating it this way helps you choose the size of your test and control groups, which in turn helps you determine whether your test results are statistically significant. It also clearly defines whether the test was successful (whether or not the stated goal was reached).

2. Use your hypothesis to create a random sample for test and control groups
In our test, we are testing read rate and will therefore use the historical read rate to decide how many people need to be in the test and control groups. Say you have a historical read rate of 12% for the subject line you are testing. Next, decide the minimum difference you want to be able to detect. We want to see an increase of five percentage points, but do we also want to know whether a smaller change is statistically significant? The smaller the change you want to detect, the larger your sample size will need to be. Say we want to detect a change of two percentage points or more (+/- 2 percentage points). You can then use Return Path’s Sample Size Calculator to decide how large your test and control groups need to be.

For a list of 100,000 people with a confidence level of 95% and a 2% margin of error, your test and control groups will each need 1,936 people. Next, create a random sample of 1,936 people for each group, and double-check that the read rates of the two random groups are roughly the same.
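
To see roughly what a sample size calculator does, here is a minimal Python sketch of the textbook margin-of-error formula for a proportion, with a finite population correction. The function name and its parameters are our own illustration, not Return Path's calculator. The calculator's exact method isn't described in this post, so don't expect this sketch to reproduce the 1,936 figure.

```python
import math
from statistics import NormalDist


def sample_size(population, margin_of_error, confidence=0.95, rate=0.5):
    """Sample size for estimating a proportion (hypothetical helper).

    rate=0.5 is the most conservative assumption; pass a historical
    rate (for example 0.12) for a smaller, less conservative estimate.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Size needed if the list were effectively infinite.
    n0 = (z ** 2) * rate * (1 - rate) / margin_of_error ** 2
    # Finite population correction for a list of known size.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# List of 100,000 people, 95% confidence, 2% margin of error.
print(sample_size(100_000, 0.02))
print(sample_size(100_000, 0.02, rate=0.12))
```

Halving the margin of error roughly quadruples the required sample until the list size starts to cap it, which is why smaller detectable changes call for much larger groups.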

Fun fact: thanks to the law of large numbers and the Central Limit Theorem, two sufficiently large random groups drawn from the same list will have approximately the same rate, with any remaining difference due to chance alone.
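
As a rough illustration of drawing the two groups and checking that they start out alike, here is a short Python sketch. The subscriber IDs, the simulated 12% read history, and both helper names are assumptions made up for the example. In practice you would pull these from your own email platform.

```python
import random


def split_test_control(subscribers, group_size, seed=None):
    """Draw two non-overlapping random groups of equal size."""
    subscribers = list(subscribers)
    if 2 * group_size > len(subscribers):
        raise ValueError("list is too small for two groups of this size")
    rng = random.Random(seed)
    # Sampling without replacement keeps anyone from landing in both groups.
    picked = rng.sample(subscribers, 2 * group_size)
    return picked[:group_size], picked[group_size:]


def read_rate(group, readers):
    """Share of a group that appears in the set of past readers."""
    return sum(1 for s in group if s in readers) / len(group)


# Hypothetical data: 100,000 subscriber IDs, about 12% of whom read the last email.
history = random.Random(0)
readers = {i for i in range(100_000) if history.random() < 0.12}

test, control = split_test_control(range(100_000), 1_936, seed=42)
print(f"test: {read_rate(test, readers):.1%}, control: {read_rate(control, readers):.1%}")
```

If the two rates come out far apart, redraw the sample before sending anything.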

3. Understand statistical significance for your groups
The question of whether the results are "statistically significant" is really about determining whether a test’s results are meaningful. How do you know if your test results are statistically significant? The best way is to have a properly stated hypothesis and a correct sample size for the test and control groups.

Continuing with the example of our test, if your control read rate is 12% and the test read rate is 13%, the difference is not statistically significant and your results would be inconclusive, even though one group performed better than the other. If the control read rate is 12% and the test read rate is 15%, the difference is statistically significant and the test subject line performed significantly better than the control.
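
One common way to check examples like these yourself is a two-proportion z-test. The Python sketch below is a swapped-in standard technique, not necessarily the method behind the numbers in this post. It assumes 1,936 people in each group and a 5% significance threshold, and the function name is our own.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(rate_control, rate_test, n_control, n_test):
    """Return (z, two-sided p-value) for the difference between two rates."""
    # Pooled rate under the assumption that both groups share one true rate.
    pooled = (rate_control * n_control + rate_test * n_test) / (n_control + n_test)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
    z = (rate_test - rate_control) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


for test_rate in (0.13, 0.15):
    z, p = two_proportion_z_test(0.12, test_rate, 1_936, 1_936)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"12% vs {test_rate:.0%}: z={z:.2f}, p={p:.3f} -> {verdict}")
```

Under these assumptions the sketch agrees with the examples above: the one-point difference falls short of the threshold, and the three-point difference clears it.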

4. Determine how long to run the test
This is a tricky question that requires both domain expertise and statistical knowledge. Performing this type of A/B test on email is more akin to traditional A/B testing and less like the newer A/B testing that you may be doing elsewhere (for example, impression-type ads on webpages; more on this in a later post). This means that, rather than waiting to see if you have enough impressions to make a decision, you decide the sample size (how many emails to send) beforehand. Therefore, the number of impressions doesn't determine how long the test runs; the actual rate does.

If you have an average read rate of 12% on the first day and you are trying to increase that read rate by five percentage points by changing the subject line, there are three possible results by the next day: failure, success, or inconclusive. One note: other tests, such as frequency tests, are run differently because a change in frequency has a long-term effect. There, the question becomes whether people behave the same or differently after they have had time to adjust to the new frequency.

5. Compare results
Say we ran our subject line test. We first compare the rates and check whether the difference is statistically significant using the sample size information (as we did above). Then, using our hypothesis and the results of the significance testing, we decide whether our test was a success, a failure, or inconclusive.

What does a “success,” “failure,” or “inconclusive” result mean for your company goals? What if our test only improved the read rate by three percentage points? Even though a three percentage point increase is still statistically significant, by your test hypothesis the test would be a failure because three percentage points is less than your stated goal of five percentage points. Is the three percentage point increase a large enough change to push that new subject line to everyone? If it is, should the hypothesis be restated?
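
The decision rule described in this step can be written down in a few lines. This Python sketch is only an illustration: the function name and its arguments are hypothetical, and it assumes you have already determined significance separately.

```python
def classify_result(lift_points, is_significant, goal_points=5.0):
    """Label a test outcome against the hypothesis goal (in percentage points)."""
    if not is_significant:
        # The difference could be chance, so no decision either way.
        return "inconclusive"
    if lift_points >= goal_points:
        return "success"
    # A real change, but smaller than the hypothesis called for.
    return "failure"


print(classify_result(lift_points=1.0, is_significant=False))  # inconclusive
print(classify_result(lift_points=3.0, is_significant=True))   # failure
print(classify_result(lift_points=5.0, is_significant=True))   # success
```

A "failure" here only means the stated goal was missed. Whether a smaller, significant lift is still worth rolling out remains a business decision.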

6. Draw conclusions and move forward
If our test is a success, what now? Now we can push the new subject line to the entire list. But does that mean that you are done? Not even close! One of the best and worst things about testing is that you can always do more. Something important to remember, especially in the case of subject lines, is that even if a test is successful, that subject line's success may decline over time as you get new subscribers, as seasons or trends change, and so on. Therefore, if you are constantly testing and trying new things, you will always be able to find new ways to push your ROI higher.

Check back soon for the next blog post in this series!
