# Matching for Non-Random Studies

Thu, Jul 25, 2019

onExperimental designs such as A/B testing are a cornerstone of statistical practice. By randomly assigning treatments to subjects, we can test the effect of a test versus a control (as in a clinical trial for a proposed new drug) or can determine which of several web page layouts for a promotional offer receives the largest response. Designed, controlled experiments are a common feature of much of scientific and business research. THe internet is a natural platform from which to launch tests on almost any topic, and the principles of randomization are easily understood.

Unfortunately, not all data are collected in this manner. Observational studies are still commonplace. In medicine, many published results rely on the observation of patients as they receive care, as opposed to participating in a planned study. On the internet, offers are often sent to whitelists which are made up of non-randomly selected potential customers. And in manufacturing, it may not always be possible to control environmental variables such as temperature and humidity, or operational characteristics such as traffic intensity.

The fact that treatments are not randomly assigned should not necessarily mean that the data subsequently collected is of no use. In this note we will discuss ways that researchers can adjust their analysis to account for the way in which treatments were assigned or observed. We begin with a study of men who were admitted to hospital with suspicion of heart attack. There were 400 subjects, aged 40-70, with mortality within 30 days as the outcome. The treatment of interest in this case is whether a new therapy involving a test medication is more effective than the standard therapy. The issue is that the new therapy was not applied randomly, but rather that patient prognosis was a partial determinant in which treatment was given.

The raw data shows a lower mortality rate for the treatment group:

Alive | Dead | |
---|---|---|

Control | 168 | 40 |

Test | 165 | 27 |

This corresponds to an odds ratio of 0.687, suggesting lower overall
mortality for the treatment group (although a test for the hypothesis of
equal mortality rates does not reject at *α* = 0.05). The odds ratio in
this case is the ratio of the odds of death given the test relative to
the odds of death given the control; values less than 1 show a stronger
likelihood of survival for test subjects relative to control subjects.

However, as suggested above, it does not appear that the treatment and control are completely similar in terms of their covariates, as might be expected in a fully randomized design - it appears that older subjects and those with higher severity scores (a clinical rating of disease severity) received the test at higher rates than the control:

Quantification of the treatment effect is at least partially confounded
with some of the underlying conditions that correlate with increased
expected mortality. So how to proceed? Regression (in this case,
logistic regression) is typically used to measure the effects of a
variable conditional on levels of the other variables, adjusting for
inequities in the distributions of the explanatory factors. Using a
logistic regression in this case, with a binary variable for treatment
(0 = Control, 1 = Test) and also linear variables for Age, Serverity,
and Risk Score (another prognostic assessment), we find that the odds
ratio for treatment vs control drops to 0.549, and is significantly less
than 1 at *α* = 0.05, with a 95% confidence interval of (0.31, 0.969).

However, this is an observational study, and to be useful, we need to
have more evidence that the effect of the test treatment can be measured
cleanly apart from the other variables AND from any (possibly
unintentional) selection biases that might arise due to the way
treatments were assigned to patients. That is, we need an unbiased
estimate of treatment effect, i.e., the effect of the test treatment
have if applied to the entire population of interest. To do this, we
need to introduce the concept of a *counterfactual*.

### Propensity Scores and Other Matching Methods

In a perfect world, we would be able to measure the effect of both the
treatment and control on each subject. In most cases, including our
example on heart attacks, such a measurement is not available and not
possible. If we denote the response to the test treatment as
*Y*_{1} and the response to the control treatment as
*Y*_{0}, we see that only one of these is observable – subjects
that get the treatment correspond to *Y*_{1} and subjects that
get the control correspond to *Y*_{0}. For each case, the
unobservable outcome is called a counterfactual, a conceptual quantity
that does not exist. If we use Z to indicate which treatment was applied
(*Z* = 1 for test and *Z* = 0 for control), then the observation Y can
be expressed as the sum of an observation and an unobserved
counterfactual:

*Y* = *Z**Y*_{1} + (1 − *Z*)*Y*_{0}

For each subject we are interested in *Y*_{1} − *Y*_{0},
the difference between the observation and the unobserved
counterfactual. When looking at the target population as whole, we are
likely interested in the *average treatment effect*:

*Δ* = *E*(*Y*_{1} − *Y*_{0})=*E*(*Y*_{1})−*E*(*Y*_{0})

where *E* denotes expectation or average. Unless the treatment
assignment is independent of the other factors, estimates for *Δ* might
be biased. Matching and other approaches known as *propensity scores*
are among the techniques that can be used to make the assignment of test
and control appear more random.

A propensity score is an estimate of the “probability” that a subject
gets assigned to test or control. If assignment is completely random,
then the propensity score is simply 0.50 for all subjects (assuming
equal sizes go into test and control). It can be shown that if all the
characteristics used to determine both *Z* (assignment to test or
control) and (*Y*_{0}, *Y*_{1}) are known (e.g., age,
severity), then partitioning on this set of *confounders* *X* will allow
us to develop an unbiased estimate of *Δ*.

One form of partitioning is straightforward *matching* of test and
control subjects together when they share the same values of all
confounders. Matching may be easy to do when there are only one or two
confounders, but rapidly gets harder and the number of potential
confounders increases. A more flexible approach is propensity score
matching – it can also be shown that partitioning on propensity scores
*p*(*x*) are as good as partitioning on the raw *X*-variables
themselves, as long as all subjects have a change of being selected for
both test and control.

In our example, what does this imply? To obtain propensity scores, we
can build a logistic regression model with response variable *Z* (the
probability that a subject is assigned to the test group). This gives us
estimated propensity scores $\hat{p}(x)$. Doing this in our example
finds that age, severity, and risk index all are significant at
*α* = 0.05 for predicting the treatment assignment.

There are a variety of approaches we can take at this point. One is to use the estimated propensity scores to match test and control accounts, as if we had conducted a randomized matched pairs design that assigns the test treatment to one subject in each pair. Doing this we get an estimate of the odds ratio of 0.511, with 95% confidence limits of (0.268, 0.973), only slightly lower than that found via logistic regression.

A second approach is to use the propensity score to weight subjects and proceed as if we were doing a randomized design with weights of treatment assignment given by the reciprocal of the $\hat{p}(x)$. This approach gives an estimated odds ratio of 0.567 with confidence limits of (0.301, 1.064) - again very close to that obtained by the original logistic regression.

It is important when doing any of matching or score adjustments to make sure that the adjusted groups (test and control) resemble each other as much as possible. To determine the effect of the matching, we look at each variable in turn regressed against both treatment and propensity score. In the following plot, filled circles show the t-statistics for each treatment difference for each of the confounder variable before adjusting for propensity score, and the open circles after adjustment. The propensity score appears to have made the test and control groups look much more similar to each other than before.

Summarizing the results, we see that all the adjustment methods give
similar results in terms of measuring the effect of the test treatment.
It is often the case that logistic regression, which produces an
estimate of the *conditional* effect of treatment given the confounding
variables, is very similar to the adjusted analyses that produce
estimates of the *marginal* effect of the treatment on the population.
Whenever there is interest in estimating the effects of treatments, it
is recommended that a propensity-based analysis be conducted to ensure
that potential biases due to non-random design are mitigated to the
highest degree possible.

Estimate | Lower CI | Upper CI | |
---|---|---|---|

Unadjusted | 0.687 | 0.403 | 1.171 |

Log. Regression | 0.549 | 0.310 | 0.969 |

Matched cases | 0.511 | 0.268 | 0.973 |

Propensity-weighted | 0.567 | 0.301 | 1.064 |

### Acknowledgements and Resources

This work is based on part on a modified version of an analyses by Ben Cowling (http://web.hku.hk/~bcowling/examples/propensity.htm#thanks).

To read more about propensity scores and matching, see Dehijia and Wahba (http://www.uh.edu/~adkugler/Dehejia&Wahba.pdf) and Rosenbaum and Rubin (http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf).

**Matchit** is an R package that implements many forms of matching
methods to better balance data sets
(https://gking.harvard.edu/matchit).