Regression

From Potential Outcomes to Regression

  • The potential outcomes of every person in an experiment can be seen as equal to the average potential outcomes in the population, plus a residual or idiosyncratic component:

$$Y_{0_i}=E[Y_0]+r_{0_i}$$

$$Y_{1_i}=E[Y_1]+r_{1_i}$$

  • We can always express the observed outcome of person $i$ as follows:

$$Y_i=[Y_{1_i}\cdot T_i]+[Y_{0_i}\cdot(1-T_i)]$$

$$Y_i=[(E[Y_1]+r_{1_i})\cdot T_i]+[(E[Y_0]+r_{0_i})\cdot(1-T_i)]$$

$$Y_i=[T_iE[Y_1]+T_ir_{1_i}]+[(1-T_i)E[Y_0]+(1-T_i)r_{0_i}]$$

$$Y_i=T_iE[Y_1]+(1-T_i)E[Y_0]+T_ir_{1_i}+(1-T_i)r_{0_i}$$

$$Y_i=T_iE[Y_1]+E[Y_0]-T_iE[Y_0]+T_ir_{1_i}+(1-T_i)r_{0_i}$$

$$Y_i=E[Y_0]+T_i(E[Y_1]-E[Y_0])+T_ir_{1_i}+(1-T_i)r_{0_i}$$

Collecting terms, we have

$$Y_i={\color{red}{E[Y_0]}}+{\color{green}{[E[Y_1]-E[Y_0]]}}\cdot T_i+{\color{purple}{[r_{1_i}\cdot T_i+r_{0_i}\cdot(1-T_i)]}}$$

  • We call the first term ${\color{red}\alpha}$, the second term ${\color{green}\beta}$, and the third term ${\color{purple}r_i}$, and notice that $\beta=\text{ATE}$:

$$Y_i={\color{red}\alpha}+{\color{green}\beta}T_i+{\color{purple}r_i}$$
  • So, the observed outcome has a LINEAR relationship with the treatment status, where the intercept of the line is the mean potential outcome when untreated, and the slope of the line is the ATE
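
To make this concrete, here is a minimal simulation sketch (the numbers and variable names are illustrative, not from the notes): with randomly assigned treatment, regressing the observed $Y$ on $T$ recovers $E[Y_0]$ as the intercept and the ATE as the slope.

```python
# Minimal sketch (illustrative values): simulate a randomized experiment and
# check that the OLS slope of Y on T recovers the ATE.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes: population means plus idiosyncratic residuals r0, r1
y0 = 10 + rng.normal(0, 2, n)          # Y_0i = E[Y_0] + r_0i
y1 = 13 + rng.normal(0, 2, n)          # Y_1i = E[Y_1] + r_1i, so ATE = 3

t = rng.integers(0, 2, n)              # random assignment: T independent of the residuals
y = t * y1 + (1 - t) * y0              # observed outcome

# OLS of Y on T (with an intercept): the slope estimates beta = ATE,
# the intercept estimates alpha = E[Y_0]
beta_hat, alpha_hat = np.polyfit(t, y, 1)
print(alpha_hat, beta_hat)             # approximately 10 and 3
```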

Regression Mechanics

$$Y_i=\alpha+\beta T_i+\gamma X_i+r_i$$

  • $Y$, $T$, and $X$ are data in the world that we observe, and $\alpha$, $\beta$, and $\gamma$ are parameters.
  • These are the quantities we would like to estimate:
    • $\alpha$: the intercept or constant term
    • $\beta$: the effect of the treatment
    • $\gamma$: the effect of the control variable
  • The error term, or residual, $r$, which we hope (and assume) is uncorrelated with $T$.
    • Effect of Treatment Vs. Effect of Control Variable
  • There is nothing in the equation that distinguishes the treatment variable from the control variable.
  • This distinction is conceptual and is driven by
    • Research design
    • Question
  • Typically, we don't actually care about the value of $\gamma$, and it may or may not represent an effect of interest.
  • What's important is that $\beta$ is an effect of interest, and we have a research design that allows us to estimate it in an unbiased way.
    • Interpreting Parameters:
  • Literally:
    • Pretend that we know the data generating process (DGP)
    • For each individual $i$, their outcome = common intercept + $\beta T_i$ + $\gamma X_i$ + random noise.
  • Linear Approximation of Relationship:
    • Acknowledge that we don't know the DGP
    • We would still like to estimate $\beta$ as the average linear relationship between $Y$ and $T$, controlling for $X$.
  • Regression always gives us the best linear approximation to the conditional expectation function (BLACEF), which may or may not be interesting.
  • If $T$ is unrelated to the $Y_1$'s and $Y_0$'s after controlling for $X$, then the BLACEF is the effect of $T$ on $Y$.
    • Predictions and Errors:
  • When we run a regression, we are estimating the values of $\alpha$, $\beta$, and $\gamma$ that give us the best predictions of $Y$.
    • These estimates are written $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$.
  • Our prediction $\hat{Y_i}$ for each individual is $\hat{\alpha}+\hat{\beta}T_i+\hat{\gamma}X_i$.
    • Take our estimates $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ and plug in $T_i$ and $X_i$ to get a $\hat{Y_i}$ for each individual.
  • The error associated with each prediction can be written as

$$Y_i-\hat{Y_i}=Y_i-(\hat{\alpha}+\hat{\beta}T_i+\hat{\gamma}X_i)$$
  • We could square this error and then sum these squares for the entire population.
  • This total squared error indicates the extent to which the regression fits the data, with a value of $0$ indicating a perfect fit, and higher values indicating worse fit.
    • Fitting Regression
  • We have a measure of how well our estimates fit the data: Sum of Squared Errors.
  • We could compare two different sets of estimates and see which fits the data better.
  • More generally, we could find the values $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ that minimize this sum of squared errors.
    • This is exactly what regression does: Ordinary Least Squares (OLS); see the sketch after this list.
    • Interpreting the Equation
  • Let's focus on a simple linear regression: $Y=\alpha+\beta X$
  • $\alpha$: the intercept, or the value of $Y$ when all $X$'s are $0$.
    • Sometimes all $X$'s being zero is meaningless, or an inappropriate extrapolation.
  • $\beta$: the slope, or how much the average $Y$ changes for a 1-unit change in the associated $X$.
  • Parameters of the regression model: unknown quantities we have to estimate using our data in order to fit our proposed regression model. In this case, $\alpha$ and $\beta$.
  • $(\alpha+\beta X)$ is the predicted average outcome for any value of the $X$'s.
    • Accuracy of the predictions depends on the fit of the model.
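
Below is a minimal sketch of this fitting-and-interpreting step (the data-generating values are assumed for illustration): it builds a design matrix with an intercept, $T$, and $X$, solves the least-squares problem, and reads off the fitted coefficients and the sum of squared errors.

```python
# Minimal sketch (assumed DGP values): fit alpha, beta, gamma by minimizing
# the sum of squared errors, which is what OLS does.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

x = rng.normal(0, 1, n)                                # control variable X
t = rng.integers(0, 2, n)                              # treatment T
y = 2.0 + 1.5 * t + 0.8 * x + rng.normal(0, 1, n)      # assumed DGP: alpha=2, beta=1.5, gamma=0.8

# Design matrix: a column of ones (intercept), T, and X.
D = np.column_stack([np.ones(n), t, x])

# Least squares: choose coefficients minimizing ||y - D @ b||^2.
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
alpha_hat, beta_hat, gamma_hat = coef

y_hat = D @ coef                                       # predicted Y for each individual
sse = np.sum((y - y_hat) ** 2)                         # sum of squared errors at the minimum
print(alpha_hat, beta_hat, gamma_hat, sse)             # coefficients near (2.0, 1.5, 0.8)
```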

Omitted Variable Bias

  • Thought Experiment:
    • $T$: Treatment
    • $X$: Omitted variable
    • We might ask, hypothetically, how biased our results will be if we fail to control for $X$.
    • Long regression (includes $X$): $$Y_i=\alpha^L+\beta^L T_i+\gamma X_i+\epsilon_i$$
    • Short regression (omits $X$): $$Y_i=\alpha^S+\beta^S T_i+\xi_i$$
    • $\beta^S\neq\beta^L$ as long as $\gamma\neq0$ and $\text{Cov}(X,T)\neq0$.
      • If $\gamma=0$, then there is no need to include $X$ in the regression.
      • If $\text{Cov}(X,T)=0$, then $\xi_i$ in the short regression is all the unexplained variation in $Y_i$ not captured by $\alpha^S+\beta^S T_i$. Since $X$ is unrelated to $T$, that unexplained variation includes the variation in $X_i$, so $\xi_i$ in the short regression will simply equal $\gamma X_i+\epsilon_i$ in the long regression.
    • In other words, $\beta^S=\beta^L$ if $\gamma=0$ or $\text{Cov}(X,T)=0$.
  • Quantifying the Bias
    • Specifically, we can quantify the bias associated with failing to include $X$ in the regression. Consider $X_i=\tau+\pi T_i+\mu_i$.
      • This is a regression of the control variable on the treatment variable.
        • $\pi$ is a measure of the correlation between $X$ and $T$ - it's the slope coefficient relating changes in $T$ to changes in $X$.
        • This need not have a causal interpretation.
      • It turns out that the bias associated with excluding $X$ from the regression is $\beta^S-\beta^L=\pi\gamma$. We sometimes call this omitted variable bias.
    • Omitted Variable Bias: $$\beta^S-\beta^L=\pi\gamma$$
      • $\pi$: relationship between $X$ and $T$ (control variable and treatment)
      • $\gamma$: relationship between $X$ and $Y$ (control variable and outcome)
    • The short regression, leaving out $X$, will be biased if
      • the control variable $X$ is correlated with the treatment variable $T$, and
      • the control variable $X$ influences the outcome variable $Y$.
  • OVB and Observability
    • If we cannot observe $X$, we cannot include it in the regression.
    • The OVB equation gives us a way to think about the direction and extent of the bias:
|                    | Positive $\pi$ | Negative $\pi$ |
| ------------------ | -------------- | -------------- |
| Positive $\gamma$  | $+$            | $-$            |
| Negative $\gamma$  | $-$            | $+$            |
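
A short simulation sketch (all parameter values here are made up) can confirm the formula: the gap between the short- and long-regression coefficients on $T$ equals the product of $\pi$ (from a regression of $X$ on $T$) and $\gamma$ (from the long regression).

```python
# Minimal sketch (made-up parameter values): check the OVB formula
# beta_S - beta_L = pi * gamma in a simulated dataset.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

t = rng.normal(0, 1, n)                                   # treatment (continuous for simplicity)
x = 0.5 + 0.7 * t + rng.normal(0, 1, n)                   # X = tau + pi*T + mu, with pi = 0.7
y = 1.0 + 2.0 * t + 3.0 * x + rng.normal(0, 1, n)         # long-regression DGP, with gamma = 3.0

def ols(design, outcome):
    """Return the least-squares coefficients of outcome on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

ones = np.ones(n)
long_fit = ols(np.column_stack([ones, t, x]), y)          # [alpha_L, beta_L, gamma]
short_fit = ols(np.column_stack([ones, t]), y)            # [alpha_S, beta_S]
aux_fit = ols(np.column_stack([ones, t]), x)              # [tau, pi]

beta_L, gamma_hat = long_fit[1], long_fit[2]
beta_S = short_fit[1]
pi_hat = aux_fit[1]

print(beta_S - beta_L)        # bias of the short regression
print(pi_hat * gamma_hat)     # matches the bias: the OVB formula holds in-sample
```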

Skepticism and Wrap-Up

  • Dealing with Confounders
    • In principle, we could collect data on each of these factors and try to include them in the regression, which would hopefully yield better estimates of the effect of interest.
    • Everything above extends to regressions with many control variables.
  • Reason for Skepticism
    • Regression-based causal inference is predicated on the assumption that, once key observed variables have been made equal across treatment and control groups, selection bias from the things we can't see is also mostly eliminated.
    • In order for regression with controls to allow us to estimate the treatment effect, we have to make an assumption about controlling for all relevant differences.
      • This assumption is sometimes called the conditional-independence assumption or the selection-on-observables assumption
    • Two big problems:
      • How do we know we have controlled for all the relevant differences?
      • Some of the differences are unobservable (and do not have good proxies)
      • A meta-problem: we cannot test our assumptions
