Experiments
Randomization and Noise
- Random Assignment
- If units are assigned randomly
- Then no reason to expect differences: T/C groups should have the same characteristics, including potential outcomes (in particular, $Y(1)$ and $Y(0)$).
- Formally, Treatment and Control groups should have the same characteristics in expectation.
- But expectation is a population concept
- And all we can see is a (finite-size) sample
- Why do we do an experiment?
- Fundamental problem of causal inference:
- Never observe both $Y_i(1)$ and $Y_i(0)$ for a unit
- Instead, the best we can do is compare $Y(1)$ among one group to $Y(0)$ in another group (a small simulation sketch follows this list)
- How do experiments help?
- How do experiments achieve this?
- Random Assignment
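To make this concrete, here is a minimal simulation sketch (all numbers and variable names are hypothetical, not from the notes) of why a between-group comparison under random assignment recovers the average effect we can never observe unit by unit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical potential outcomes for 10,000 units; true ATE = 2.
n = 10_000
y0 = rng.normal(loc=50, scale=10, size=n)      # Y(0): outcome if untreated
y1 = y0 + rng.normal(loc=2, scale=5, size=n)   # Y(1): outcome if treated

# Fundamental problem: we never see both y0 and y1 for the same unit.
# Random assignment: each unit is treated with probability 1/2.
treated = rng.random(n) < 0.5
observed = np.where(treated, y1, y0)

# Compare Y(1) among the treated to Y(0) among the controls.
diff_in_means = observed[treated].mean() - observed[~treated].mean()
print(f"true ATE:            {np.mean(y1 - y0):.3f}")
print(f"difference in means: {diff_in_means:.3f}")  # close, not exact (finite sample)
```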
- Standard Errors:
- Standard error depends on the underlying variance of the outcomes
- Not observable (just like $p$, the real proportion, was not)
- Some subtleties to the calculation
- One common approach (see the sketch after this list):
$$\widehat{SE} = \sqrt{\frac{\hat{\sigma}_1^2}{m} + \frac{\hat{\sigma}_0^2}{N - m}}$$
where $N$ is the size of the sample, $m$ is the number of observations assigned to treatment, and $\hat{\sigma}_1^2$ and $\hat{\sigma}_0^2$ are the outcome variances estimated in sample.
- Larger sample (increasing $N$) leads to a lower standard error, meaning more precision
- If the variances of $Y(1)$ and $Y(0)$ (treatment and control group outcomes) are similar, then splitting the sample evenly between treatment and control minimizes $\widehat{SE}$.
- Lower variance in outcomes means more precision.
- If we are looking at samples or outcomes with less natural variability (noise), then we will have more precise inferences.
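A short sketch of the standard-error formula above (the data and variable names are hypothetical):

```python
import numpy as np

def diff_in_means_se(y_treat, y_control):
    """SE of a difference in means: sqrt(var1/m + var0/(N - m))."""
    m = len(y_treat)                    # observations assigned to treatment
    n_c = len(y_control)                # N - m observations in control
    var1 = np.var(y_treat, ddof=1)      # estimated outcome variance, treatment group
    var0 = np.var(y_control, ddof=1)    # estimated outcome variance, control group
    return np.sqrt(var1 / m + var0 / n_c)

rng = np.random.default_rng(1)
y_t = rng.normal(52, 10, size=200)      # hypothetical treatment-group outcomes
y_c = rng.normal(50, 10, size=200)      # hypothetical control-group outcomes
print(diff_in_means_se(y_t, y_c))       # larger N / lower variance -> smaller SE
```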
Confidence Intervals
Point Estimates: The single statistic we get from a study or analysis is called a point estimate.
- Using our calculated SE, we can construct a Confidence Interval (CI): a range representing our uncertainty about our point estimate due to random error/noise.
- We can consider CI as providing a range of possible true values that would be consistent with our estimate.
- "How consistent" is defined by the width of a CI - most commonly choose to be 95%.
- 95% Confidence Interval: assuming no bias, 95% of the intervals constructed in this way around an estimate will contain the true value.
- Common misinterpretation: there is a 95% chance the true value lies within our CI.
- Philosophical explanation: the truth is fixed: it's either in the interval (100%) or not (0%)
- Mathematical explanation: if we assume that our estimand could be literally anything, and all possibilities are equally likely, the two are numerically equivalent.
- CI relies on the estimated SE from the sample, which may differ from the true SE
- "95%" is a statement about the long-run "coverage" of CIs, not the likely range of the truth itself.
- Constructing Confidence Intervals:
- Central Limit Theorem: If we repeat a study/analysis/experiment a zillion times, the value of many of the statistics we get will follow a normal distribution.
- Normal distribution: values of some random variables (like heights of people/means of samples) follow a normal distribution
- The truth will be the mean of the normal distribution
- In a normal distribution, 95% of the probability lies within 2SEs of the mean
- meaning 95% of our samples should lie within 2SEs of the truth
- so if we add 2SEs to each side of our estimate, 95% of the time that interval should contain the truth
- Common use in short:
- Researchers often report a 95% confidence interval for various quantities (means, regression coefficients, proportions) the same way: $\text{estimate} \pm 2 \times \widehat{SE}$ (see the sketch after this list)
- What contributes to high SE?
- Small sample size
- High variance in outcomes ($\hat{\sigma}_1^2$ and $\hat{\sigma}_0^2$)
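A sketch of constructing the 95% interval from a point estimate and its SE (hypothetical data; 1.96 is the exact normal multiplier behind the "2 SEs" rule of thumb):

```python
import numpy as np

rng = np.random.default_rng(2)
y_t = rng.normal(52, 10, size=200)   # hypothetical treatment-group outcomes
y_c = rng.normal(50, 10, size=200)   # hypothetical control-group outcomes

estimate = y_t.mean() - y_c.mean()
se = np.sqrt(np.var(y_t, ddof=1) / len(y_t) + np.var(y_c, ddof=1) / len(y_c))

# 95% CI: about 2 SEs on each side of the point estimate.
ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"point estimate: {estimate:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```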
Hypothesis Testing Introduction
- Justification: why do we need hypothesis testing?
- Sometimes, we want to make a decision about whether we observed data that are different from some important values.
- Hypothesis Testing:
- Suppose that there is no effect of the treatment whatsoever, i.e., $ATE = 0$.
- What is the probability that we would have gotten an estimate as big as ours just by chance alone?
- We will call this number the p-value.
- If that number (p-value) is really small, then we can be fairly confident that the treatment really does have an effect (because it is very unlikely that we would have gotten this result just by chance).
- If the number is big, we are much less confident (because there is a good chance that we could have gotten this result by chance, even if there is really no effect).
- Test our data to see if it is consistent with the null hypothesis or not.
- consistent: are our results "too weird" for the null hypothesis to make sense?
- consistent: what is the chance (probability) we would get a result at least as "extreme" as what we observed due to random error/chance/noise, assuming the null hypothesis is true and there is no bias? -- the p-value
- P-Values and Magical Thresholds
- Special cutoffs:
- "reject the null hypothesis" of no effect if
- "fail to reject the null hypothesis" of effect if
- When results are represented in a table, stars (*, **, ***) next to the estimates are used to indicate that there are some "statistical significance."
- p-value is a very useful statistic, but it's dangerous to stress about particular significance thresholds.
- Remember that the p-value is the probability of getting a result at least as extreme as ours if the true effect is zero
- A p-value of $p$ does not mean that the finding is true with probability $1 - p$.
- The naive rule of thumb that $p < 0.05$ means that the finding is true and $p > 0.05$ means that it is false has done a lot of damage over the years, and we make lots of errors on both sides of that threshold.
- Calculating p-values:
- Analytical Approach: CLT and SEs
- CLT tells us that with a sufficiently large sample, the distribution of our estimates (from theoretical replications of an analysis) approaches a normal distribution
- Thus, we can assume our single result was drawn from a normal distribution with a mean defined by the null hypothesis (e.g., $ATE = 0$) and standard deviation equal to our sample SE.
- Ask: how many SEs away from the null hypothesis is our estimate?
- Take away: large effects (differences between the null and the observed data) and/or low SEs/high precision make for a bigger test statistic (more SEs from the null), and thus lower p-values (see the sketch after this list).
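A sketch of the analytical approach on hypothetical data, taking the null to be $ATE = 0$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y_t = rng.normal(52, 10, size=200)   # hypothetical treatment-group outcomes
y_c = rng.normal(50, 10, size=200)   # hypothetical control-group outcomes

estimate = y_t.mean() - y_c.mean()
se = np.sqrt(np.var(y_t, ddof=1) / len(y_t) + np.var(y_c, ddof=1) / len(y_c))

# How many SEs away from the null (ATE = 0) is our estimate?
z = (estimate - 0) / se
# Two-sided p-value: chance of a result at least this extreme if the null is true.
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```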
- Computational Approach: Permutation Tests
- Condition: we have a bunch of data with a treatment value and an outcome value for each unit.
- Randomly shuffle (permute) the treatment labels across units and re-estimate the effect.
- Key point: this is akin to saying our null - that the treatment doesn't affect the outcome - is true, since the outcomes will be distributed randomly in each exposure/treatment group.
- Do this, say, 1000 times (permutations), simulating 1000 experiments in a world where the null (no effect of exposure on outcome) is true.
- Compare the distribution of these estimates to our observed estimate:
- p-value: the proportion of permutations that yield an estimate "more extreme" than your observed estimate (see the sketch after this list)
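A sketch of a permutation test (the data and labels are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: a treatment indicator and an outcome for each unit.
treat = np.repeat([1, 0], 100)
outcome = np.concatenate([rng.normal(52, 10, 100), rng.normal(50, 10, 100)])

observed = outcome[treat == 1].mean() - outcome[treat == 0].mean()

# Simulate 1000 experiments in a world where the null (no effect) is true
# by shuffling the treatment labels and re-estimating the difference in means.
n_perm = 1000
null_estimates = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(treat)
    null_estimates[i] = outcome[shuffled == 1].mean() - outcome[shuffled == 0].mean()

# p-value: share of permutations at least as "extreme" as the observed estimate.
p_value = np.mean(np.abs(null_estimates) >= abs(observed))
print(f"observed diff = {observed:.2f}, permutation p-value = {p_value:.3f}")
```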
- Summary:
- Everything has noise: even randomized experiments
- Quantify with SE: lower with larger $N$, less outcome variance
- Use SE to calculate confidence interval (CI) - range of values of truth that would be consistent with the estimate
- May also calculate p-value to compare the estimate to a single hypothetical value of the truth (the null hypothesis)
Types of Experiments
- Experiments = Random Assignment
- Random assignment of units into treatment group and non-treatment group (the control group)
- Multi Arm Experiments: multiple treatments in a single experiment
- Observational studies = non-random assignment: based on observed treatments, outcomes, and behavior in the world.
- Stratification and Randomization
- Think of randomization as a computer selecting a random number with treatment allocation based on the number
- Lowest half of the random numbers -> treatment
- Upper half of random numbers -> Control
- Sometimes, randomization is done by observable characteristics like gender to ensure balance -> Stratified randomization
- Stratified Randomization: split subjects/units by one or a few factors, then randomize within those strata (a short sketch follows this list)
- Done to ensure better balance on stratification variables
- Usually choose expected strong confounders for stratification
- Maximize efficiency for studying how the effect varies across groups
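A minimal sketch of stratified randomization (the subject list and the stratification variable are invented for illustration):

```python
import random
from collections import defaultdict

random.seed(5)

# Hypothetical subjects with one stratification variable (gender).
subjects = [{"id": i, "gender": "F" if i % 2 == 0 else "M"} for i in range(12)]

# Group subjects by the stratification variable, then randomize within each stratum.
strata = defaultdict(list)
for s in subjects:
    strata[s["gender"]].append(s)

for stratum in strata.values():
    random.shuffle(stratum)              # random order within the stratum
    half = len(stratum) // 2
    for s in stratum[:half]:
        s["arm"] = "treatment"
    for s in stratum[half:]:
        s["arm"] = "control"

for s in subjects:
    print(s)                             # each stratum is split evenly across arms
```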
- Types of Experiments
- Lab: the classical model of experiments, including bench experiments (Physics, Chemistry, and Biology) and social-science laboratories (Psychology, Sociology, Economics, and Political Science)
- Field: similar to lab experiments, but instead of being carried out in a lab, experimenters manipulate a treatment in the real world.
- Often called a Randomized Controlled Trial (RCT)
- Lab in the Field
- Quasi or Natural Experiments:
- Sometimes, institutions in the world create assignments that are random or close to random
- Why do experiments?
- Randomization solves the fundamental problem of causal inference
- Rule out reverse causality: experimenters know that random chance controls treatment
- Rule out common cause (confounding, omitted variable)
- Randomness means there should be no differences between groups
- This is true on average, so sample size and variance matter.
- Randomization Eliminates Confounding
- If exposure is randomized, and therefore depends only on chance, then on average (in large samples) treatment and control will be very similar, and the groups' counterfactual outcomes shouldn't differ on average. That is, $E[Y(0) \mid T=1] = E[Y(0) \mid T=0]$ and $E[Y(1) \mid T=1] = E[Y(1) \mid T=0]$.
- Then, we have apples-to-apples comparison
- Have no (bias due to) confounding
- Treatment and control groups are exchangeable (can stand in for one another)
- Treatment and control groups are balanced on all covariates/characteristics
- End Result: can get unconfounded ATE (or ATT)
Threats to Internal Validity
Fundamental Principle of Controlled Experiments: actual outcomes among the control group can stand in for (i.e., are the same as) the counterfactual outcomes of the treated group
- $Y_i(1) - Y_i(0)$ is the individual causal effect - never observed directly
- In an experimental study, we let the outcome of the control group substitute for this unobserved outcome.
- This works because randomness implies $E[Y(0) \mid T=1] = E[Y(0) \mid T=0]$
- The concept of Internal Validity is about assessing how close $E[Y(0) \mid T=0]$ (what we observe in the control group) is to $E[Y(0) \mid T=1]$ (the treated group's counterfactual)
- Things can go wrong in an experiment
- Chance imbalance
- If assignment between control and treatment is random, then we would expect that on average the groups should be the same on observable characteristics.
- They won't be exactly the same, since by chance we expect some differences
- With large samples we would expect more similarity between control and treatment
- We cannot check the unobservables, but if the observables are balanced then we can be reasonably confident the unobservables are balanced as well.
- Balance Checks at the end of research papers
- Randomization guarantees balance of $Y(1)$, $Y(0)$, and all pre-treatment covariates in expectation. However, imbalances occur by chance from time to time.
- We might find a significant imbalance in a pre-treatment covariate that we expect is strongly related to $Y(1)$ and $Y(0)$ (a balance-check sketch follows at the end of this list).
- Ex-ante: anticipate important confounders, stratify on them
- Create categories and then randomize within categories
- Ex-post: 3 different responses:
- Throw out the broken experiment.
- Had good intentions when running the experiment, but got unlucky and now cannot trust the estimates.
- If we test enough covariates, we are virtually guaranteed to find imbalance on some of them.
- Therefore, by this logic, we would have to throw out all experiments.
- Even broken and imprecise experiments contain some information.
- They are unbiased in expectation, and they could be pooled with other evidence, incorporated into a meta-analysis, etc. that will later contribute to knowledge.
- Proceed as normal.
- Unbiasedness is a property "in expectation"
- Report the imbalance in the paper, but still report the same estimate that was originally planned
- Condition on that variable in order to account for the imbalance and get a more efficient estimate.
- Compare the experimental effects within groups sharing the same level of the unbalanced observable.
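A minimal balance-check sketch: a two-sample t-test on one hypothetical pre-treatment covariate (names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)

# Hypothetical pre-treatment covariate (e.g., age) by randomized arm.
age_treatment = rng.normal(35, 8, size=250)
age_control = rng.normal(35, 8, size=250)

# Is the difference in covariate means larger than chance alone would explain?
stat, p_value = ttest_ind(age_treatment, age_control)
diff = age_treatment.mean() - age_control.mean()
print(f"difference in means = {diff:.2f}, p = {p_value:.3f}")
# Caveat from above: test enough covariates and some will look "imbalanced" by chance.
```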
- Lack of statistical power
- Statistical power: the ability to detect a true effect if one exists.
- The smaller the effect size relative to other sources of variance, the more observations we need to conclude there is an effect at any p-value threshold.
- Less noise -> more precision/power
- Larger $N$ -> more power
- If a study is too small, it may be underpowered to detect small-to-moderate causal effects. That is, they'll be swamped by noise and won't produce a significant result (see the power-simulation sketch below).
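A simulation sketch of statistical power under an assumed effect size, noise level, and sample size (all values are hypothetical):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

def simulated_power(n_per_arm, true_effect, sd, n_sims=2000, alpha=0.05):
    """Share of simulated experiments whose p-value clears the alpha threshold."""
    hits = 0
    for _ in range(n_sims):
        y_c = rng.normal(0, sd, size=n_per_arm)
        y_t = rng.normal(true_effect, sd, size=n_per_arm)
        _, p = ttest_ind(y_t, y_c)
        hits += p < alpha
    return hits / n_sims

# Same effect and noise, different sample sizes: small studies are underpowered.
print(simulated_power(n_per_arm=30, true_effect=2, sd=10))    # low power
print(simulated_power(n_per_arm=300, true_effect=2, sd=10))   # much higher power
```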
- Non-compliance
- Some subjects assigned to treatment will fail to take it
- Some untreated subjects will find a way to take it anyway.
- Define $Z$ as the experimental condition (assignment to treatment) and $T$ as the actual treatment of interest. Randomization guarantees that $Z$ is unconfounded, but when there is noncompliance, it does not guarantee that $T$ is, because $Z$ is random but $T$ might not be.
- Average effect of $Z$ for the whole sample: the intent-to-treat effect (ITT); see the sketch below.
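A sketch contrasting the ITT with a naive "as-treated" comparison under non-compliance (the compliance model and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10_000

baseline = rng.normal(50, 10, size=n)     # underlying health (unobserved)
z = rng.random(n) < 0.5                   # Z: randomized assignment

# Compliance is not random: healthier subjects are more likely to take the treatment.
takes_if_assigned = rng.random(n) < np.clip((baseline - 30) / 40, 0, 1)
t = z & takes_if_assigned                 # T: treatment actually taken

y = baseline + 5 * t                      # true effect of taking the treatment = 5

# Intent-to-treat: compare by assignment Z, which randomization keeps unconfounded.
itt = y[z].mean() - y[~z].mean()
# Naive as-treated comparison by T is confounded, because T is not randomized.
naive = y[t].mean() - y[~t].mean()
print(f"ITT = {itt:.2f}  (diluted by non-compliance)")
print(f"as-treated = {naive:.2f}  (biased: takers are healthier to begin with)")
```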
- Placebo effects
- Individuals assigned to treatment often know they are part of an experiment, they know they will be recorded to see how they are doing.
- We need a good logical reason to believe that these incidental things could cause the observed effect.
- Placebo group in a double blinded experiment: randomly select a third group that gets everything but the actual treatment
- Three groups: treatment, control, placebo
- If the effect is really due to the treatment, should see it in the treated group and not the placebo group.
- Measure (Treated - Control), (Placebo - Control)
- True effect is (Treated - Control) - (Placebo - Control)
- Most often: just compare treated to placebo
- Attrition
- Suppose that some units drop out of the analysis after randomization, and we cannot observe their outcomes.
- If we are confident that the attrition is random, then we can still estimate the average effect.
- If the attrition is non-random but unrelated to the treatment, then we can estimate the average effect of the treatment for those that stayed in the sample.
- Most of the time, we are left worrying that attrition was nonrandom and influenced by the treatment.
- The biggest challenge of attrition is that when people drop out, we don't know what type of dropout they are.
- Interference
- For field experiments: is it realistic to assume control and treatment are not in contact?
- Two forms of interference: Contamination and Spillovers.
External Validity
- External Validity: to what populations, settings, treatment variables and measurement variables can this effect be generalized?
- Internal validity concerns the design and outcome of an experiment
- External validity concerns if the effect can be generalized to other settings.
- Threats to external validity:
- Experimental Effect
- People are responding to experimenters' implicit demands/hopes
- Experimenters believe they are testing the treatment itself, but subjects are responding to the experimental situation.
- Heterogeneity
- Population in experiment has different (heterogeneous) response than population in general
- Adaptation
- Improving external validity:
- Make sure the experimental population is representative, or at least representative of the populations/places where we might want to generalize the effect to.
- Look at subpopulations to look for heterogeneous effects
- Ethics:
- Quest for external validity leads us to want to match experimental conditions as closely as possible to those in the real world. However, we might not always want to match real world conditions because that would involve causing harm.
Ethics
- Obvious Case: First, do no harm
- strictly prohibited to give subjects a disease/condition in order to study treatments
- More obvious case: strictly prohibited to give subjects a disease in order to study treatment or disease progression without consent.
- Causing Harm
- Beyond medicine, any treatment that involves causing direct harm to treated subjects is clearly unethical.
- Gray Areas:
- Side effects
- Justifiable for treatments known to work.
- But if we are experimenting, we don't know they work -> require an initial phase of experimentation in vitro or on animal subjects to establish reasonable belief of efficacy.
- Placebos:
- If treatment works, is withholding it from the control group unethical? -> require continuous monitoring, stop experiment if it's obviously effective.
- Other:
- Subjects are exposing themselves to risk
- Conditional Cash Transfers: why not divide the money and give that to the control group
- Teacher quality: What is our role in determining who gets a good or a bad teacher?
- Moving to Opportunity Study:
- Risk-Benefit Comparison: risks to subjects are reasonable in relation to anticipated benefits, if any, to subjects, and the importance of the knowledge that may reasonably be expected to result.
- Informed Consent: no investigator may involve a human being as a subject in research unless the investigator has obtained the legally effective informed consent of the subject. An investigator shall seek such consent only under circumstances that provide the prospective subject sufficient opportunity to consider whether or not to participate and that minimize the possibility of coercion or undue influence.
- Possible that providing full information to subjects might make the research impossible.
- The participant is not fully informed of what the purpose of the study is when they consent.