Multiple Testing, Reporting Bias, and Misinterpreting Outliers

Introduction to Multiple Testing

  • The General Problem
    • Statistical testing and p-values are great ideas.
    • When we find patterns in the world, we want to know if they reflect genuine phenomena or if they could have easily been produced by random chance.
    • For any given statistical result, these tools can help us figure out if the result is easily explained by chance or not.
    • HOWEVER, we face the file drawer problem: the public doesn't get to see all the statistical tests that were conducted or could have been conducted. We only see the ones that were published (usually the "statistically significant" ones).
    • Testing multiple hypotheses + selectively reporting only the significant results = dangerous combination.
  • A General Phenomenon
    • Once we appreciate the problem of multiple testing and selective reporting, we start to see it everywhere.
    • We might even wonder if we know anything at all.

Multiple Testing in Research Practice

  • p-Hacking: An analyst knows that their result is much more interesting, exciting, and publishable if the result is statistically significant. They play around with their sample, their specification, etc. until they get a significant result, and they only report that one.
  • p-Screening: Honest researchers formulate hypotheses and conduct their tests. However, the statistically significant results are more likely to be written up, published, and publicized because those results are more interesting.
  • In both cases, we can end up with lots of false positives and overestimates.
  • Giving ourselves too many chances:
    • If we give ourselves a lot of chances to reject the null, the chances are pretty good that we will get at least one p < 0.05 (see the sketch after this list).
    • Ways to give ourselves too many chances:
      • Asking Different Questions
        • If we ask a lot of different questions and conduct a hypothesis test for each one, the chances are very good at least one will look significant.
      • Trying Different Methods
        • When conducting an independent data analysis, researchers have to make various choices (which method to use, which variables to include, etc.). These choices are collectively called researcher degrees of freedom, or, more colloquially, "wiggle room".
        • Sometimes, researchers can feel tempted to peek and tweak. This need not be malicious or deliberate. We are all human; we feel pressure and want to be proven right. It doesn't make us bad people, but as scientists/analysts, we need to be on guard against this temptation.
      • Testing Subgroups
        • Instead of increasing our chances by trying different analytical methods or questions, we could just look at the same questions and methods in different subgroups.
      • Quietly Conducting Multiple Studies (screening)
        • Reporting/Publication Bias: scientific journals, and, consequently, the broader press, have a strong preference for statistically significant results. The practice of only publishing and reporting statistically significant results is called publication bias or reporting bias.
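
A quick back-of-the-envelope sketch (in Python; not part of the original notes) of how fast the chance of at least one false positive grows when we give ourselves many independent chances at the 5% threshold:

```python
# Chance of at least one p < 0.05 across k independent tests
# when every null hypothesis is true: 1 - 0.95**k.
for k in (1, 5, 10, 20, 100):
    print(f"{k:3d} tests -> P(at least one 'significant' result) = {1 - 0.95 ** k:.2f}")
```

With 20 independent chances, the probability of at least one false positive is already about 64%; with 100 chances, it is over 99%.
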
  • How do we know that p-hacking and p-screening are a problem?
    • We know how p-values are defined, so we have an expectation about how likely any value of p would be.
      • Specifically, p-values are uniformly distributed under the null.
      • This just means, assuming no bias and the null is true (i.e. there is no effect), we expect to see p-values of 0.01, 0.02, 0.03, etc. with equal probability → 20% chance of observing a p-value ≤ 0.2, 5% chance of observing a p-value ≤ 0.05, and 99% chance of observing a p-value ≤ 0.99.
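
A minimal simulation sketch of this point, assuming a simple two-sample t-test on data where the null holds by construction; the share of p-values at or below any cutoff should come out close to the cutoff itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 50

# Both groups are drawn from the SAME distribution, so the null is true by construction.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
    for _ in range(n_sims)
])

# Under the null, the share of p-values at or below x should be about x.
for cutoff in (0.05, 0.20, 0.99):
    print(f"share of p-values <= {cutoff:.2f}: {np.mean(pvals <= cutoff):.3f}")
```

Each printed share should land near its cutoff (about 0.05, 0.20, and 0.99), matching the bullet above.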

Potential Solutions to p-Hacking and p-Screening

  • The general problem of multiple testing and selective reporting in quantitative studies is very hard to fix. But here are some potential solutions:
    • Pre-registration:
      • Researchers could pre-register their studies: say exactly what they plan to test for and how; report the results of their pre-registered tests regardless of the result.
      • For example, suppose roughly 8% of pre-registered trials come back statistically significant. Researchers use a 5% significance threshold, which means that under the null we would expect about 5% to be significant by chance.
        • That means the proportion of things being tested that are actually effective is closer to 3% (roughly 8% - 5%).
        • Even conditional on a successful trial, the chances of genuine effectiveness are only about 3 in 8 (see the worked calculation after this list).
      • For pre-registration to be an effective solution, it has to be done well: we cannot just let researchers pre-register 100 tests and report the favorable ones. There's a lot of that going on now in the social sciences.
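
A worked version of the arithmetic above. The 5% figure is the significance threshold; the 8% overall significance rate is an assumption chosen to match the "3 in 8" figure, and the decomposition is deliberately rough (it ignores power and assumes most tested hypotheses are null):

```python
alpha = 0.05              # significance threshold: false positives expected under the null
share_significant = 0.08  # ASSUMED overall share of pre-registered tests that come back significant

# Rough decomposition: significant results ~ false positives + genuine effects.
share_genuine = share_significant - alpha                        # about 0.03
p_genuine_given_significant = share_genuine / share_significant  # about 3/8

print(f"approximate share of genuine effects: {share_genuine:.2f}")
print(f"chance a significant result is genuine: {p_genuine_given_significant:.2f}")  # about 0.38
```
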
    • Replication:
      • One way to know if an estimated effect is genuine is to replicate it.
        • It is not foolproof. If we use a 5% significance threshold, there is still a 0.05 × 0.05 = 0.0025 chance of obtaining two significant estimates purely by chance.
        • And failure to reject the null is not proof of the null: we might wrongly dismiss real effects by conducting low-power replications (see the sketch after this list).
        • Ideally, we would get lots and lots of data and replicate on very large samples.
      • Sometimes, replication is feasible, and sometimes it is not.
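
A small simulation sketch (assuming two-sample t-tests and a modest true effect) of the low-power worry: even when an effect is real, an underpowered replication will often fail to reach significance, which can look like a failed replication:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_sims = 0.4, 5_000   # a modest real effect, in standard-deviation units

def share_significant(n_per_group):
    """Share of simulated studies that reach p < 0.05 when the effect is real."""
    hits = 0
    for _ in range(n_sims):
        treated = rng.normal(loc=true_effect, size=n_per_group)
        control = rng.normal(loc=0.0, size=n_per_group)
        if stats.ttest_ind(treated, control).pvalue < 0.05:
            hits += 1
    return hits / n_sims

print(f"original study, n = 100 per group: {share_significant(100):.2f}")  # roughly 0.8
print(f"replication,    n = 20 per group:  {share_significant(20):.2f}")   # often misses
```
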
    • Change our Significance Threshold:
      • Maybe we can solve the problem of multiple testing and false-positive results by using lower significance thresholds of p-values.
        • Maybe the convention of 5% got us into some trouble.
        • If we have enough statistical power, it would be great to use a lower significance threshold.
      • However, any bright line will cause the same problems that we've been discussing, and lower p-values could actually make the problem worse in some respects.
        • Lower significance thresholds will likely lead us to overestimate actual treatment effects even more than we do today: only unusually large estimates clear a stricter bar, so conditioning on significance inflates the reported estimates further (see the sketch after this list).
      • Therefore, rather than changing the threshold, it may be better not to obsess over significance thresholds at all.
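
A simulation sketch of why stricter thresholds can worsen overestimation: if we only keep estimates that clear the bar, the survivors are biased upward, and more so under a stricter threshold. The effect size, sample size, and thresholds here are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect, n, n_sims = 0.2, 50, 20_000

estimates, pvals = [], []
for _ in range(n_sims):
    treated = rng.normal(loc=true_effect, size=n)
    control = rng.normal(loc=0.0, size=n)
    estimates.append(treated.mean() - control.mean())
    pvals.append(stats.ttest_ind(treated, control).pvalue)
estimates, pvals = np.array(estimates), np.array(pvals)

print(f"true effect: {true_effect}")
for alpha in (0.05, 0.005):
    kept = estimates[pvals < alpha]
    print(f"average estimate among results with p < {alpha}: {kept.mean():.2f}")
```

The surviving estimates average well above the true effect of 0.2, and the stricter threshold makes the survivors even larger on average.
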
    • Don't Obsess over Statistical Significance:
      • p=0.05 is just an arbitrary threshold.
        • A substantively important effect may be statistically insignificant, and
        • a statistically significant result may be substantively unimportant.
      • We could use an entirely different paradigm for reporting results, such as the estimation paradigm or Bayesian inference, which emphasize estimating effect sizes and uncertainty over p-values (see the sketch after this list).
        • Requires a paradigm shift (hard, unlikely).
        • Sometimes we do have to make a yes/no decision about an effect.
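
A hedged illustration of estimation-style reporting: emphasize the effect estimate and a confidence interval rather than only a significance verdict. The data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
treated = rng.normal(loc=0.3, size=80)   # made-up illustrative data
control = rng.normal(loc=0.0, size=80)

# Report the estimated effect and its uncertainty, not just whether p < 0.05.
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"estimated effect: {diff:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```
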
    • Multiple Testing Corrections
      • Every time we conduct another test (give ourselves another chance), we require each individual test to clear a stricter significance threshold so that the overall false-positive rate stays controlled (e.g., a Bonferroni correction; see the sketch after this list).
      • Pro: allows a more flexible approach based on the specific circumstances of a study.
      • Cons:
        • Can we actually count the tests/chances in an accurate and trustworthy way?
        • Proper adjustment isn't always clear.
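
A minimal sketch of one standard correction, the Bonferroni adjustment (divide the family-wise threshold by the number of tests); the p-values here are hypothetical:

```python
# Bonferroni correction: with m tests and a family-wise alpha of 0.05,
# require each individual p-value to fall below alpha / m.
pvals = [0.001, 0.012, 0.030, 0.045, 0.20]   # hypothetical p-values from m = 5 tests
alpha = 0.05
threshold = alpha / len(pvals)               # 0.01

for p in pvals:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} at the corrected threshold {threshold:.3f}")
```

Note that 0.012, 0.030, and 0.045 would have cleared the naive 0.05 bar but not the corrected 0.01 threshold, which is the "stricter threshold per test" idea in the bullet above.
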
    • Test Important Hypotheses, not Cute Hypotheses:
      • If we read a study that would never have been published had the researchers found a null result, we should be more worried about multiple testing and selective reporting.
      • If researchers test questions for which we care about the answer, even if it is null, then a lot of these problems go away.
      • Most important questions fall into the latter category, whereas a lot of cute and fun questions fall into the former category.
    • Be Skeptical:
      • Individual studies can be flawed, and we should be skeptical of the newest, cutest scientific result.
      • Our hope is that even though individual studies are flawed, scientific literatures can be valuable nonetheless after we have done lots of theory, testing, replication, and meta-analysis.
    • Reprise on p-values:
      • p-values and hypothesis testing are useful tools, but they can be abused.
      • p-values solve an important problem, so they should be used when appropriate.
      • But they are not the ultimate method for assessing the credibility of an empirical result.
