, 77 (4), 631-662

Thou Shalt Not Bear False Witness Against Null Hypothesis Significance Testing

Miguel A García-Pérez. Educ Psychol Meas.

Abstract

Null hypothesis significance testing (NHST) has been the subject of debate for decades and alternative approaches to data analysis have been proposed. This article addresses this debate from the perspective of scientific inquiry and inference. Inference is an inverse problem and application of statistical methods cannot reveal whether effects exist or whether they are empirically meaningful. Hence, drawing conclusions from the outcomes of statistical analyses is subject to limitations. NHST has been criticized for its misuse and for the misconstruction of its outcomes, and critics have also stressed its inability to meet expectations that it was never designed to fulfil. Ironically, alternatives to NHST are identical in these respects, something that has been overlooked in their presentation. Three of those alternatives are discussed here (estimation via confidence intervals and effect sizes, quantification of evidence via Bayes factors, and mere reporting of descriptive statistics). None of them offers a solution to the problems that NHST is purported to have, all of them are susceptible to misuse and misinterpretation, and some bring about their own problems (e.g., Bayes factors have a one-to-one correspondence with p values, but they are entirely deprived of an inferential framework). Those alternatives also fail to cover a broad area of inference not involving distributional parameters, where NHST procedures remain the only (and suitable) option. Like knives or axes, NHST is not inherently evil; only misuse and misinterpretation of its outcomes need to be eradicated.

Keywords: Bayes factor; estimation; goodness of fit; inverse problem; significance testing.

Conflict of interest statement

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
(a) Smallest value of the (unsigned) sample correlation that will reject the nil null hypothesis in a two-sided test with α = .05, as a function of sample size. (b) Smallest value of the sample correlation that will not reject the non-nil null hypothesis in a left-tailed test with α = .05, as a function of sample size and parameterized by the magnitude ρ0 of correlation tested for (see the labels next to each curve). Sample correlations and criterion values ρ0 are assumed positive, of course.
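The two quantities plotted in Figure 1 can be approximated with a short sketch. This is not the article's own computation: it assumes the Fisher z transformation with a normal reference distribution (rather than an exact t-based criterion), and it assumes that the left-tailed test in panel (b) has null hypothesis ρ = ρ0 against the alternative ρ < ρ0. The function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def critical_r_nil(n: int, alpha: float = 0.05) -> float:
    """Approximate smallest |r| that rejects H0: rho = 0 in a two-sided test.

    Assumption: Fisher z approximation, atanh(r) ~ Normal(0, 1 / (n - 3)).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


def smallest_r_not_rejecting(n: int, rho0: float, alpha: float = 0.05) -> float:
    """Approximate smallest r that does NOT reject H0: rho = rho0 (left-tailed).

    Assumption: the alternative is rho < rho0, so H0 is rejected when
    (atanh(r) - atanh(rho0)) * sqrt(n - 3) falls below -z_crit.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return tanh(atanh(rho0) - z_crit / sqrt(n - 3))


# Both thresholds tighten as the sample grows: (a) shrinks toward 0,
# (b) rises toward rho0 (here an illustrative rho0 = 0.3).
for n in (10, 30, 100, 1000):
    print(n, round(critical_r_nil(n), 3), round(smallest_r_not_rejecting(n, 0.3), 3))
```

Under these assumptions, the criterion in (a) falls toward zero as the sample size grows, so that arbitrarily small sample correlations eventually reject the nil null, whereas the criterion in (b) approaches ρ0 from below.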
Figure 2.
Scatterplots of log Bayes factor against log p value for true (open circles) and false (red crosses) null hypotheses at four different sample sizes (panels) in a paired-samples (or one-sample) test for the mean.
Figure 3.
Scatterplots of log Bayes factor against log p value for true (open circles) and false (red crosses) null hypotheses at four different sample sizes (panels) in a test for the correlation between two variables.
Figure 4.
Sample data (symbols) and fitted curves from the models in Equation 2 (left panel) and Equation 1 (right panel). Insets show the value and p value of the goodness-of-fit statistic G² (with 10 degrees of freedom in the left panel and with 13 degrees of freedom in the right panel), the measure of misfit given by twice the value of the negative log-likelihood of the data under each model (−2 log L), and the value of the Bayesian information criterion (BIC), which adds to the value of −2 log L a penalty of ln(15) units per parameter in the model. The two-parameter model is rejected by the G² statistic but it nevertheless outperforms the five-parameter model by the BIC.
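The BIC arithmetic described in the Figure 4 caption can be sketched as follows. The sample size of 15 is inferred from the stated penalty of ln(15) per parameter, and the two −2 log L values below are hypothetical placeholders, not the values shown in the article's insets.

```python
from math import log


def bic(neg2_log_lik: float, n_params: int, n_obs: int) -> float:
    """BIC = -2 log L + k * ln(n), as described in the caption."""
    return neg2_log_lik + n_params * log(n_obs)


N_OBS = 15  # assumption: inferred from the caption's ln(15) penalty per parameter

# Extra penalty paid by the five-parameter model relative to the two-parameter one.
penalty_gap = (5 - 2) * log(N_OBS)

# Hypothetical misfit values, NOT taken from the article's Figure 4.
neg2ll_two, neg2ll_five = 60.0, 54.0

print(f"penalty gap: {penalty_gap:.3f}")
print(f"BIC, two-parameter model:  {bic(neg2ll_two, 2, N_OBS):.3f}")
print(f"BIC, five-parameter model: {bic(neg2ll_five, 5, N_OBS):.3f}")
```

With this penalty, the two-parameter model attains the lower BIC whenever its −2 log L exceeds that of the five-parameter model by less than 3 ln(15), or about 8.12 units. This holds regardless of whether a goodness-of-fit test rejects the simpler model.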
