The Fallacy of Significance Tests

It must be understood that a p-value, or any other statistic based on the Chi-Square, is not a useful number. It has two components: sample size and effect size. Its ability to detect a non-zero difference increases when either sample size or effect size increases. If only the sample size increases, with the effect size held constant, the statistic still becomes inflated. There is also a problem with the underlying assumption. If the test is only about detecting a “non-zero” difference, it is of no use when the magnitude, i.e., the effect size, is too small to be of any importance. I will provide several examples of the dangers of significance tests.
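To make the dependence on sample size concrete, here is a minimal sketch in Python. It assumes a correlation as the effect size and uses the Fisher z approximation for the test of zero correlation; the function name and the chosen values of r and n are illustrative, not taken from any study.

```python
import math

def p_value_for_correlation(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Same effect size (r = 0.10) at every row; only the sample size changes.
for n in (20, 100, 500, 5000):
    print(f"n = {n:5d}   r = 0.10   p = {p_value_for_correlation(0.10, n):.4f}")
```

The correlation never moves, yet the p-value goes from clearly "non-significant" to clearly "significant" as n grows.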

If one takes a glance at Google, the web is replete with websites, chapters, and presentations on how to interpret and report the p-value. They all have their share of troubles. Each time the authors present a result with an effect that may be small, modest, or large, but clearly different from zero, they ignore it and go straight to the p-value. If it does not reach significance (below 0.05 or sometimes 0.10), then whatever the effect size is, they conclude there is no difference or no correlation. The correct interpretation would have been that, whatever effect size we find, a non-significant p-value suggests we need a larger sample to have more confidence in the result. A related problem is the cut-off used for significance. There is no logical reason to affirm that 0.04 is significant but 0.06 is not. If we need some index of “confidence”, then confidence intervals (CIs, which become wider when the sample is small) would be sufficient and much better. Even then, the CI can be misused, since it is often argued that if the CI includes zero, we must conclude there is no effect different from zero. Well, this is just the same old-fashioned fallacy.
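As a rough illustration of why the CI is more informative than a yes/no verdict, the sketch below computes an approximate CI for a correlation through the Fisher z transformation. The function name and the example values (r = 0.20 with n = 30 and n = 1000) are hypothetical.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation (Fisher z method)."""
    z = math.atanh(r)
    critical = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    half_width = critical / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

for n in (30, 1000):
    low, high = correlation_ci(0.20, n)
    print(f"n = {n:4d}   r = 0.20   95% CI = ({low:.3f}, {high:.3f})")
```

With the small sample the interval includes zero, but it also includes correlations above 0.50. Reading it as "no effect" throws away most of what the interval says, namely that the estimate is too imprecise to conclude anything.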

Rogosa (1980) provides an enlightening illustration of the consequences of this utter fallacy. In the cross-lagged correlation (CLC) path analysis framework, it is said that when the cross-lagged correlations do not show any causal dominance whatsoever, the difference in the (cross-lagged) path correlations must not be significant. However:

Rejection of the null hypothesis of equal cross-lagged correlations (H0: p(x1*y2) = p(y1*x2)) often is interpreted with little regard for the power of the statistical test. Users of CLC are advised to use large samples; Kenny (1975) advises that “cross-lagged analysis is a low-power test” (p. 887) and that even with moderate sample sizes (defined as 75 to 300), statistically significant differences are difficult to obtain. With large enough samples, trivial deviations from the null hypothesis lead to rejection. For example, Crano, Kenny, and Campbell (1972) found significant differences between cross-lagged correlations of .65 and .67 because the sample size was 5,495.

And that’s how significance tests are used: to produce misleading conclusions. Concerning the scientists who have claimed a “significant” difference that is ridiculously small in effect size, one relevant question is: do they really believe what they say? Sometimes, I doubt it. They don’t have the guts to question the gold standards, as if these were the words of God.

Other dangerous claims come from studies aimed at detecting item bias. In particular, some early differential item functioning (DIF) studies did not even report the effect sizes of the DIFs. This is what happened in Willson et al. (1989), among quite many others in the 1980s, where they claim to have found no black-white item bias in the K-ABC, based on significance tests. Yet this is not surprising, since their sample is small (N=100). With regard to the few significant DIF items found, they note that “the effects, although statistically significant, tend to be of no real or practical consequence.” (p. 295). The first problem is with the interpretation. Do they mean no practical consequence in terms of the individual effect sizes? Or no practical consequence in terms of the impact of the whole set of DIFs on the total test score? The other problem is that a large number of DIFs have probably been missed due to low power of detection. There is no way we can tell whether the undetected DIFs would show a pattern of DIF cancellation, which would be evidence of no bias. A study of this kind should never have been accepted for publication.

Still another illustration of the insidious effect of relying on significance tests: van Soelen et al. (2011, Table 5) reported a childhood heritability of PIQ of 0.64 in the AE model instead of 0.46 in the ACE model. Usually, the purpose of such modeling is to find the most parsimonious, i.e., simplest, model, the one with the fewest free parameters to estimate. When a parameter is removed and model fit is not significantly worsened, the reduced model is said to be acceptable compared to the full model. Thus, the reason they selected the AE estimates is that the removal of the C parameter (shared environment) does not reach significance on the Chi-Square statistic. Their sample size was modest (224+46). The problem is that C has a value of 0.17. Modest, but not zero. In the AE model, where C is dropped, the C value obviously becomes zero. Surely, with a statistic less affected by sample size, or with a larger sample, the result would be different. In the ACE model, A amounts to 0.46, C to 0.17 and E to 0.38. The total is 1.00 up to rounding, as is usually the case with standardized parameters, which must sum to 100%. Now, if we look at the AE model, A equals 0.64 and E equals 0.36. What happened? Simple. When C is dropped, its value is given to A, which becomes inflated (E being the nonshared environment plus measurement error). This distortion has serious implications for their conclusion, where they imply, based on AE models, that heritability for PIQ does not increase with age, when in fact it has probably increased.
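The arithmetic behind this reallocation can be checked in a few lines of Python. The numbers are the estimates quoted above; the variable names are mine, and the check is only a sketch of the bookkeeping, not a refit of the twin model.

```python
# Standardized variance components quoted above for childhood PIQ.
ace = {"A": 0.46, "C": 0.17, "E": 0.38}
ae = {"A": 0.64, "E": 0.36}

ace_total = sum(ace.values())          # sums to about 1.00, up to rounding
ae_total = sum(ae.values())
familial_in_ace = ace["A"] + ace["C"]  # what A has to absorb once C is fixed to zero

print(f"ACE total: {ace_total:.2f}   AE total: {ae_total:.2f}")
print(f"A + C in ACE: {familial_in_ace:.2f}   A in AE: {ae['A']:.2f}")
```

A + C in the full model is within rounding of A in the reduced model, which is what we expect if dropping C simply hands its share of variance to A.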

The consequences of all of these mistakes are best understood when one reads an article that reviews previous research. The author(s) begin by saying that researchers X found no relationship between A and B, and researchers Y found no relationship between A and B and C. When we look at the referenced articles, however, it sometimes happens that the claim of a null relationship is due to the authors focusing on the p-value instead of the effect size, which can be as low as 0.10 or as high as 0.30. Clearly, it’s not equal to zero. Even worse is when they summarize numerous studies, each with a small sample and each reported as showing no relationship. And yet, when the samples are combined, the p-value will be highly significant. This is what happened with Besharov (2011), who consistently rejects every experimental study that fails to improve IQ but improves scholastic achievement. This conclusion is still right, but the way Besharov relies on significance tests obscures not only the reported effect sizes but also the largely “significant” difference that would emerge when considering all the studies collectively. Since we don’t necessarily have a lot of time to read all of these papers, one would easily prefer to trust the review article. Unfortunately, it may go wrong. There is no way to know until we read all of these papers ourselves.
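The point about combining small studies can be sketched with a simple fixed-effect pooling of correlations on the Fisher z scale. Everything here is hypothetical: ten made-up studies, each with r = 0.20 and n = 30, and function names of my own choosing. It is not a reanalysis of the studies Besharov reviewed.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def single_study_p(r, n):
    """p-value for H0: rho = 0 in one study (Fisher z approximation)."""
    return two_sided_p(math.atanh(r) * math.sqrt(n - 3))

def pooled_estimate(studies):
    """Fixed-effect pooling of (r, n) pairs, weighting each Fisher z by n - 3."""
    weights = [n - 3 for _, n in studies]
    z_bar = sum(w * math.atanh(r) for w, (r, _) in zip(weights, studies)) / sum(weights)
    standard_error = 1 / math.sqrt(sum(weights))
    return math.tanh(z_bar), two_sided_p(z_bar / standard_error)

# Ten hypothetical small studies with the same modest correlation.
studies = [(0.20, 30)] * 10
print("each study alone: p =", round(single_study_p(0.20, 30), 3))
pooled_r, pooled_p = pooled_estimate(studies)
print("pooled: r =", round(pooled_r, 2), " p =", round(pooled_p, 5))
```

Taken one at a time, every study would be reported as "no relationship". Pooled, the same correlation of 0.20 becomes highly significant, although the effect size has not changed at all.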

Indeed, one must first of all wonder whether the significance test really adds any relevant information. A small effect size can easily be disregarded if the sample is small. We will conclude that it needs more replication, without even looking at the p-value. My opinion is obviously that significance tests should never be used again. They do not add any new information beyond what is provided by sample size and effect size. They only add confusion.

Further reading:

Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. The journal of wildlife management, 912-923.
Dahiru, T. (2008). P-value, a true test of statistical significance? A cautionary note. Annals of Ibadan postgraduate medicine, 6(1), 21-26.
Grabowski, B. (2016). “P< 0.05” might not mean what you think: American Statistical Association clarifies P values.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European journal of epidemiology, 31, 337-350.


  1. John Wayne

    Maybe you will be happy to know that in a revise and resubmit of my last manuscript I was asked to remove the significance tests and just exhibit the correlations.

  2. Meng Hu

    Good to hear. But well… in general I don’t understand this fashion for “significance” anyway. It’s because I don’t trust scientists anymore that I feel obliged to read the articles they cite in their review articles. Even Arthur Jensen, a guy I once considered to be a super psychometrician, cited many studies with fallacious claims and interpretations (unless he has misread all of them). I have, indeed, lost faith even in some very great names. Now I want to make my own review(s), so as not to have to rely on their expertise. This is unfortunate because the main (and great) advantage of such articles (and books) is that they save you time when you don’t have a lot of it. You certainly don’t want to read the 100-200 articles cited in the paper or book. Yet, that’s the only option I have…

  3. Emil OW Kirkegaard

    Banning p-tests outright is a little much. But we can make sure that papers in OpenPsych don’t get it wrong.

    I also mentioned an amusing example recently, where a paper had p=.04 and p=.06, and the authors concluded that group X and Y were different on attribute A, but not B, based on the minuscule difference in p values. I don’t know how that got through peer review.


© 2024 Human Varieties
