Bifactor Model Investigation of Spearman's Hypothesis

I’ve been catching up on recent research on psychometrics, behavioral genetics, race differences, and so on. I’ll be posting some comments on papers I found particularly interesting. The first is Frisby and Beaujean’s study of Spearman’s hypothesis.

Whites tend to outscore blacks by a wider margin on cognitive tests with higher g loadings. This is usually called Spearman’s hypothesis (the Spearman-Jensen effect would be a better name). It has traditionally been investigated using the method of correlated vectors (MCV), in which tests’ g loadings are correlated with the black-white gaps on those tests. MCV research has consistently found support for a weak form of Spearman’s hypothesis, according to which g is the major, but not the only, source of black-white differences in cognitive ability.
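To make the MCV idea concrete, here is a minimal Python sketch that correlates a vector of g loadings with a vector of standardized gaps. The numbers are made up purely for illustration and come from no study, and the function name `mcv_correlation` is my own. Real MCV analyses typically also correct for test reliability and often report rank correlations, both of which this sketch leaves out.

```python
import numpy as np

def mcv_correlation(g_loadings, gaps):
    """Pearson correlation between tests' g loadings and standardized group gaps."""
    g = np.asarray(g_loadings, dtype=float)
    d = np.asarray(gaps, dtype=float)
    if g.ndim != 1 or g.shape != d.shape or g.size < 3:
        raise ValueError("need two equal-length vectors with at least 3 tests")
    return float(np.corrcoef(g, d)[0, 1])

# Hypothetical values for six tests (NOT taken from any study).
g_loadings = [0.45, 0.55, 0.60, 0.70, 0.75, 0.80]
gaps = [0.50, 0.65, 0.60, 0.85, 0.80, 1.00]  # gaps in SD units

r = mcv_correlation(g_loadings, gaps)
print(f"MCV correlation: {r:.2f}")
```

A positive correlation is what Spearman’s hypothesis predicts. On its own it says nothing about measurement invariance, which is the criticism discussed next.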

MCV has been criticized for relying on untested assumptions. For example, it can indicate that a group difference is due to g even in the presence of severe test bias. MCV also presupposes the g factor model of intelligence to be correct and does not assess whether other models could account for black-white differences equally well.

Multi-group confirmatory factor analysis (MGCFA) has been presented as a superior method to test Spearman’s hypothesis. Dolan (2000) and Lubke et al. (2001) investigated the sources of black-white differences using MGCFA. They found, firstly, that test bias is unlikely to account for Spearman’s hypothesis or the gap in general as far as black-white comparisons in America are concerned. Secondly, they found that while the weak form of Spearman’s hypothesis is supported when using higher-order factor models, different models with correlated factors instead of g could explain black-white differences equally well.

The problem with the argument that non-g models could account for the black-white gap is, in my opinion, that it loses sight of the well-replicated correlation between g loadings and black-white differences. In correlated-factor models, the Spearman-Jensen effect is absorbed into the correlations between the factors. Even if non-g models fit the data as well as g models, they cannot explain the correlations between g loadings and gaps.

In their new paper, Frisby and Beaujean use MGCFA to study the black-white gap in the standardization sample of the Wechsler Adult Intelligence Scale-Fourth Edition and the Wechsler Memory Scale-Fourth Edition, which together comprise more than twenty tests. The novelty of their study is that they use the bifactor model to conceptualize individual differences in the tests. This contrasts with the higher-order factor models used in the MGCFA studies I described above. A higher-order factor model looks like this:

[Figure: higher-order factor (HOF) model]

In this model, both g and the five other factors explain individual differences in tests (V1-V15), but the influence of g on test scores is fully mediated by the other factors, which influence test scores directly. In contrast, a bifactor model looks like this:

[Figure: bifactor model]

In this model, all factors, including g, have a direct effect on test scores. The g factor is largely insensitive to the factor extraction method, so g loadings are much the same regardless of which factor model is used. In contrast, the bifactor model greatly changes the meaning and magnitude of the non-g factors compared to the higher-order model.

In the higher-order model, non-g factors always comprise both g variance and specific factor variance, with the result that all factors are positively intercorrelated. In the bifactor model, the non-g factors are uncorrelated with g and each other, and each of them comprises only specific factor variance. In the bifactor model the factors are “pure” representations of specific hypothesized abilities. This also means that non-g factors are much smaller in the bifactor model, explaining less of the test battery’s total variance. The variance explained by the g factor is unaffected by the choice of factor model.
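One way to see why the non-g factors shrink is the Schmid-Leiman transformation, which rewrites a higher-order solution in orthogonal form: a test’s g loading is its first-order loading times that factor’s loading on g, and what is left over goes to the residualized group factor. The sketch below uses hypothetical loadings and assumes simple structure (each test loads on one first-order factor). Note that this orthogonalized higher-order solution is a constrained special case; a bifactor model fitted directly, as in the paper, does not impose these proportionality constraints.

```python
import numpy as np

# Hypothetical higher-order solution: 6 tests, 2 first-order factors.
first_order = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.8],
    [0.0, 0.7],
    [0.0, 0.6],
])
second_order = np.array([0.9, 0.7])  # loadings of the two factors on g (hypothetical)

def schmid_leiman(first_order, second_order):
    """Return (g loadings, residualized group-factor loadings)."""
    # Indirect effect of g on each test, via the first-order factors.
    g_loadings = first_order @ second_order
    # Part of each first-order loading that is independent of g.
    group_loadings = first_order * np.sqrt(1.0 - second_order**2)
    return g_loadings, group_loadings

g_loadings, group_loadings = schmid_leiman(first_order, second_order)
print(np.round(g_loadings, 2))
print(np.round(group_loadings, 2))
```

In this toy example every residualized group loading is smaller than the first-order loading it came from, while each test’s communality (squared g loading plus squared group loading) is unchanged.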

Frisby and Beaujean fitted a bifactor model (g + 5 smaller, uncorrelated factors) to 20+ tests from the Wechsler batteries. Before that, they did an MCV analysis of black-white gaps on these tests and found a correlation of about 0.60 between black-white gaps and g loadings, which supports the weak version of Spearman’s hypothesis. In the MGCFA analysis, they similarly found that the g factor is the major source of black-white differences but that other factors contribute as well. Because they found that strict measurement invariance holds between races, all racial differences in mean test scores could be explained by racial differences in the means of the latent factors. Here’s a table of their major results:

[Table: Frisby & Beaujean results]

The standardized white-black gap on the latent g factor is 1.16 SDs, which is somewhat greater than the 1.06 SD gap on WAIS full-scale IQ. This difference (although probably non-significant) is expected because the g loading of full-scale IQ scores is not 1.0 but more like 0.9. On the other factors, there’s a small white advantage on the Verbal Comprehension factor and a large white advantage on the Visual Processing factor. On the Long-Term Retrieval factor, there’s a substantial gap in favor of blacks, but there are no discernible racial differences on the Working Memory and Processing Speed factors. This pattern of differences is similar to that found in earlier studies, such as Jensen & Reynolds (1982). As there were no significant racial differences in the variances of the latent factors, these gaps are straightforward to interpret: the standardized factor variances and SDs for each race are 1.
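As a rough back-of-the-envelope check on the full-scale IQ point, the snippet below multiplies the latent g gap by the approximate g loading of full-scale IQ. It assumes, for simplicity, that g is the only factor contributing to the full-scale gap, which is my simplification and not something the paper claims.

```python
latent_g_gap = 1.16    # SDs, as reported above
fsiq_g_loading = 0.9   # approximate g loading of full-scale IQ, as stated above

# If only g contributed, the gap on the composite would be attenuated
# by the composite's g loading.
expected_fsiq_gap = fsiq_g_loading * latent_g_gap
print(f"{expected_fsiq_gap:.2f}")  # 1.04
```

That lands close to the observed 1.06 SD gap on full-scale IQ.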

Note that in the bifactor model factors are orthogonal, so a racial gap on an ability factor does not necessarily mean that there’s a corresponding gap on tests that assess that ability, given that individual and group differences on tests are typically determined by multiple factors. For example, blacks have an advantage of 0.35 SDs on the Long-Term Retrieval factor, but whites outscore blacks on each of the five memory tests that define that factor. This is because those five tests have higher loadings on the g factor than on the Long-Term Retrieval factor (two of them also load on other factors). This is a nice demonstration of the primacy of g: even tests specifically designed to assess memory capabilities tap more into g than into specific memory skills.
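Under strict measurement invariance, the model-implied gap on a test is simply the sum of its loadings times the corresponding latent mean differences. Here is a small sketch of that arithmetic. The factor gaps are the ones reported above, but the loadings are hypothetical and chosen only to show how a black advantage on a specific factor can coexist with a white advantage on the test.

```python
def implied_test_gap(loadings, factor_gaps):
    """Model-implied gap on one test: sum of loading * latent mean difference."""
    return sum(loadings[f] * factor_gaps[f] for f in loadings)

# Latent gaps in SD units; positive = white advantage (values from the post).
factor_gaps = {"g": 1.16, "long_term_retrieval": -0.35}

# Hypothetical standardized loadings for a single memory test.
loadings = {"g": 0.60, "long_term_retrieval": 0.40}

gap = implied_test_gap(loadings, factor_gaps)
print(f"{gap:.3f}")  # 0.60 * 1.16 - 0.40 * 0.35 = 0.556
```

With these made-up loadings the g contribution (about 0.70) swamps the Long-Term Retrieval contribution (about -0.14), so the test still shows a white advantage.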

Frisby and Beaujean’s paper is a nice addition to the debate. They quantify the contribution of g and other factors to black-white differences and explain the MCV results while keeping a close eye on psychometric issues. Even so, the paper is unlikely to convince those who dispute the significance of the Spearman-Jensen effect. It is always possible to argue that g is an artefact and that the association between g loadings and black-white gaps is due to some mysterious forces that happen to be collinear with g.

A limitation of Frisby and Beaujean’s study is that their black sample is small (N=140-180), while their white sample is largish (N=590-835). This makes their measurement invariance evaluation somewhat suspect: because the group sizes are lopsided, the fit indices they report mostly reflect fit in the larger, white sample, and misfit in the black sample may go unnoticed. However, strict measurement invariance across blacks and whites in IQ tests is a routine finding by now, so this is probably not a big problem.
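To put a number on the lopsidedness: in multi-group maximum likelihood estimation, the overall discrepancy is, as far as I know, a sample-size-weighted sum of the per-group discrepancies. The sketch below computes those weights using N_g / N (some software uses (N_g - 1) / (N - G) instead). The sample sizes are round figures I picked from within the ranges above, not the exact Ns from the paper.

```python
def group_weights(sample_sizes):
    """Share of the total sample contributed by each group (N_g / N)."""
    total = sum(sample_sizes.values())
    return {group: n / total for group, n in sample_sizes.items()}

# Hypothetical round numbers within the reported ranges.
sample_sizes = {"black": 160, "white": 700}

weights = group_weights(sample_sizes)
for group, w in weights.items():
    print(f"{group}: {w:.2f}")
```

With these figures the white sample carries roughly four-fifths of the weight in the overall fit, so a given amount of misfit in the black sample moves the overall fit much less than the same misfit in the white sample would.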


Frisby, C.L., & Beaujean, A.A. (2015). Testing Spearman’s hypotheses using a bi-factor model with WAIS-IV/WMS-IV standardization data. Intelligence, 51, 79–97.


  1. Meng Hu

    Good thing that you posted this article. I have been attempting bifactor models for a while, but they usually fail to converge or make it impossible to compute standard errors (i.e., the model is probably unidentified). I have no idea how to do this correctly, and Beaujean (when I emailed him) seemed unable to find a solution to my problem. I don’t really know if it depends on the data set, however. When I replicated Dolan’s (2000) study, I saw no problem with the bifactor model that I fitted in R (by the way, Dolan’s LISREL syntax for the bifactor model shows that the weak version of SH has a superior fit to either the strong version of SH or the no-SH model). But with the WJ data used in Murray’s (2007) analysis of BW changes over time, I ran into the aforementioned problems with the bifactor model. I will try to ask some people, and if I can’t resolve the problem, I will publish the results without bifactor modeling.

    Concerning the paper, what I appreciate most is that I am now convinced that the bifactor model has many advantages over the higher-order factor model usually applied. That’s interesting, especially concerning the gender gap in IQ, which is usually thought and found to be non-g. The only study using a bifactor model was Paul Irwing’s (2012) “Sex differences in g: An analysis of the US standardization sample of the WAIS-III”. And unlike the other studies, it found that the gap has something to do with g.

  2. Emil Kirkegaard

    I have built a simple simulator for Jensen’s method here:

    I am in the process of building an advanced simulator where one can specify all the factor loadings, number of subtests, number of group factors, and the group difference in each factor. Then the simulator will automatically fit bi-factor and higher-order models to this, perform measurement invariance tests etc. All in a large Shiny app.

    The app is being developed here. So you can just run this to try the latest version (may have bugs!).


© 2024 Human Varieties
