I analyze two studies that provide the data needed to study test-retest effects, namely Watkins et al. (2007) and Schellenberg (2004, 2006). Both used Wechsler subtests, and in both the IQ changes across subtests correlate negatively with the subtests’ g-loadings, in line with earlier studies on this topic.

Introduction. Skuy et al. (2002), Coyle (2006), te Nijenhuis et al. (2001, 2007), Reeve & Lam (2005, 2007), Lievens et al. (2007), Matton et al. (2009, 2011), Freund & Holling (2011), and Arendasy & Sommer (2013) were able to demonstrate that test-retest effects, or gains obtained through educational means, are not related to the g factor.

In their analysis of the secular gain, te Nijenhuis et al. (2007) showed a perfect (true) negative correlation with g. They also re-analyzed the Skuy et al. (2002) data and found that the IQ gains evidenced in black, white, Indian and colored participants who took the Mediated Learning Experience (MLE) were not g-loaded. The cognitive test was the well-known Raven’s Standard Progressive Matrices (RSPM). Concretely, the authors correlated gain scores with pretest RSPM scores. Also, the decline they found in the total variance explained by the first unrotated factor, or g, in the post-test scores is likely due to the higher test sophistication following MLE. This matter was discussed earlier by te Nijenhuis et al. (2001, p. 306). Skuy et al. (2002) had shown that those gains on the RSPM do not generalize to other tests that require abstract thinking (e.g., Stencil), and thus again are not g-loaded, even though the black South African students improved more than the white students did. Matton et al. (2011) found more or less the same thing. While retakers scored higher than first-time test takers on ‘old’ tests, retakers and first-time test takers did not score differently on ‘new’ tests. They argue that differences in mean structure suggest violation of measurement invariance.

Coyle (2006, Table 3) found, using a mix of achievement and intelligence tests, that while the SAT is highly loaded on PC1, the SAT changes loaded significantly on PC2, with their loading on PC1 being (near) zero. He also demonstrates in a second analysis that SAT scores predict GPA whereas SAT changes do not (Table 5).

Reeve & Lam (2007, Table 3) also correlate the vector of gains with the vector of g-loadings and find a significant negative association. They also found (Table 4) that while test-taking motivation is somewhat associated with score gains in the IQ composite, the motivational factor does not display a consistent pattern of correlations with score gains among the IQ scales: negative in some domains and positive in others. If such a relationship is established while measurement invariance in retest effects is violated, this would mean that motivation does not improve g itself. This aside, the same authors (2005) had found in a previous study that measurement invariance was tenable, and that the g factors (derived from independent CFAs) at day #1, day #2 and day #3 were strongly correlated (Table 5). They also showed (Table 6) that the practice effect does not cause a significant change in criterion-related validity, as the correlation of the g-factor score with self-reported GPA does not change significantly across days #1, #2 and #3. Nevertheless, they note: “Given that not all applicants may have had exposure to practice, differences in observed test scores may not accurately reflect individual differences on the construct of interest (i.e., g). That is, although the indicators on the test continue to relate to g and to narrow group factors in the same way across testing occasions, the observed total test score is likely to increase due to changes on either test-specific skills (i.e., skills not shared across the various scales) or other non-cognitive constructs. Thus, applicants who re-test are essentially being given the opportunity to boost observed scores by practicing those non-ability components. 
In addition to questions of fairness, such differences might alter the predictive validity of the observed total test scores (Sackett et al., 1989), even though the predictive validity of g and narrow group factors would remain unchanged.” (p. 546).

The finding in that paper is discussed in Lievens et al. (2007, p. 1680), who showed that test-retest comparisons indicate measurement bias. Statistically, they test the tenability of four levels of invariance: 1) configural invariance (i.e., equality in number of factors and factor loading pattern across groups), 2) metric invariance (i.e., equality of factor loadings), 3) scalar invariance (i.e., equality of intercepts), and 4) uniqueness invariance (i.e., equality of error terms). Rejection of metric invariance suggests that the test measures different factors across groups, while rejection of scalar invariance suggests that items are of unequal difficulty across groups (that is, item difficulty depends on group membership). The second level is known as weak measurement invariance, the third as strong invariance, and the fourth as strict invariance. Measurement invariance holds when at least the strong level is tenable. What they found in fact is that metric and uniqueness invariance are violated. Increases in scores are due to test uniqueness, consistent with Lubinski’s theory, as they describe it: “Lubinski’s (2000) suggestion that practice builds up “nonerror uniqueness” components of ability tests – factors such as method-specific knowledge [1] (aka, test-wiseness), specific item content knowledge, or narrow skills unique to the item content (e.g., memorization of numbers).” (p. 1675). The non-error uniqueness they are talking about refers to the fact that uniqueness per se carries both random error and systematic variance not shared with other indicators. Their predictive bias analysis supports the view that practice effects do not affect g, as they write: “These results reveal that the general factor derived from the retested data (i.e., Group A2) did not predict GPA (r = .00, ns), whereas the general factor derived from the group who did not retest (Group B) did predict GPA significantly (r = .48, p < .01).” (p. 1678). 
As additional analyses (Table 5), they correlated the latent general factor with scores on a memory test that was included in the full ability battery, and with an independent g score based on the scale scores of the remainder of the cognitive battery (obtained by conducting a principal factor analysis and retaining the first unrotated principal factor, which they call the GCA variable). The correlation between the latent general factor and GCA was higher for the one-time test takers (Group B) than for the two-time test takers (Group A2 for the second test taking, A1 for the first), with r of 0.40 versus 0.22. Also, the correlation between the latent factor and memory increases with re-administration (from -0.03 to 0.29). Furthermore, the latent factor derived from retest scores was correlated more strongly with memory (r=0.29) than it was with the GCA variable (r=0.14). This finding, they insist, is consistent with Reeve & Lam (2005, pp. 542-543), who found that after each repeated administration, the variance accounted for by the short-term memory factor increases while the g, verbal, visual-spatial, and quantitative factors do not vary. Finally, using Jensen’s MCV, Lievens et al. report only a moderate correlation (r=0.27) between the vectors of factor scores derived separately from groups A1 and A2.
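The four invariance levels described above can be written compactly in the usual multi-group factor-model notation. This is only a sketch in standard textbook symbols (the symbols $\tau$, $\Lambda$, $\eta$, $\Theta$, $\kappa$ are my assumption, not the notation of Lievens et al.), with $k$ indexing the group or testing occasion:

```latex
% Measurement model for the vector of observed scores in group (or occasion) k
x^{(k)} = \tau^{(k)} + \Lambda^{(k)} \eta^{(k)} + \varepsilon^{(k)},
\qquad \operatorname{Cov}\!\big(\varepsilon^{(k)}\big) = \Theta^{(k)}

% 1) Configural: same number of factors and same pattern of free/zero loadings in every \Lambda^{(k)}
% 2) Metric (weak):   \Lambda^{(1)} = \Lambda^{(2)}
% 3) Scalar (strong): \Lambda^{(1)} = \Lambda^{(2)}, \; \tau^{(1)} = \tau^{(2)}
% 4) Strict:          \Lambda^{(1)} = \Lambda^{(2)}, \; \tau^{(1)} = \tau^{(2)}, \; \Theta^{(1)} = \Theta^{(2)}

% Under scalar invariance, with \kappa^{(k)} = E[\eta^{(k)}], observed mean differences
% are carried entirely by latent mean differences:
E\big[x^{(2)}\big] - E\big[x^{(1)}\big] = \Lambda \big(\kappa^{(2)} - \kappa^{(1)}\big)
```

Read this way, a violation of scalar invariance means that part of the observed score gain sits in the intercepts $\tau$ rather than in the latent factor means, which is why strong invariance is the minimum needed before gains can be interpreted as gains on the factors themselves.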

Matton et al. (2009) also tested measurement invariance and concluded that metric invariance is violated, but also that gain scores could be explained by common situational effects, since invariance in error variances was retained, which means that errors at both times were correlated. They argue that earlier studies may have tested whether or not score gains reflect test-specific abilities, but not whether score gains could be attributed to situational effects. These encompass all effects due to the specificity of the state of the person in the current situation, which they describe as follows: “Situational effects were first formalized within the SEM framework in the Latent State-Trait Theory (Steyer, Ferring, & Schmitt, 1992; Steyer, Schmitt, & Eid, 1999). This theory states that any test score measures characteristics of the person (traits), but also measures characteristics of the situation and characteristics of the interaction between person and situation. Taken together these factors create a psychological state specific to the situation to which the person is exposed. Following this theory, a test never measures trait differences only but also individual differences due to situational effects.” (p. 413). For an illustration, see Jensen’s analogy of shadow measurement (1998, p. 312).

Freund & Holling (2011, pp. 238-239) were able to demonstrate that score gains on computer-generated matrix items violate the invariance of the item difficulty parameter. They compare four groups: 1) training + identical retest, 2) training + parallel retest, 3) no training + identical retest, 4) no training + parallel retest. The distinction is important because training effects, as opposed to practice effects, involve interventions of some kind. It is known, as they say (p. 234), that effect sizes for training effects surpass the effect sizes for practice or retest effects. The authors report higher retest gains for the training groups (vs. control). Although they also report higher gains for identical (vs. parallel) test forms, these differential gains nevertheless disappear altogether once individual variation in general intelligence has been controlled.

Arendasy & Sommer (2013) provide another test of measurement invariance using MGCFA. At the test score level, they found that “the strong measurement invariance model (M3) assuming equal intercepts across test forms fitted the data significantly worse than the weak measurement invariance model (M2)”, even though strict measurement invariance across test forms is supported at the item level. They finally test the relationship between score gains and g-saturation using MCV. This correlation was found to be -0.29, although only four tests were used. They conclude that retesting induces uniform measurement bias. The authors suggest (p. 184) that the distinction between identical and alternate retest forms must be taken into account because it might influence the results. Their comment on the previous findings mentioned above is worth considering:

In line with Reeve and Lam (2005) our results indicated strict measurement invariance within- and across test administration sessions at the item level; indicating that retest score gains are attributable to an increase in narrower cognitive abilities. Although this finding confirmed our hypothesis regarding the two alternate retest forms, we would have expected to find measurement bias at the item level in case of the identical retest forms. The finding that measurement invariance at the item level can even be assumed for identical retest forms contradicts previous research findings (cf. Freund & Holling, 2011; Lievens et al., 2007). Several design characteristics of our study may account for this seemingly conflicting finding. First, Freund and Holling (2011) never examined measurement invariance across test form in a between-subject design. Therefore their finding that retesting induces uniform measurement bias in case of identical retest forms could also be due to differences in the psychometric characteristics of the two test forms. …

The interpretation of retest effects in terms of an increase in narrower cognitive abilities has also been supported in our multigroup confirmatory factor analyses. The results indicated that weak measurement invariance can be assumed, which means that retesting does not affect the g-saturation of the four cognitive ability tests. However, retesting induced a uniform measurement bias, which indicated that retest score gains are confined to narrower cognitive abilities and do not generalize to psychometric g.

Overall, the conclusions from these papers appear consistent with the failure of most educational interventions (e.g., the Milwaukee Project) to generalize IQ gains (Herrnstein & Murray, 1994, pp. 408-409; Jensen, 1998, pp. 340-342). Another interesting finding is from Ritchie et al. (2013). They show that education (controlling for childhood IQ score) was positively associated with IQ at ages 79 (sample #1) and 70 (sample #2) but that there was no improvement in processing speed, which strongly suggests that education does not improve g.

Study #1. The details of the Watkins sample (N=289) are described in Watkins et al. (2007), a paper worth reading regardless of the present analysis, so there is no need to repeat them here. Watkins et al. (2007, Table 3) provide reliability coefficients for each test. I did not use them for the method of correlated vectors below, but I afterwards recomputed r(g*d) with a correction for unreliability. Correcting the d changes did not affect the correlations, but the corrected PC1 (not the corrected PC2) produced lower negative correlations with g-loadings. Anyway, the cognitive battery used is a mix of WISC-III and achievement tests. The test-retest interval was 2.8 years.
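For readers who want to try the correction for unreliability themselves, here is a minimal sketch. It assumes the common approach of dividing each subtest's value (a gain d or a loading) by the square root of that subtest's reliability; the exact procedure behind the numbers reported here may differ, and the function name and the values below are hypothetical, not taken from Watkins et al.

```python
import math

def correct_for_unreliability(values, reliabilities):
    """Divide each subtest value by sqrt(reliability) of that subtest.

    Assumption: this is the classical correction for attenuation applied
    subtest by subtest; it is a sketch, not the procedure of any cited paper.
    """
    if len(values) != len(reliabilities):
        raise ValueError("values and reliabilities must have the same length")
    corrected = []
    for value, rel in zip(values, reliabilities):
        if not 0 < rel <= 1:
            raise ValueError("each reliability must lie in (0, 1]")
        corrected.append(value / math.sqrt(rel))
    return corrected

# Hypothetical gains (in SD units) and reliabilities for two subtests.
gains = [0.30, 0.10]
reliabilities = [0.81, 0.64]
corrected_gains = correct_for_unreliability(gains, reliabilities)
```

Because the less reliable subtests receive the larger upward correction, the corrected vector can correlate differently with g-loadings than the raw one, which is the kind of shift described above.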

The first analysis displays the scatterplot of IQ changes against g-loadings. In a second analysis (see Coyle 2006), I have added the IQ gains for each subtest to the initial (sub)test intercorrelations (see the attached Excel file at the bottom of the post) and re-run a principal component analysis.
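The second analysis can be sketched as follows: append the gain variable to the correlation matrix, extract the first two principal components, and compare where the tests and the gain load. This is only an illustration under stated assumptions. The correlation matrix is made up (three hypothetical tests plus one hypothetical gain variable), and the components are extracted with a plain power iteration rather than a statistics package.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def top_component(m, iters=5000, tol=1e-12):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]  # non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigval = 0.0
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector collapsed to zero")
        w = [x / norm for x in w]
        new_eigval = sum(a * b for a, b in zip(w, mat_vec(m, w)))
        done = abs(new_eigval - eigval) < tol
        v, eigval = w, new_eigval
        if done:
            break
    return eigval, v

def pc_loadings(corr, n_components=2):
    """Loadings (eigenvector * sqrt(eigenvalue)) of the leading components."""
    n = len(corr)
    work = [row[:] for row in corr]
    loadings = []
    for _ in range(n_components):
        eigval, vec = top_component(work)
        # The sign of a component is arbitrary: make its largest loading positive.
        if max(vec, key=abs) < 0:
            vec = [-x for x in vec]
        loadings.append([math.sqrt(eigval) * x for x in vec])
        # Deflate: remove the extracted component before finding the next one.
        work = [[work[i][j] - eigval * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return loadings

# Hypothetical correlation matrix: three tests followed by one gain variable.
labels = ["test_A", "test_B", "test_C", "gain"]
corr = [
    [1.0, 0.6, 0.5, 0.0],
    [0.6, 1.0, 0.4, 0.1],
    [0.5, 0.4, 1.0, -0.1],
    [0.0, 0.1, -0.1, 1.0],
]
pc1, pc2 = pc_loadings(corr)
for name, l1, l2 in zip(labels, pc1, pc2):
    print(f"{name:8s} PC1={l1: .2f} PC2={l2: .2f}")
```

In this made-up matrix the three tests load on PC1 while the gain variable loads almost entirely on PC2, which is the pattern Coyle (2006) reports for SAT changes. With real data, the loadings from the Excel file should of course be used instead.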

Test-retest effect - no g gains (fig.1)

Test-retest effect - no g gains (table.1)

It is clear from the above that IQ changes are not related to g, and have high loadings on non-g variance. The same phenomenon can be seen in Rushton (1999). Further comment is not needed, as the numbers speak for themselves.

Study #2. Schellenberg (2004, 2006) gives the details for this (seemingly non-random) sample (N=144). The experimental group comprises 72 children, and the control group 72 as well. Both groups are composed of 2 subgroups (keyboard and voice for the experimental, drama and no lessons for the control), but given these moderate sample sizes, I chose not to divide the groups any further. The cognitive tests used are the WISC-III and the Kaufman Test of Educational Achievement (K-TEA), but because I can’t find the g-loadings for the K-TEA, I left it out. The test-retest interval was 1 year, as the groups under study received music lessons for one year.

Test-retest effect - no g gains (fig.2)

Test-retest effect - no g gains (fig.3)

Of particular interest, as we can already see, is that the negative correlation between IQ gains and g is larger in the experimental than in the control group. This was also the case in the te Nijenhuis et al. study (2007, p. 294). It appears that the gains were larger on the less g-loaded subtests. It should be noted, nevertheless (Schellenberg, 2006, pp. 461-462), that music lessons seem to be positively associated with academic performance even after individual differences in general intelligence were held constant. Interestingly, although the long-term association between music lessons and cognitive abilities is well supported in his (2006) study, he admits the following: “In Study 2 (undergraduates), each additional year of playing music regularly was accompanied by an increase in FSIQ of one third of a point (b = .333, SE = .134), after partialing out effects of parents’ education, family income, and gender. In childhood, then, six years of lessons (assuming 8 months of lessons per year) was associated, on average, with an increase in FSIQ increase of approximately 7.5 points, which is half a standard deviation and far from trivial. But the same 6 years of playing music regularly in childhood were predictive of an increase in FSIQ of only 2 points in early adulthood. In other words, short-term associations were stronger than long-term associations, which is in line with other findings indicating that associations between cognitive functioning and environmental factors decline throughout childhood and adolescence (Plomin et al., 1997).” (p. 465).
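The arithmetic behind the quoted comparison is easy to check, using only the figures in the quote and the conventional FSIQ standard deviation of 15:

```latex
% Long-term (adult) association: 6 years at b = .333 FSIQ points per year
6 \times 0.333 \approx 2.0 \ \text{points} \quad\Rightarrow\quad 2.0 / 15 \approx 0.13 \ \text{SD}

% Short-term (childhood) association reported for the same 6 years
7.5 \ \text{points} \quad\Rightarrow\quad 7.5 / 15 = 0.5 \ \text{SD}
```

So the long-term association is roughly a quarter of the short-term one (2.0 / 7.5 ≈ 0.27).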

Discussion. Studies reporting (IQ) subtest gains or changes are relatively rare, so further analyses are needed. Some scholars (Dragt, 2010; Smit, 2011; Repko, 2011; Metzen, 2012; te Nijenhuis & van der Flier, 2013) have studied the relationship between g and gains, and their work may be worth reading. Also, as I said previously, MCV suffers from the small number of subtests: even 10 subtests are not enough for the results to be very robust, and a single subtest can be expected to determine most of the strength of the correlation, and its sign as well. This was obvious in my analysis of the Capron & Duyme adoption study, due to the Coding subtest. It does not mean we should remove this subtest, but that a larger test battery or further replication is needed. In any case, here’s the Excel file (XLS) with additional numbers and correlation matrices for these two studies.
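The sensitivity of MCV to a single subtest can be illustrated with a leave-one-out check: recompute the correlation between g-loadings and gains after dropping each subtest in turn. The sketch below uses made-up numbers (six hypothetical subtests, the last one with a low g-loading and a large gain); it is not a re-analysis of either study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def leave_one_out_mcv(names, g_loadings, gains):
    """MCV correlation recomputed with each subtest removed in turn."""
    results = {}
    for i, name in enumerate(names):
        g = g_loadings[:i] + g_loadings[i + 1:]
        d = gains[:i] + gains[i + 1:]
        results[name] = pearson(g, d)
    return results

# Hypothetical subtests: S6 has a low g-loading and a large gain.
names = ["S1", "S2", "S3", "S4", "S5", "S6"]
g_loadings = [0.80, 0.75, 0.70, 0.65, 0.60, 0.40]
gains = [0.20, 0.15, 0.25, 0.10, 0.15, 0.90]

full_r = pearson(g_loadings, gains)
loo = leave_one_out_mcv(names, g_loadings, gains)
print(f"all subtests: r = {full_r:.2f}")
for name, r in loo.items():
    print(f"without {name}: r = {r:.2f}")
```

With these numbers the full correlation is clearly negative, stays negative when any of S1 to S5 is dropped, and turns positive once S6 is removed. That is the situation described above, where one subtest drives both the size and the sign of the result.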


Marley W. Watkins, Pui-Wa Lei, Gary L. Canivez. 2007. Psychometric intelligence and achievement: A cross-lagged panel analysis.
E. Glenn Schellenberg. 2004. Music Lessons Enhance IQ.
E. Glenn Schellenberg. 2006. Long-Term Positive Associations Between Music Lessons and IQ.