IQ and Personality: What James Heckman Got Wrong

A few years ago James Heckman, together with some other economists, published a study arguing that “achievement tests” and “IQ tests” are different beasts: the former, they claim, are better predictors of criterion outcomes (such as grade point averages) and are more strongly influenced by personality differences than the latter. Like most of Heckman’s forays into psychometrics (he has been obsessed with trying to shoot down Bell Curve-type arguments ever since the book was released), the study leaves much to be desired. David Salkever has published a nifty reanalysis of Heckman and colleagues’ study, showing that their results stem from faulty imputation and a failure to take into account age effects.

Borghans, Heckman et al. analyzed the nationally representative NLSY79 sample and a smaller Dutch high school sample. They argue that the AFQT used in the NLSY79 sample and a similar test, the DAT, used in the Dutch sample, are achievement tests measuring learned skills, not IQ tests. They compare these “achievement tests” to various others that they call IQ tests, arguing that the latter are relatively pure measures of cognition. In the NLSY sample there are several different “IQ tests”, including the Otis Lennon Mental Ability Test, the Lorge-Thorndike Intelligence Test, and the Wechsler Intelligence Scale for Children. In the Dutch sample “IQ” was measured with Raven’s Progressive Matrices.

The results of Borghans, Heckman et al. are shown in the following graphs:

The graphs show R-squared values obtained when “achievement test” scores and grades are regressed on “IQ test” scores and various measures of personality in the American (NLSY79) and Dutch (Stella Maris) samples. It can be seen that IQ and personality both independently explain variance in achievement tests and grades.

The authors also report correlations between “IQ”, AFQT, and GPA in the NLSY79:

What is notable about this table is that the IQ-AFQT correlation is only moderately strong at 0.65, suggesting that they are not two measures of the same thing. Moreover, the correlation of GPA with AFQT is higher than with IQ, suggesting that AFQT is a better measure of success in school.

Based on the results shown in these three graphs/tables, the authors conclude that achievement tests are better predictors of important outcomes and that their superiority is due to them measuring personality variables in addition to cognitive ability whereas, they argue, “IQ tests” measure only cognitive ability. Their suggested causal model looks like this:

In this model, IQ and personality both independently influence (paths a and b) scores on an achievement test like the AFQT. The authors don’t use any research design — such as a behavioral genetic or longitudinal one — that might allow them to make strong causal inferences. They base their conclusions on the fact that the relative sizes of the associations between variables in their cross-sectional data are just what the model depicted above would produce.

However, the distinction Heckman and colleagues make between achievement tests and IQ tests is prima facie implausible. If the achievement tests they analyze had been tailored to match, say, the curriculum requirements of a specific school district, they might have a point. However, the AFQT and the DAT are in fact quite abstract and general, and there is much content overlap between them and what Heckman et al. call IQ tests. Because of the principle of indifference of the indicator, overall scores from these various tests are all expected to load strongly on the g factor and correlate highly with each other. Furthermore, to the extent that personality traits are correlated with IQ, the correlations should be approximately the same regardless of the test used.

So why did Heckman et al. get the results they did? David Salkever shows in his reanalysis of the NLSY79 that it was due to two mistakes.

Firstly, while the NLSY79 has AFQT scores for all ~12,000 participants, it has “IQ scores” (collected from school transcript data) for fewer than 3,000 participants. To make scores from different IQ tests comparable, Heckman et al. used an imputation procedure to convert them to percentile scores for individuals who had standard IQ scores but no percentile scores. This procedure was used for about two-thirds of the sub-sample with reported IQ scores, and they used these imputed data in all their analyses (this fact is mentioned only in the study’s web appendix which is not hosted at the journal’s site). Salkever shows that using only the actual, non-imputed IQ percentile scores, the IQ-AFQT correlation rises from 0.65 to 0.75. Dropping the imputed data also halves the increase in AFQT R-squared due to personality factors from 0.05 to 0.025. While imputation is supposed to produce more accurate parameter estimates, it appears that the procedure used by Heckman et al. produced highly noisy estimates of IQ scores. This may have been because two-thirds of the data were missing, or because the assumptions required for successful imputation were violated (IQ data were not missing at random).
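To illustrate why noisy imputed scores would pull the IQ-AFQT correlation down, here is a minimal simulation sketch. It is not Heckman et al.'s imputation procedure or Salkever's reanalysis: the function name, the noise level (`noise_sd`), and the idea of modelling imputation error as added random noise are all assumptions made for illustration. Only the 0.75 correlation and the two-thirds imputed share are taken from the text above.

```python
import numpy as np

def simulate_attenuation(n=3000, true_r=0.75, noisy_share=2 / 3,
                         noise_sd=0.8, seed=0):
    """Compare the IQ-AFQT correlation with and without noisy 'imputed' scores.

    All parameter values are illustrative assumptions, not estimates
    from the NLSY79.
    """
    rng = np.random.default_rng(seed)
    # Standardized 'real' IQ scores and AFQT scores correlated at true_r.
    iq = rng.standard_normal(n)
    afqt = true_r * iq + np.sqrt(1 - true_r**2) * rng.standard_normal(n)
    # Mark roughly two-thirds of cases as imputed and add noise to them only.
    imputed = rng.random(n) < noisy_share
    iq_observed = iq + imputed * noise_sd * rng.standard_normal(n)
    # Correlation using only non-imputed cases vs. using everyone.
    r_clean = np.corrcoef(iq[~imputed], afqt[~imputed])[0, 1]
    r_mixed = np.corrcoef(iq_observed, afqt)[0, 1]
    return r_clean, r_mixed

r_clean, r_mixed = simulate_attenuation()
print(f"non-imputed only: {r_clean:.2f}, imputed included: {r_mixed:.2f}")
```

Under these assumed values the full-sample correlation comes out lower than the correlation among the non-imputed cases, which is the same direction as the change Salkever reports when the imputed data are dropped.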

The second mistake identified by Salkever concerns the age at which the various tests were taken. All NLSY79 participants took the AFQT in 1980 when they were between 14 and 22 years old. The personality tests were taken in 1979. In contrast, the “IQ tests” in the NLSY data were typically taken between 1965 and 1975, when many participants were still young children. This is problematic because the stability of IQ is quite low in early childhood, and the rank order of intelligence within age cohorts doesn’t crystallize until the teenage years. The classic Berkeley Growth Study from the mid-20th century, in which a sample (N=61) repeatedly took IQ tests from infancy to age 17, shows this neatly:

Salkever dealt with the age problem by restricting the analysis to tests that were taken after early childhood. Even with such restrictions, sample sizes remained decent. For example, when only IQ tests taken after 1974 are included, the IQ-AFQT correlation is 0.82 (N=539). Correlations this high suggest that the tests measure the same thing, and differences between them are mainly due to measurement error. The AFQT variance that personality traits explain independently of IQ also decreases with appropriate age restrictions — the actual increase in R-squared due to personality may be as little as 1 percentage point, compared to 5 points in Heckman et al. (further reducing the time gap between IQ and personality tests would probably eliminate the predictive value of personality altogether). The following table shows Salkever’s main results:

Salkever looked only into the NLSY79 data, but Heckman’s Dutch data are no more convincing. Firstly, the Dutch sample includes students from only one of the three academic tracks in a high school, precluding generalizable conclusions. Secondly, the IQ test used in the Dutch sample consists of eight Raven items, while the DAT “achievement test” it is compared to is a full battery. The reliability of such a short Raven’s test is so low that comparisons are not meaningful.
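The reliability point can be made concrete with the Spearman-Brown prophecy formula, which predicts how reliability changes when a test is shortened or lengthened. The numbers below are assumptions for illustration only: the source does not say which Raven's version the eight items came from, so a 60-item full test with a reliability of 0.90 is simply assumed here.

```python
import math

def spearman_brown(reliability, length_ratio):
    """Predicted reliability after changing test length by `length_ratio`.

    length_ratio = new number of items / original number of items.
    """
    return (length_ratio * reliability) / (1 + (length_ratio - 1) * reliability)

# ASSUMED values, not taken from Borghans et al. or Salkever:
full_items = 60          # assumed length of the full matrices test
full_reliability = 0.90  # assumed reliability of the full test
short_items = 8          # the eight-item version mentioned in the text

short_reliability = spearman_brown(full_reliability, short_items / full_items)
# Upper bound on the correlation of the short test with any other measure,
# even a perfectly reliable one (square root of the reliability).
max_correlation = math.sqrt(short_reliability)

print(f"eight-item reliability: {short_reliability:.2f}")  # roughly 0.55
print(f"ceiling on observed correlations: {max_correlation:.2f}")
```

If the assumptions are anywhere near right, almost half of the variance in the eight-item score is measurement error, so a weak association between it and the DAT says little about whether the two tests measure different things.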

Based on these reanalyses it’s obvious that Heckman’s distinction between achievement tests and IQ tests is untenable. The AFQT and the DAT are no less IQ tests than, for example, Wechsler’s tests. Moreover, personality variables are similarly associated with all IQ tests.

The actual causal model explaining associations between IQ, AFQT, and personality in the NLSY79 may be as simple as this:

In this model, IQ and AFQT both have the same loading a on the g factor. If we assume that the IQ-AFQT correlation is 0.84, as in Salkever’s Model E (see table above), then a=0.92, which is a plausible value for a g loading of composite test scores.
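The arithmetic behind that value can be spelled out as follows, assuming standardized scores and test-specific residuals that are uncorrelated with g and with each other (the residual symbols are introduced here for illustration):

```latex
\begin{align*}
\mathrm{IQ}   &= a\,g + e_{1}, \qquad \mathrm{AFQT} = a\,g + e_{2},\\
r_{\mathrm{IQ},\mathrm{AFQT}} &= a \cdot a = a^{2},\\
a &= \sqrt{0.84} \approx 0.917 \approx 0.92 .
\end{align*}
```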

Heckman et al. assume that the correlation between personality and cognitive test scores is due to the former influencing the latter, but the opposite may well be true, as in the model above (path b). While (the rank order of) intelligence stabilizes in teenage years, personality traits (or at least self-reports thereof) remain highly volatile well into one’s 20s, suggesting that intelligence is more likely to be the source trait. The correlation may also be due to pleiotropy. There’s a substantial psychometric literature on the associations between IQ and personality, but Heckman et al. largely ignore it.
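A quick simulation suggests how a single-factor model of this kind could produce an apparent independent contribution of personality when the IQ measure is noisy. This is a sketch under stated assumptions, not a reanalysis of the NLSY79: the function names, the sample size, and the g-personality path of 0.3 are hypothetical, and personality is given no direct effect on AFQT at all.

```python
import numpy as np

def r_squared(y, predictors):
    """R-squared from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

def incremental_r2(iq_loading, afqt_loading=0.92, b=0.3, n=20000, seed=1):
    """Extra AFQT variance 'explained' by personality beyond IQ.

    Simulates g -> IQ, g -> AFQT, and g -> personality (path b).
    All loadings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n)

    def indicator(loading):
        # Standardized score: loading on g plus independent unique variance.
        return loading * g + np.sqrt(1 - loading**2) * rng.standard_normal(n)

    afqt = indicator(afqt_loading)
    personality = indicator(b)
    iq = indicator(iq_loading)
    return r_squared(afqt, [iq, personality]) - r_squared(afqt, [iq])

# A good IQ measure vs. a noisier one (e.g. early-childhood or imputed scores).
print(f"IQ loading 0.92: {incremental_r2(0.92):.3f}")
print(f"IQ loading 0.70: {incremental_r2(0.70):.3f}")
```

In this setup personality adds more to the AFQT R-squared when the IQ measure is noisier, even though it has no causal effect on AFQT: it simply picks up g variance that the poor IQ measure misses. That is consistent with the shrinking personality increment Salkever finds once the imputed and early-childhood scores are excluded.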

All in all, Heckman’s study is remarkably sloppily done. Rather than a balanced empirical investigation, it looks like a tendentious attempt at providing support for some preconceived ideas of personality and intelligence. Based on this study, you would never guess that he got a Nobel prize for his work in econometrics.


Borghans, L., Golsteyn, B. H. H., Heckman, J., & Humphries, J. E. (2011). Identification problems in personality psychology. Personality and Individual Differences, 51, 315–320.

Salkever, D. (2015). Interpreting the NLSY79 empirical data on “IQ” and “achievement”: A comment on Borghans et al.’s “Identification problems in personality psychology.” Personality and Individual Differences, 85, 66–68.

3 thoughts on “IQ and Personality: What James Heckman Got Wrong”

  1. Excellent, understandable critique, Dalliard. One correction: you say: “the rank order of intelligence within age cohorts doesn’t crystallize until the teenage years”. In any reasonably large representative sample (~100s) there will never be a strict rank ordering of scores that remains the same on retest. The table you provide shows that there isn’t much change in reliability after age 8, which fits with other data I have seen.

    I’d like to add a few points that are no news to you but seem to be a surprise to many.

    IQ is not a measure of intelligence, but a very rough approximation of the rarity of intelligence relative to a reference population of the same age. If you want a real, equal-interval measure of intelligence comparable between different ages, then only a Rasch measure such as the W score of the Woodcock Johnson or the CSS of the Stanford-Binet fits the bill. All intelligence scores whether rarity/IQ or Rasch measures are particularly unreliable for those in the tails of the distribution and especially those with near-ceiling scores on any sub-test.

    In contrast to individual scores, even relatively poor intelligence measures on representative samples of hundreds of people are quite reliable for comparing populations. (At least if one does the analysis better than Borghans, Heckman et al.)

    • I didn’t mean to suggest that IQs ever become set in stone. The point was simply that IQ differences aren’t highly stable in childhood — although see here, based on the same Berkeley data.

  2. “Like most of Heckman’s forays into psychometrics — he has been obsessed with trying to shoot down Bell Curve -type arguments ever since the book was released”

    It’s funny how Heckman’s war on The Bell Curve only got started after the book was published, since (as I’ve been told by somebody who would know) Heckman read The Bell Curve in manuscript and the published version reflects many of his suggestions, which is why Heckman’s name is the first name mentioned in the Acknowledgments on p. xxv in The Bell Curve.
