
DIF Review and Analysis of Racial Bias in Wordsum Test using IRT and LCA

As reviewed in my previous article, the majority of studies on measurement bias, whether at the item or subtest level, agree that IQ tests are fair. Unfortunately, even among studies that use acceptable Differential Item Functioning (DIF) methods, the procedure was often sub-optimal, likely leading to more spurious DIF detections.

The advantages (and shortcomings) of each DIF method are presented. GSS data are used to compare the performance of the two best DIF methods, namely IRT and LCA, at detecting bias in the Wordsum vocabulary test between whites and blacks.
Continue reading
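The flavor of a DIF analysis can be conveyed with a toy example. The sketch below uses the Mantel-Haenszel procedure, a standard (and simpler) alternative to the IRT and LCA methods discussed in the article; all data here are simulated for illustration, not taken from the GSS:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_items = 4000, 10
group = rng.integers(0, 2, n)        # reference (0) vs. focal (1) group
theta = rng.normal(0, 1, n)          # equal latent ability in both groups

# Rasch-type items; item 0 carries uniform DIF against the focal group
b = np.linspace(-1.5, 1.5, n_items)
logits = theta[:, None] - b[None, :]
logits[:, 0] -= 0.8 * group          # item 0 is 0.8 logits harder for group 1
resp = (rng.random((n, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)
total = resp.sum(axis=1)

def mh_chi2(item):
    """Mantel-Haenszel DIF statistic: 2x2 (group x correct) tables
    stratified by total score; values above ~3.84 suggest DIF."""
    num, var = 0.0, 0.0
    for s in np.unique(total):
        m = total == s
        n0, n1 = np.sum(m & (group == 0)), np.sum(m & (group == 1))
        r = np.sum(m & (resp[:, item] == 1))    # correct answers in stratum
        t = n0 + n1
        if t < 2 or r == 0 or r == t:           # uninformative stratum
            continue
        a = np.sum(m & (group == 0) & (resp[:, item] == 1))
        num += a - r * n0 / t                   # observed minus expected
        var += n0 * n1 * r * (t - r) / (t ** 2 * (t - 1))
    return (abs(num) - 0.5) ** 2 / var

chi2_dif = mh_chi2(0)     # the item with built-in DIF
chi2_clean = mh_chi2(5)   # an unbiased item
```

Matching test takers on total score and then comparing groups within strata is the key idea shared by all DIF methods; IRT and LCA replace the observed total score with a latent variable.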

Fair and Square: A Conclusion on IQ Test Bias

This is a two-part article. In this first part, the most important studies on internal test bias with respect to racial groups at the item, subtest, and construct levels are reviewed, and the proposed causes are discussed. Generally, the most commonly used IQ tests are not biased, or are so minimally biased that the bias is of no practical consequence.

The best methodologies, with an application using the GSS Wordsum data for the Black-White comparison, will be discussed in the second part of the article: DIF Review and Analysis of Racial Bias in Wordsum Test using IRT and LCA.
Continue reading

Schooling enhances IQ, not intelligence

The idea that schooling raises intelligence still prevails. Ceci's (1991) influential review concluded that schooling has a strong impact on IQ scores, despite his closing warning that observed scores do not equate to real intelligence. Since then, many more studies have been published, including latent factor modeling and quasi-experimental designs. It remains unclear whether education truly improves general intelligence modeled as a latent factor, or whether long-lasting IQ gains involve far transfer. More likely, the answer to all of these questions is negative.

Continue reading

The Inconvenient Truth Behind the Black-White Income and Mobility Gap

Back in 2014 I wrote an extensive review of studies on income mobility over time and across countries, and discussed whether the data truly fit the Great Gatsby Curve, a term for the observed negative relationship between mobility and inequality, which many consider unfair because it implies that higher inequality causes lower mobility. However, I did not consider the Black-White difference in mobility. Because mobility and inequality are interrelated, I will cover both topics here.

Continue reading

Links for Fall ’22


  • Psychologist Bryan Pesta was fired from his tenured position at Cleveland State University. The reason given was careless handling of protected data, but, as detailed in the linked article, the firing was the culmination of many years of campaigning against Pesta by activists who were incensed by his research on racial differences. Among other things, Pesta is a coauthor of Global Ancestry and Cognitive Ability, a seminal study that found that IQ increases linearly as a function of European ancestry in black Americans, independently of skin color (a result that has been replicated in independent samples).
  • The academic journal Nature Human Behaviour announced that it seeks to suppress all research that has even the slightest whiff of HBD, possibly including even research on purely cultural differences. Of course, many other journals already follow something like this policy, only less openly (e.g., Behavioral Sciences), so this is more of a codification of the fait accompli than something new. As bad as it is, it could easily get much worse. Universities have hired thousands upon thousands of faculty and administrators wedded to the blank slate dogma and its attendant conspiracy theories about group differences, and these people may well be able to throw their weight around in the coming years, plumbing new depths of foolishness. Meanwhile, the more level-headed researchers will be easily cowed into silence, and ever wider areas of human behavior will be closed off to honest research.
  • Don’t Even Go There by James Lee. A case in point regarding the widening circle of censorship is the blocking of access to publicly funded genomic data to researchers studying “dangerous” topics such as human intelligence. As explained in the article, this suppression includes analyses focused strictly on individual differences, not just group differences. Vague insinuations that amorphous harms could occur if human differences were freely studied are enough to stop research now.
  • The ISIR Vienna affair by Noah Carl. A post mortem of the cancellation of Emil Kirkegaard at an intelligence research conference last summer. As Carl notes, the instigator, geneticist Abdel Abdellaoui, has himself been subject to attacks by activists for some of the same offenses that he took Emil to task for. Abdellaoui’s leftist detractors reject his protestations to the contrary and treat his research as a stalking horse for the racial questions that are explicit in Emil’s writings. In this (and only this) respect they are onto something, I think. Abdellaoui tries to draw a bright line between the good, moral individual differences research he is engaged in, and the bad, immoral group differences research of Emil and others. However, individual differences and group differences are made of the same stuff, and trying to stave off the latter while championing the former cannot be done with intellectual consistency. A good starting point for research on group differences is Cheverud’s conjecture which asserts that if there is a phenotypic correlation between two traits, the expectation should be that there is also a genetic correlation of a similar magnitude between them. So, if there is a phenotypic correlation between racial identity and IQ, one should bet on there being a genetic correlation, too.
  • Gender Gaps at the Academies by Card et al. This paper analyzed the publication and citation records of male and female psychologists, economists, and mathematicians elected as members of the National Academy of Sciences or the American Academy of Arts and Sciences over the last sixty years. In a sample of all authors who had published in the top dozen or so journals in each field, women were equally or (non-significantly) less likely to get an academy membership in the 1960s and 1970s conditional on publications and citations. In the 1980s and 1990s, there was gender parity or some female advantage, but in the last twenty years, a large gender gap has emerged, with women being 3–15 times more likely to become members conditional on publications and citations. While this kind of study design is vulnerable to omitted variable bias, the female advantage is now so large that it is likely that the men elected to membership in these organizations are of clearly higher caliber than the women.
  • Septimius Severus Was Not Black, Who Cares? by Razib Khan. In today’s academia, centering the black experience is seen as a moral imperative, so given that blacks have been non-players in most of the world’s history, there is now a strong incentive to transmogrify historical figures of uncertain ancestry into blacks, a practice with a long tradition in Afrocentric pseudoscholarship. Razib’s post is a nice evisceration of an article by a history professor claiming that many prominent figures in Ancient Rome were black Africans, and even that “Black Romans were central to Classical culture”.


  • On the causal interpretation of heritability from a structural causal modeling perspective by Lu & Bourrat. According to the authors, the “current consensus among philosophers of biology is that heritability analysis has minimal causal implications.” Except for rare dissenters like the great Neven Sesardic, philosophers seem to never have been able to move on from the arguments against heritability estimation that Richard Lewontin made in the 1970s. Fortunately, quantitative and behavioral geneticists have paid no attention to philosophers’ musings on the topic, and have instead soldiered on, collecting tons of new genetically informative data and developing numerous methods so as to analyze genetic causation. Lu & Bourrat’s critique of behavioral genetic decompositions of phenotypic variance is centered on gene-environment interactions and correlations. They write that “there is emerging evidence of substantial interaction in psychiatric disorders; therefore, deliberate testing of interaction hypotheses involving meta-analysis has been suggested (Moffitt et al., 2005).” That they cite a 17-year-old paper from the candidate gene era as “emerging evidence” in 2022 underlines the fact that the case for gene-environment interactions remains empirically very thin, despite its centrality to the worldview of the critics of behavioral genetics (see Border et al., 2019 regarding the fate of the research program by Moffitt et al.). As to gene-environment correlations, twin and family studies are well-equipped to deal with passive gene-environment correlations and population stratification, whereas active and reactive correlations (for definitions, see Plomin et al., 1977) are quite naturally regarded as ways for the genotype to be expressed, and thus can be subsumed under genetic variance. 
One can imagine a draconian experimental scenario where gene-environment correlations are prevented from occurring during individual development, so that, say, bookish people are not allowed to read more books, or extraverted people cannot befriend more people, or athletic people are prevented from playing sports. Behavioral genetic estimates based on individuals raised in such unnatural, circumscribed environments would hardly be more meaningful than, say, ordinary twin estimates which are based on comparatively uncontrolled environments. Lu & Bourrat also provide an extended treatment of heritability using Judea Pearl’s causal calculus, but I do not think Pearl’s machinery sheds new light on the topic. R.A. Fisher’s traditional definition of genetic causation as the average effect of gene substitution, i.e., what would happen, on average, if an individual had this rather than that variant of a gene, is in agreement with modern counterfactual frameworks like Pearl’s. Thus (additive) heritability is the standardized variance of the sum of the substitution effects of all loci.
  • Causal effects on complex traits are similar across segments of different continental ancestries within admixed individuals by Hou et al. Phased genotype data includes information on the origin of each allele, i.e., whether it came from a paternal or maternal gamete. In an admixed population like African-Americans, phased data enables the determination of whether a given variant, along with all the DNA linked to it, was inherited from a black or white ancestor. With this information it is possible to compare associations between a trait and the same SNP inherited from black and white ancestors in the same individual, provided that he or she is homozygous for the locus. It turns out that the effect sizes are correlated at 0.95 between variants inherited from white and black ancestors. If I am interpreting this correctly, this finding goes against the popular argument that the decay of GWAS effect sizes (and polygenic scores) in samples that are ancestrally distant from the GWAS discovery population is primarily due to the same SNPs failing to tag the true causal variants in different populations. Instead, the paper suggests that the main culprit for the decay effect is differences in allele frequencies between populations, the average effect of an allele being a function of both its actual effect and its frequency. The within-individual method of this study could also be used to resolve some race difference problems, such as the one I discussed in this note.
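R.A. Fisher's definition of genetic causation mentioned above, the average effect of gene substitution, can be made concrete with a toy calculation. The sketch below (all allele frequencies and effect sizes are invented for illustration) simulates unlinked loci under Hardy-Weinberg equilibrium and checks that the variance of the summed substitution effects matches the textbook formula, the sum over loci of 2p(1-p)β²:

```python
import numpy as np

rng = np.random.default_rng(1)
n_loci, n_ind = 500, 20_000
p = rng.uniform(0.05, 0.95, n_loci)     # allele frequencies (invented)
beta = rng.normal(0, 0.05, n_loci)      # average effects of gene substitution

# Genotypes: 0/1/2 copies of each effect allele (HWE, no linkage)
G = rng.binomial(2, p, size=(n_ind, n_loci))
breeding_value = G @ beta               # sum of substitution effects
phenotype = breeding_value + rng.normal(0, 1.0, n_ind)  # plus unshared environment

# Fisher: additive genetic variance = sum over loci of 2*p*(1-p)*beta^2
va_theory = np.sum(2 * p * (1 - p) * beta ** 2)
h2 = breeding_value.var() / phenotype.var()   # narrow-sense heritability
```

The (additive) heritability is then just the standardized variance of the breeding values, exactly as stated above.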


  • Theory-driven Game-based Assessment of General Cognitive Ability: Design Theory, Measurement, Prediction of Performance, and Test Fairness by Landers et al. In this study, a sample of >600 students completed both a traditional IQ test battery and a gamified test battery consisting of video games designed to assess cognitive skills. The correlation of the g factors from the two batteries was 0.97, indicating that general mental ability can be measured equally well in these two quite different modalities. This is a nice demonstration of Charles Spearman's principle of the indifference of the indicator: any task that requires cognitive effort and discrimination between correct and incorrect stimuli can be used to measure g. The study also looked into black-white gaps in the sample. The game-based assessment gap was 0.77, while the gap in the traditional test was 0.95; however, the difference between the two gaps was non-significant. I think the data from the study are openly available, so you could fit latent variable models to see if the difference has a substantive cause, or if it is just noise.
  • Are Piagetian Scales Just Intelligence Tests? by Jordan Lasker. The latent correlation between g from IQ tests and the general factor of Piagetian tests was found to be 0.85 in a meta-analysis, so the two were highly similar and might have been completely indistinguishable if better, larger test batteries had been available.
  • Stop Worrying About Multiple-Choice: Fact Knowledge Does Not Change With Response Format by Staab et al. Arguably, the multiple-choice format typically used in standardized tests is suboptimal, contributing irrelevant method variance to test scores. This study compared multiple-choice and open-ended items in tests of knowledge in natural sciences, life sciences, humanities, and social sciences. While the open-ended items were somewhat more difficult, the two types of items ranked individuals in identical order (latent correlation ~1), meaning that “method factors turned out to be irrelevant, whereas trait factors accounted for all of the individual differences.”
  • Cognitive Training: A Field in Search of a Phenomenon by Gobet & Sala. The Holy Grail of cognitive training research is far transfer, which means that the training produces generalized improvements across different abilities, not just near transfer, i.e., better performance on the trained task and closely related tasks. As detailed in the article, this goal has not panned out: regardless of the type of training, only near transfer is achieved. The implication for education is that it is best to focus on mastering specific content domains rather than trying to improve general reasoning abilities. On the other hand, this throws the importance of general intelligence into high relief: while human learning is content-specific, higher g makes it easier to gain an understanding of any specific topic, enabling both superior performance on novel tasks and a shorter path to mastery over any given knowledge domain.
  • Personality and Intelligence: A Meta-Analysis by Anglim et al. This very large-N meta-analysis found the reliability-corrected correlations of general intelligence with openness and neuroticism to be 0.20 and -0.09, respectively, while the correlations with extraversion, agreeableness, and conscientiousness were essentially zero. All kinds of personality types are found at every level of intelligence with nearly the same probability. These estimates are almost identical to those reported in a previous meta-analysis by Judge et al. (2007), so no new ground was broken in this respect, but at least some things replicate in psychology. Anglim et al. also meta-analyzed the relationship of intelligence to narrower aspects of personality, as well as sex differences in personality (for example, women score about 0.30 SDs higher in neuroticism and agreeableness).
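The "reliability-corrected" figures above come from the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two measures' reliabilities. A minimal sketch (the example reliabilities are invented, not Anglim et al.'s):

```python
def disattenuate(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: the estimated true-score
    correlation, given an observed correlation and two reliabilities."""
    return r_obs / (rel_x * rel_y) ** 0.5

# e.g., an observed r of 0.16 with reliabilities of 0.85 and 0.75
r_true = disattenuate(0.16, 0.85, 0.75)   # ≈ 0.20
```

The correction only undoes the dampening caused by measurement error; it cannot conjure a relationship where the observed correlation is zero, which is why the near-zero correlations with extraversion, agreeableness, and conscientiousness stay near zero after correction.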

Group differences

  • Measurement Invariance, Selection Invariance, and Fair Selection Revisited by Heesen & Romeijn. If two groups differ in their mean values (or variances) for some trait, an unbiased test measuring that trait will generally not be able to predict the performance of the members of those groups (in, say, a job or school) without bias with respect to group membership. This may lead to unfairness when the test is used to select individuals from different groups. I have previously discussed this phenomenon (Kelley’s paradox), and some of the related history, which goes back to the 1970s and even earlier, here. Heesen & Romeijn revisit this argument and express it in a more general form. They also note that Kelley’s paradox has been recently rediscovered outside of psychometrics, in research on algorithmic bias in machine learning. The paradox entails that when people are selected based on an unbiased test, and one group has a higher mean than the other but variances are the same, the higher-scoring group will generally have a higher true positive rate (sensitivity), and a higher positive predictive value, while the lower-scoring group will have a higher true negative rate (specificity), and a higher negative predictive value. When variances differ, too, the pattern of expected differences in error rates is more complicated, but if the higher-scoring group also has a higher variance, the differences would typically be in the same direction as when only means differ. These results apply not only to psychometric tests but to any (less than perfectly reliable) procedures for assessing and selecting individuals (or other units of analysis).
  • Role Models Revisited: HBCUs, Same-Race Teacher Effects, and Black Student Achievement by Lavar Edmonds. This study found that elementary school teachers who graduated from Historically Black Colleges and Universities (HBCUs) have a small positive effect (about 0.03 standard deviations) on the math test scores of their black students compared to non-HBCU graduates. This was found for both black and non-black HBCU teachers, but there were no effects either way on non-black students. Overall, having a black teacher was not associated with superior black student performance, because non-HBCU-trained black teachers had a significant negative effect (-0.02 SDs) on their black students. However, I do not quite buy these estimates because the study has shortcomings that are common in observational studies, especially in economics. In particular, the study has a huge sample (thousands of teachers, hundreds of thousands of students), and even the simplest models reported contain two fixed effects and ten control variables, yet the reported effects are small and in some models barely significant (despite a no doubt extensive, even if unacknowledged, specification search). With such large models it is difficult to say what is even being estimated. My ideal model would be one where the theory is so well-developed and the data so deftly collected that a bivariate regression will yield a plausible causal estimate. The further away you move from this ideal, the less credible your causal claims become, so every additional variable is potentially problematic. I do not believe that drawing DAGs showing the causal pathways, or lack thereof, between your variables is nearly as useful as Judea Pearl and his acolytes think, but if you would not even be capable of drawing one because of how complex your model is, I do not think you should be making causal claims.
  • Racial and Ethnic Differences in Homework Time among U.S. Teens by Dunatchik & Park. According to time diary data, Asian-American high-schoolers spend an average of 2 hours and 14 minutes a day on homework, while the averages for white, Hispanic, and black students are 56 minutes, 50 minutes, and 36 minutes, respectively. On the other hand, a 2015 meta-analysis found a correlation of less than 0.20 between homework time and standardized test scores, so homework is not a strong predictor of achievement even before accounting for reverse causality. Then again, looking at, say, the skyrocketing SAT performance of Asian-Americans, I believe that they do get some returns on their Stakhanovite attitude to school.
  • National Intelligence and Economic Growth: A Bayesian Update by Francis & Kirkegaard. In 2002, Lynn & Vanhanen published IQ and the Wealth of Nations where the predictive power of national IQ was shown to be superior to that of the traditional predictors of growth used by economists. The most common response to this finding has been to ignore it, while the second-most common response has been to dispute the validity of Lynn & Vanhanen’s data. The problem with the latter approach is that even with all their shortcomings, national IQ data predict GDP and all other indices of development extraordinarily well, so it is unwise to dismiss them out of hand. Moreover, new and more carefully curated test score collections, such as the World Bank’s harmonized test scores, show strong convergent validity with national IQs. An exception to the neglect of IQ in the econometric growth literature is Jones & Schneider (2006) where national IQ was put to a severe test through Bayesian model averaging. They ran thousands of growth regressions with different sets of predictors, and found that the effect of national IQ was extremely robust to differences in model specification, indicating that IQ must be treated as an independent predictor of growth, not a proxy for something else. Francis & Kirkegaard’s recent study is an update and extension of Jones & Schneider’s analysis. They use more data and new robustness checks while explicitly comparing IQ and competing predictors. Across millions of growth regression models with different sets of predictors, they find that national IQ blows all other variables out of the water (with the exception of previous GDP, which seems to predict as well as IQ but negatively, reflecting the “advantage of backwardness”). 
The authors also use three instruments–cranial capacity, ancestry-adjusted UV radiation, and 19th-century numeracy scores (age heaping)–in an attempt to rule out confounding, but whether the instruments really are exogenous in the GDP~IQ regression can be questioned. Reverse causality seems to be, to some extent, baked into national IQ estimates because of the Flynn effect. There are many questions of causality that remain unsettled, but given the incomparable predictive power of national IQs, no serious study of the wealth of nations should ignore them. The causal influence of national IQs is prima facie more credible than that of many other predictors because of the robust influence of individual IQs on socioeconomic outcomes.
  • Understanding Greater Male Variability by Inquisitive Bird. A lucidly written overview of greater male variability in cognitive tests. The post makes the interesting observation that the male-female variance ratio increases as the mean difference in favor of males in the test increases but that the male variance is larger even when the means are equal.
  • Skill deficits among foreign-educated immigrants: Evidence from the U.S. PIAAC by Jason Richwine. Using test scores from the PIAAC survey, this study found that immigrants to the U.S. score 0.82 and 0.54 SDs lower on measures of literacy and numeracy, respectively, compared to natives, after controlling for age and educational attainment. The gaps are somewhat reduced but remain significant after controlling for self-assessed English reading ability. Test score differences explain at least half of the wage penalty and “underemployment” (i.e., holding a job below one’s apparent skill level) experienced by foreign-educated immigrants. Richwine does not report country of origin effects, arguing that the sample is too small. However, the immigrant sample size is about 1500, so some analyses at the continental level would have been feasible.
  • Replotting the College Board’s 2011 Race and SES Barplot by Jordan Lasker. Mean SAT scores by race and parental income in 2011.
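Returning to Heesen & Romeijn's point above, the error-rate asymmetry produced by selection on an unbiased test (Kelley's paradox) is easy to reproduce in a simulation. The numbers below are invented for illustration: a 1 SD group difference in the true trait, equal variances, and an unbiased but imperfectly reliable test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def selection_rates(group_mean):
    true = rng.normal(group_mean, 1, n)   # true suitability
    test = true + rng.normal(0, 0.6, n)   # unbiased but imperfectly reliable test
    selected = test > 1.0                 # same cutoff for both groups
    qualified = true > 1.0                # same criterion for both groups
    tp = np.sum(selected & qualified)
    sensitivity = tp / np.sum(qualified)            # true positive rate
    ppv = tp / np.sum(selected)                     # positive predictive value
    specificity = np.sum(~selected & ~qualified) / np.sum(~qualified)
    return sensitivity, ppv, specificity

sens_hi, ppv_hi, spec_hi = selection_rates(0.5)    # higher-scoring group
sens_lo, ppv_lo, spec_lo = selection_rates(-0.5)   # lower-scoring group
```

With the same cutoff and the same criterion for everyone, the higher-scoring group ends up with higher sensitivity and positive predictive value, and the lower-scoring group with higher specificity, exactly the pattern Heesen & Romeijn describe.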

Re-analysis of Willerman’s Study: Race of Mother’s Hypothesis

It has been almost 50 years since the famous study of Willerman et al. (1974) was published. This study is regularly cited as one of the most convincing pieces of evidence against the hereditarian hypothesis, despite hereditarians' strong emphasis on the failure of experimental efforts to raise IQ (more specifically, g) and on population differences magnifying during adolescence or adulthood due to increasing heritability with age (Jensen, 1998, pp. 333-344, 359, 474; see Malloy [2013] for a case of a stability model with respect to the Black-White gap). Caution about this study is now vindicated. The data used by Willerman also reveal a pattern: the IQ deficits related to having a Black mother seem to vanish over time (Hu, 2022). Continue reading

Links for Summer ’22


  • A Note on Jöreskog’s ACDE Twin Model: A Solution, Not the Solution by Dolan et al. This critique was published on the heels of my own recent, critical post on Jöreskog’s twin model. Using Mendelian algebra and a simple one-locus model, Dolan et al. show that Jöreskog’s estimates are biased. They also note that the combination of MZ and DZ covariances that Jöreskog proposes as an estimator of additive genetic variance does not have the correct expected value. While these arguments are true and on point, in their short article Dolan et al. do not go into what I think is the main problem with Jöreskog’s model: the absurdity of the idea that minimizing the Euclidean norm would produce meaningful behavioral genetic estimates. They note that sometimes Jöreskog’s ACDE estimates may be less biased than ACE and ADE estimates, but that would be pure happenstance because the data generating mechanism suggested by Jöreskog’s model is never realistic. In contrast, the ACE model (or its submodel, AE) is often a realistic approximation of the true data generating mechanism, and even if this is not the case, the amount of bias is usually tolerably low, while the biases of Jöreskog’s estimates can be severe in typical datasets (e.g., if AE is the true model).
  • Polygenic Health Index, General Health, and Disease Risk by Widen et al. This is a paper from people associated with Steve Hsu’s eugenics biotechnology startup. With UK Biobank data, they build an index from polygenic risk scores for twenty diseases (e.g., diabetes, heart disease, schizophrenia), and show that lower values on this index are associated with a lower risk for almost all the diseases included and a higher risk for none. The index also predicts a longer lifespan, and works, with lower accuracy, within families (between siblings) as well. Thus the index is a candidate for use in embryo selection. A common anti-eugenic argument is that by artificially selecting for something positive one may inadvertently select for something negative. The paper shows that in fact one can simultaneously decrease the risk of many diseases without increasing that of any of them. Generally, the argument about accidental adverse selection rests on the tacit assumption that the status quo where eugenic and dysgenic concerns are ignored is somehow natural, neutral, and harmless. However, every society selects for something, and it seems unlikely that, say, embryo selection based on polygenic index scores would have worse consequences than the status quo. For example, selection against educational attainment and for increased criminal offending happens in some contemporary societies, but that is hardly any kind of inevitable state of affairs that should not be tampered with.

Cognitive abilities

  • Brain size and intelligence: 2022 by Emil Kirkegaard. A good review of the state of the brain size and IQ literature. It seems that the true correlation is around 0.30.
  • General or specific abilities? Evidence from 33 countries participating in the PISA assessments by Pokropek et al. Arthur Jensen coined the term specificity doctrine to refer to the notion that cognitive ability tests derive their meaning and validity from the manifest surface content of the tests (e.g., a vocabulary test must solely or primarily measure the size of one’s vocabulary or, perhaps, verbal ability). He contrasted this view with the latent variable perspective, according to which the specific content of tests is not that relevant because all tests are measures of a small number of latent abilities, most importantly the g factor, which can be assessed with any kind of cognitive tests (see Jensen, 1984). While the specificity doctrine has very little to recommend it (see also e.g., Canivez, 2013), it remains a highly popular approach to interpreting test scores. For example, in research on the PISA student achievement tests, the focus is almost always on specific tests or skills like math and reading rather than the common variance underlying the different tests. Pokropek et al. analyze the PISA tests and show that in all 33 OECD countries a g factor model fits the data much better than non-g models that have been proposed in the literature. The PISA items are close to being congeneric (i.e., with a single common factor), with the specific factors correlating with each other at close to 0.90, on average. The amount of reliable non-g variance is so low that subtests cannot be treated as measures of specific skill domains like math, reading, or science. The correct way to interpret PISA tests is at the general factor level, which is where the reliability and predictive validity of the tests is concentrated. The relevance, if any, of specific abilities is in their possible incremental validity over the g factor.
  • Training working memory for two years—No evidence of transfer to intelligence by Watrin et al. Another study showing that training cognitive skills improves the trained skills but has no effect on other skills or intelligence in general. This is another datum that supports the existence of a reflective, causal general intelligence factor, while it contradicts the idea that general intelligence is an epiphenomenon that arises from a sampling of specific abilities.

Group differences

  • How useful are national IQs? by Noah Carl. A nice defense of research on national IQs. Interesting point: “If the measured IQ in Sub-Saharan Africa is 80, this would mean the massive difference in environment between Sub-Saharan Africa and the US reduces IQ by only 5 points, yet the comparatively small difference in environment between black and white Americans somehow reduces it by 15 points.”
  • Analyzing racial disparities in socioeconomic outcomes in three NCES datasets by Jay M. A lucidly written analysis of the main drivers of racial/ethnic disparities in educational attainment, occupational prestige, and income in America, based on several large longitudinal datasets. Some stylized facts from the many models reported: Outcome gaps strongly favor whites over blacks in unconditional analyses, but these gaps are eliminated or reversed after controlling for just high school test scores and grades; Asians outachieve whites to a similar degree regardless of whether analyses are adjusted for test scores and grades; Hispanics and Native Americans are as disadvantaged as blacks in unconditional analyses, and while controlling for test scores and grades typically makes them statistically indistinguishable from whites, the effect of these covariates is clearly weaker for them than for blacks; the effect of cognitive skills is larger for educational attainment and occupational prestige than for income (although this may be partly because the analysis platform used does not permit the more appropriate log-normal functional form).
  • Examination of differential effects of cognitive abilities on reading and mathematics achievement across race and ethnicity: Evidence with the WJ IV by Hajovsky & Chesnut. This study finds that scalar invariance with respect to white, black, Hispanic, and Asian Americans holds for the Woodcock-Johnson IV IQ test. For the most part, the test also predicts achievement test scores similarly across races. The achievement tests were also invariant with respect to race/ethnicity. While these results are plausible, there are several aspects of this study that make it relatively uninformative. Firstly, they fit a model with seven first-order factors, which is the test publisher’s preferred model, but, as usual with these things, it is an overfactored model and the fit is pretty marginal. Secondly, they don’t test for strict invariance. Thirdly, the white sample is much larger than the non-white samples, which means that the fit in whites contributes disproportionately to the invariance tests. Fourthly and most damagingly, they adjust all the test scores for parental education, which removes unknown amounts of genetic and environmental variance from the scores. The results reported therefore concern a poorly fitting model based on test scores part of whose variance has been removed in a way that may in itself be racially non-invariant. I would like to see a methodologically more thoughtful analysis of this dataset.
  • Race and the Mismeasure of School Quality by Angrist et al. Students in schools with larger white and Asian student shares have superior academic outcomes. This instrumental variable analysis suggests that this is not because such schools offer superior instruction but simply because of selection effects, so that if students were randomized to attend schools with different racial compositions, they would be expected to achieve at similar levels. This seems plausible enough, and Angrist et al. suggest that this information should “increase the demand for schools with lower white enrollment.” That does not seem plausible to me because, as they also note, “school choice may respond more to peer characteristics than to value-added.” A “good school” is one primarily because of the quality of its students, not the quality of its teaching.

Classical Twin Data and the ACDE Model

Classical twin data consist of phenotypic measurements on monozygotic (MZ) and dizygotic (DZ) twin pairs who were raised together. To derive estimates of behavioral genetic parameters (e.g., heritability) from such data, the ACDE model is most often used. In principle, the model provides estimates of the effects of additive genes (A), the shared environment (C), non-additive genes (D), and the unshared environment (E).

However, if only classical twin data are available, there is not enough information to estimate all four parameters, that is, the system of equations is underdetermined or underidentified. To enable parameters to be estimated, it is customary to fix either D or C to zero, leading to the ACE and ADE models which are identified. The problem with this approach is that if the influence of the omitted parameter is not actually zero, the estimates will be biased. Additional data on other types of family members, such as adoptees, would be needed for the full model but such data are usually not readily available.
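The underidentification can be seen directly from the expected moments: a classical twin design supplies only three distinct statistics (the phenotypic variance, the MZ covariance, and the DZ covariance), while the ACDE model has four free parameters. A minimal numerical sketch of this, with made-up illustrative twin statistics at the end:

```python
import numpy as np

# Columns: standardized variance components A, C, D, E.
# Rows: the three distinct statistics classical twin data provide.
acde = np.array([
    [1.0, 1.0, 1.00, 1.0],  # Var(P)  = A   + C + D   + E
    [1.0, 1.0, 1.00, 0.0],  # Cov(MZ) = A   + C + D
    [0.5, 1.0, 0.25, 0.0],  # Cov(DZ) = A/2 + C + D/4
])
print(np.linalg.matrix_rank(acde))  # 3 equations for 4 unknowns: underidentified

# Fixing D = 0 gives the ACE model, a square, full-rank (identified) system.
ace = acde[:, [0, 1, 3]]
moments = np.array([1.0, 0.7, 0.45])  # made-up standardized twin statistics
a2, c2, e2 = np.linalg.solve(ace, moments)
print(np.round([a2, c2, e2], 3))  # A, C, E estimates: ~0.5, 0.2, 0.3
```

The ADE model works the same way, dropping the C column instead; either way, one parameter must be fixed a priori, which is exactly the source of bias discussed above.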

Against this backdrop, Jöreskog (2021a) proposed that the full ACDE model can be estimated with only classical twin data. (A version of the ACDE model for categorical data was developed in Jöreskog [2021b], while Jöreskog [2021a] concerns only continuous data. I will discuss only the latter, but the same arguments apply to the categorical case.) This is a startling claim because the ACDE model has long been regarded as obviously impossible to estimate as there is simply not enough information in the twin variances and covariances for the full model (MZ and DZ variance-covariance matrices are sufficient statistics for the typical twin model, i.e., no other aspect of the sample data provides additional information on the parameter values). Nevertheless, Jöreskog claimed that it can be done, demonstrating it in several examples. Karl Jöreskog is not a behavioral geneticist but he is a highly influential statistician whose work on structural equation models has had a major influence on twin research. Therefore, even though his claims sounded implausible, they seemed worth investigating.

After studying Jöreskog’s model in detail I conclude that it does not deliver what it promises. It does generate a set of estimates for A, C, D, and E, but there is no reason to believe that they reflect the true population parameters. As nice as it would be to estimate the ACDE model with ordinary twin data, it just cannot be done.

This post has the following structure. I will start with a brief overview of twin models, describing some of the ways in which their parameters can be estimated. Then I will show how Jöreskog proposes to solve the ACDE identification problem, and where he goes wrong. I will end with a discussion of why I think twin models are useful despite their limitations, and why they have continuing relevance in the genomic era. The Appendix contains additional analyses related to the ACDE model.

Continue reading

Links for May ’22

  • Investigating bias in conventional twin study estimates of genetic and environmental influence for educational attainment by Wolfram & Morris. The shared environment component in twin studies is an aggregate of effects not only of the shared environment proper but also anything else that a twin pair shares but other individuals do not. The component captures the influence of assortative mating, age effects, and cohort effects, for example. This is another twin-family study that finds that the effect of heredity on educational attainment may have been underestimated and that of the shared environment overestimated in classical twin studies. The twin-specific environment appears to be more important than the family environment per se.
  • What the Students for Fair Admissions Cases Reveal About Racial Preferences by Arcidiacono et al. As a result of court cases regarding admissions to Harvard and UNC-Chapel Hill, lots of admissions data from those schools have been made public. Peter Arcidiacono has been an expert witness in these cases and this is another of his analyses of the data. There is nothing too surprising here. For example, black applicants to Harvard whose SAT scores and high school GPAs are at around the 30th to 40th percentile of the Harvard applicant pool distribution have the same admit rate as white and Asian applicants above the 90th percentile.
  • Genetics of cognitive performance, education and learning: from research to policy? by Peter Visscher and A Very Bad Review by Nick Patterson. There is nothing particularly insightful or original in these articles, but they are notable in that two of the heavyweights of today’s genetics use them to push back against the recent anti-behavioral genetics discourse. Academia as a whole has moved leftward since the days of The Bell Curve and Arthur Jensen, but, on the other hand, behavioral genetics has moved closer to the center of genetics. Top geneticists these days cannot dismiss behavioral genetics as easily as in the days of Richard Lewontin and co. because behavioral genetics is now theoretically and methodologically tightly integrated with the rest of genetics.
  • Air Pollution and Student Performance in the U.S. by Gilraine & Zheng. Using instrumental variables related to variations in pollution levels coming from nearby power plants to control for endogeneity, this study finds some effects of air pollution on test scores. After a brief skim of the paper, the results seem plausible enough to me, mainly because they are smaller than what some other studies have claimed.
  • The Parent Trap–Review of Hilger by Alex Tabarrok. A smart response to a recent book by Nate Hilger making implausible claims about the effects of parenting on children’s outcomes and advocating for a radical enlargement of state involvement in the raising of children. A basic problem in today’s social policy thinking is that it is only concerned with what happens post conception. Even a modest shift in the human capital characteristics of parents would probably do a lot more good than anything Hilger proposes.

Links for April ’22

IQ and psychometrics

  • On the Continued Misinterpretation of Stereotype Threat as Accounting for Black-White Differences on Cognitive Tests by Tomeh & Sackett. A common misconception about stereotype threat, and a major reason for the popularity of the idea, is that in the absence of threat in the testing situation, the black-white IQ gap is eliminated. This is of course not the case but rather the experimental activation of stereotypes has (sometimes) been found to make the black-white gap larger than it normally is. In an analysis of early writings on stereotype threat, Sackett et al. (2004) reported that this misinterpretation was found in the majority of journal articles, textbooks, and popular press articles discussing the effect. In the new article, Tomeh and Sackett find that more recent textbooks and journal articles are still about equally likely to misinterpret stereotype threat in this way as to describe it correctly. I had hoped that the large multi-lab study of the effect would have put the whole idea to bed by now, but that study has unfortunately been delayed.
  • Invariance: What Does Measurement Invariance Allow us to Claim? by John Protzko. In this study people were randomized to complete either a scale aiming to measure “search for meaning in life”, or an altered nonsense version of the same scale where the words “meaning” and “purpose” had been replaced with the word “gavagai”. The respondents indicated their level of agreement or disagreement with statements such as “I am searching for meaning/gavagai in my life”. Both groups also completed an unaltered “free will” scale, and confirmatory factor models where a single factor underlay the “meaning/gavagai” items while another factor underlay the “free will” items were estimated. The two groups showed not only configural but also metric and scalar invariance for these factors. Given the usual interpretation of factorial invariance in psychometrics, this would suggest that the mean difference observed between the two groups on the “meaning/gavagai” scale reflects a mean difference on a particular latent construct. The data used were made available online, and I was able to replicate the finding of configural, metric, and scalar invariance, given the ΔCFI/RMSEA criteria (strict invariance was not supported). The paradox appears to stem from the fact that individual differences on the “meaning in life” scale mostly reflect the wording and format of the items as well as response styles rather than tapping into a specific latent attitude which may not even exist, given the vagueness of the “meaning in life” scale. I found that I could move from scalar invariance to a more constrained model where all of the “meaning/gavagai” items had the same values for loadings and intercepts without worsening the model fit. So it seems that all the items were measuring the same thing (or things) but what that is is not apparent from a surface analysis of the items.
Jordan Lasker has written a long response to Protzko, taking issue with the idea that two scales can have the same meaning without strict invariance as well as with the specific fit indices used. While I agree that strict invariance should always be pursued, Protzko’s discovery of scalar invariance using the conventional fit criteria is nevertheless interesting and requires an explanation. I think Lasker also makes a mistake in his analysis by setting the variances of the “meaning in life/gavagai” factors both to 1 even though this is not a constraint required for any level of factorial invariance. The extraneous constraint distorts his loadings estimates.
  • Effort impacts IQ test scores in a minor way: A multi-study investigation with healthy adult volunteers by Bates & Gignac. In three experiments (total N = 1201), adult participants first took a short spatial ability test (like this one) and were randomly assigned either to a treatment group or to a control group. Both groups then completed another version of the same test, with the treatment group participants promised a monetary reward if they improved their score by at least 10%. The effect of the incentives on test scores was small, d = 0.166, corresponding to 2.5 points on a standard IQ scale. This suggests that the effect size of d = 0.64 (or 9.6 points) reported in the meta-analysis by Duckworth et al. is strongly upwardly biased, as has been suspected. A limitation of the study is that the incentives were small, £10 at most. However, the participants were recruited through a crowdsourcing website and paid £1.20 for their participation (excluding the incentive bonuses), so it is possible that the rewards were substantial to them. Nevertheless, I would have liked to see if a genuinely large reward had a larger effect. Bates & Gignac also conducted a series of big observational studies (total N = 3007) where the correlation between test performance and a self-report measure of test motivation was 0.28. However, this correlation is ambiguous because self-reported motivation may be related to how easy or hard the respondent finds the test.
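To make concrete what the scalar invariance discussed above commits you to, here is a toy numerical sketch, not Protzko’s actual model; all loadings, intercepts, and factor means are invented for illustration. When intercepts and loadings are equal across groups, every group difference in observed item means must be proportional to the item loadings, i.e., attributable to a single latent-mean shift:

```python
import numpy as np

# Hypothetical common measurement parameters, shared across groups
# as scalar invariance requires; all values are made up.
lam = np.array([0.8, 0.6, 0.7])  # factor loadings
nu  = np.array([3.0, 2.5, 3.2])  # item intercepts

factor_mean_g1, factor_mean_g2 = 0.0, -0.5  # latent means per group
item_means_g1 = nu + lam * factor_mean_g1
item_means_g2 = nu + lam * factor_mean_g2

# Under scalar invariance each item-mean gap equals loading * latent gap,
# so the ratios below are all identical: one latent shift explains them.
print((item_means_g1 - item_means_g2) / lam)  # [0.5 0.5 0.5]
```

This is why finding scalar invariance for a “gavagai” scale is so striking: the mean difference pattern behaves as if it reflected a shift on one latent variable, even when no sensible construct underlies the items.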


  • The Coin Flip by Spotted Toad. This is an illuminating commentary on the Tennessee Pre-K study (on which I commented here) and the difficulty of causal inference in long-term experiments.
  • Do Meta-Analyses Oversell the Longer-Term Effects of Programs? Part 1 & Part 2 by Bailey & Weiss. This analysis found that in a meta-analytic sample of postsecondary education RCTs seeking to improve student outcomes, trials that reported larger initial effects were more likely to have long-term follow-up data collected and published. While this could be innocuous, with more effective interventions being selected for further study, it could also simply mean that studies more biased to the positive direction by sampling error were selected. So when you see a study touting the long-term benefits of some educational intervention, keep in mind that the sample may have been followed up only because the initial results were more promising than in other samples subjected to the same or similar interventions.
  • An Anatomy of the Intergenerational Correlation of Educational Attainment – Learning from the Educational Attainments of Norwegian Twins and their Children by Baier et al. Using Norwegian register data on the educational attainment of twins and their children, this study finds that the intergenerational correlation for education is entirely genetically mediated in Norway. The heritability of education was about 60% in both parents and children, while the shared environmental variance was 16% in parents and only 2% in children. This indicates that the shared environment is much less important for educational attainment in Norway than elsewhere (cf. Silventoinen et al., 2020), although this is partly a function of how assortative mating is modeled.


© 2023 Human Varieties
