Using Surnames to Assess Ethnic Aptitude

Attempts to assess population aptitude from elite achievement go back to at least Galton. In Hereditary Genius, Galton used an estimate of the number of eminent persons produced by various ethnic and racial groups to quantify the differences between the means of these groups. Since his time, variants and refinements of this genre of analysis have become frequent. In “The Racial Origin of Successful Americans” (1914), Frederick Woods attempted to estimate ethnic achievement by counting and classifying the ethnic surnames in Marquis’ “Who’s Who” list. Lauren Ashe (1915) improved on the strategy by determining the representation of ethnic names in “Who’s Who” relative to that found in various U.S. city populations. In the 1960s, Nathaniel Weyl developed a variant of the “Who’s Who” surname method, one which relied on rare surnames, and in the 1980s he applied the method to National Merit Scholarship (NMS) lists (1), which record those high school seniors who obtained the top scores on College Board’s Preliminary SAT/National Merit Scholarship Qualifying Test (PSAT/NMSQT).

A couple of years back, Ron Unz resurrected Weyl’s NMS rare surname method and employed it to investigate the relative merit of Asians and Jews. His analysis formed the empirical backbone of his well-cited article, The Myth of American Meritocracy. Weyl’s method involves comparing the frequency of distinctive ethnic surnames in the national population to their frequency in National Merit semifinalist samples. This allows for the calculation of representation rates. These rates can then be transformed into standardized differences, granting certain assumptions such as equal variances across groups and normality of trait distributions.

Recently, Peter Frost called for more investigations of the aptitude of Nigerian and African Black immigrants. This request was sparked by Chanda Chisala’s articles on the apparent superior performance of Nigerians in the UK. Unfortunately, Weyl’s precise method cannot reliably be applied in the case of African immigrants to the U.S. for two reasons. First, the populations are very small, with the maximum state population being around 100,000. In this situation, distinctive ethnic surname comparisons become unreliable. Second, and importantly, the African Black population has more than doubled in the last decade, with the national composition altering substantially, yet U.S. census surname data, used to determine the rarity of a surname in a population at a given time, is only available for the year 2000 and before. Using 2000 census surnames for an analysis of 2010-2015 merit semifinalist names would produce inaccurate estimates given the particular groups in question. This is unfortunate as my colleague, Emil Kirkegaard, wrote a neat R script which allows for the automated copying of names and other information from National Merit/Achievement PDF files to Excel files (2). With a few more lines of syntax, one could just automate surname searches and the computation of representation rates and standardized differences.

A variant of the Weyl/Unz method can yet be employed. About 1.5 million high school students, mostly juniors, enroll in the National Merit Scholarship program each year. Of these, 16,000, or about 1%, are selected as semifinalists on the basis of PSAT test scores alone. Self-identified African American test takers have the option of additionally competing for National Achievement scholarships. The National Achievement Scholarship (NAS) program was created in the ‘60s to provide scholarships to elite African Americans. The NAS selection index is set at about one standard deviation below the National Merit one. Of the 160,000 or so African Americans who enter the NAS competition, 1,600, or about 1%, are selected as semifinalists (National Merit Scholarship Corporation, personal communications). As with the NMS program, every year lists of NAS semifinalists are provided to the media and some of these make their way onto the internet. I was able to locate 15 such lists of semifinalists for 14 states. The semifinalist years ranged from 2010 to 2015 (3). This provided names for 975 semifinalists.

975 was conveniently small enough to allow for a manual coding of names by regions of origin, in this case: Nigeria, other Sub-Saharan Africa, Sub-Saharan Africa total, other, and legacy African American. (When coding, I generally did not discriminate between “other” and “legacy African Americans” except when names were exotic (4).) Not being familiar with many of the surnames, I used the following sites to check for regions of origin:

In many instances, I also checked Facebook, Twitter, and LinkedIn profiles along with news reports to verify both biological race and recent area of origin. In some instances, specifically with regard to individuals from Texas and California, I looked up birth records. This allowed semifinalists to be coded as U.S. born or “not recorded” and so, presumably, not U.S. born. Overall the process was fairly labor intensive. Yet this was requisite for fairly accurate classifications, which were, in turn, necessary given the limited amount of data on hand. Had I many more lists, I would have just automated the process, and I will if I can find them. To note, my coding method was probably biased against identifying African Blacks, since while I thoroughly investigated those with African sounding names to see if they were truly of recent African extraction, I breezed over individuals who had both non-African sounding first and last names.

Once coded, semifinalist representation rates were computed. Next, I determined population sizes for each state using three years of census data. If the semifinalist list was from 2010, I used population estimates from 2009, 2010, and 2011. Census data was only available for up to 2013, so 2011 to 2013 values were used for 2013 to 2015 semifinalist years. For African Americans as a whole I used “Black in combination or Alone” population estimates. For Nigerian and Sub-Saharan Africans, I used “Total Ancestry estimates”. The total ancestry values looked as follows:


On the reasonable assumption that the “African” category was chosen by legacy African Americans who knew not their country of origin, I subtracted this from the Sub-Saharan African total to get a “true” “Sub-Saharan African” population estimate. To validate the numbers, I compared the national African Black population estimate computed using my method with that determined by Pew Research (2015).

My method produced national estimates slightly larger (by 1.4%) than those reported by Pew for the year 2013. Pew’s estimate of 1.36 million, though, was for African born Black individuals, a number which presumably excludes second generation individuals of recent Sub-Saharan origin. It is possible then that my “Sub-Saharan African” estimates under-count. Indeed, for Texas at least, the cross-temporal correlation between the size of the “African” population and that of my computed true “Sub-Saharan African” one was stronger (r = 0.93) than that between the total Black population and the “African” one (r = 0.62). This pattern of correlations suggests that some of the self-identified “Africans” may in fact have been recent Sub-Saharan ones. To investigate further, a more involved analysis of the census data, one which does not seem possible with the publicly accessible files, would be required. A related concern is that the census itself, despite claims by immigrant advocates, under-counts African and other immigrant populations. Still another is that the proportions of the relevant age groups for all African Americans and for recent Sub-Saharan Africans are markedly different. Regarding the latter point, there was some difference, but it did not seem to be profound. For example, 9-10% of African born individuals were 18-24 years old, while around 11-12% of all African Americans were. I did not attempt to make corrections since the necessary values were not available. Overall, it is difficult to determine the direction and degree of overall bias introduced by my particular method of population estimation. As will be seen, however, while the above is a concern, under-counting cannot possibly explain the vast over-representation of Black Africans among NAS semifinalists. Not enough African Blacks could be eluding my estimates for this to be the case (5).

Dividing the percent of African Black semifinalists by the percent of African Blacks in the state population generated representation rates. These are shown below.

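| State | Year | All African American # Semi. | All African American Pop. | Sub-Saharan African (recent) # Semi. | Sub-Saharan African (recent) Pop. | Sub-Saharan African (recent) Rep. | Nigerian (recent) # Semi. | Nigerian (recent) Pop. | Nigerian (recent) Rep. |
|---|---|---|---|---|---|---|---|---|---|
| California | 2012 | 109 | 2691405 | 20 | 101901 | 4.85x | 8.5 | 27529 | 7.6x |
| Texas | 2012 | 123 | 3161235 | 33 | 105452 | 8.11x | 25 | 50145 | 12.92x |
| Maryland | 2015 | 67 | 1825311 | 17 | 100733 | 4.60x | 11 | 34069 | 8.80x |
| Virginia, Fairfax | 2015 | 28 | 119615 | 11 | 23715 | 1.98x | 1 | 1242 | 3.44x |
| New York | 2013 | 126 | 3315166 | 16 | 114342 | 3.49x | 7 | 26689 | 5.96x |
| Georgia | 2014/15 | 299 | 3172898 | 42 | 69682 | 6.34x | 24 | 18582 | 13.71x |
| Minnesota | 2012 | 9 | 348931 | 1 | 81152 | -2.09x | 1 | 6030 | 6.43x |
| D.C. | 2012 | 12 | 320854 | 2 | 12064 | 4.82x | 2 | 2834 | 20.52x |
| Arizona | 2013 | 10 | 335579 | 2 | 15191 | 1.84x | 2 | 2215 | NA |
| Arkansas | 2015 | 13 | 487240 | 0 | 1078 | NA | 0 | 502 | NA |
| Louisiana | 2015 | 59 | 1523472 | 5 | 3962 | 7.46x | 4 | 1570 | 15.07x |
| Massachusetts | 2010 | 25 | 528364 | 6 | 90160 | 1.41x | 2 | 6357 | 6.65x |
| South Carolina | 2010 | 48 | 1335407 | 3 | 3949 | 21.13x | 2 | 665 | 83.63x |
| Michigan | 2010 | 47 | 1498512 | 7 | 15509 | 14.09x | 5 | 4951 | 31.53x |
| Aggregate | | 975 | 20663988 | 165 | 738890 | 4.73x | 94.5 | 183381 | 10.92x |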
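The ratio-of-shares calculation behind the table above can be sketched in a few lines of Python. This is only an illustration: the function name `representation_rate` is made up here, and the inputs are the California and Aggregate rows as displayed in the table.

```python
def representation_rate(group_semis, all_semis, group_pop, all_pop):
    """Group's share of semifinalists divided by its share of the population."""
    semi_share = group_semis / all_semis
    pop_share = group_pop / all_pop
    return semi_share / pop_share


# California 2012: Sub-Saharan African (recent) vs. all African Americans
california_ssa = representation_rate(20, 109, 101901, 2691405)

# Aggregate row: Sub-Saharan African (recent) and Nigerian (recent)
aggregate_ssa = representation_rate(165, 975, 738890, 20663988)
aggregate_nigerian = representation_rate(94.5, 975, 183381, 20663988)

print(round(california_ssa, 2), round(aggregate_ssa, 2), round(aggregate_nigerian, 2))
# prints: 4.85 4.73 10.92
```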

Across all samples, African Blacks were nearly 5 times over-represented, and Nigerians almost 11 times so. From here, computing a d-value was relatively straightforward.

The semifinalists represent the upper 1% of Black NAS applicants. If African Blacks are, for example, 5x over-represented, we can simply multiply this amount by our 1% and input the resulting 5% into Excel’s inverse normal cumulative distribution function. The gap between the resulting cutoff z-score and the z-score for the upper 1% is the standardized difference. This step thus transforms representation rates into standardized differences which can be treated as deviations in PSAT scores. To convert these into deviation scores with respect to the White national mean, which is what I was interested in, a few more computations are needed. First, 2011 and 2013 NAEP MAIN math and reading scores were used to compute composite total Black/White and state White/White national differences. Composite scores, not averages, were used since these are more comparable to full scale IQ ones, which are also composites, not averages (6). The NAEP MAIN explorer does not provide sample sizes, but it does provide student participation percents. These were used to compute pooled standard deviations which were utilized to derive standardized score differences or d-values. The end result allowed for the computation of an African Black/national White d-value for each state, as: African Black/White national d = (State Black all/White d) - (African Black/Black all d) + (State White/National White d). For example, in D.C. Blacks in general performed 1.78 SD worse than Whites, African Blacks performed 0.66 SD better than all Blacks, and Whites performed 0.76 SD better than Whites nationally (so the last term enters as -0.76). Thus, we had: African Black/White national d = 1.78 - 0.66 + (-0.76) = 0.36. The following classes of d-values were computed for each state: African Black/White National, Nigerian/White National, Non-Nigerian African Black/White National, Nigerian weighted African Black/White National.
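As a rough illustration, the two steps described above (representation rate to standardized difference, then the chain to a White national d) can be sketched in Python with the standard library's `NormalDist`. The function names are invented for this sketch, and the conversion assumes normal distributions with equal standard deviations, as the text does. The D.C. figures (4.82x, 1.78, -0.76) are taken from the post.

```python
from statistics import NormalDist


def rep_to_d(rep_rate, top_share=0.01):
    """Standardized advantage of a subgroup over its reference group.

    Assumes both groups are normally distributed with equal SDs and that the
    cutoff selects the top `top_share` of the reference group.
    """
    subgroup_share = top_share * rep_rate
    if not 0 < subgroup_share < 1:
        raise ValueError("top_share * rep_rate must lie strictly between 0 and 1")
    std_normal = NormalDist()
    z_reference = std_normal.inv_cdf(1 - top_share)
    z_subgroup = std_normal.inv_cdf(1 - subgroup_share)
    return z_reference - z_subgroup


def african_white_national_d(state_black_white_d, african_black_advantage_d,
                             state_white_national_white_d):
    """Chain the three differences as in the formula given in the text."""
    return (state_black_white_d
            - african_black_advantage_d
            + state_white_national_white_d)


# D.C.: 4.82x over-representation among the top 1% of NAS entrants
dc_advantage = rep_to_d(4.82)
dc_gap = african_white_national_d(1.78, dc_advantage, -0.76)
print(round(dc_advantage, 2), round(dc_gap, 2))
```

With the D.C. inputs this gives roughly 0.66 and 0.36, matching the worked example in the text.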

Computing over-representation rates makes no assumptions. But when one tries to infer population means from these rates, one must assume normality and equal variances (and thus standard deviations) across groups. Given the Nigerian performance, these assumptions are untenable when it comes to African Blacks as a whole. Nigerians are twice as over-represented as non-Nigerian African Blacks and the general advantage showed up in every state except Arizona and Arkansas, states with negligible Nigerian populations. Insofar as our interest is in the performance of U.S. African Blacks as a whole, this Nigerian exceptionalism needs to be adjusted for. To this end, a Nigerian and non-Nigerian African Black population weighted score was computed.
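The post does not spell out the weighting formula. One plausible reading, offered here as an assumption rather than as the author's exact procedure, is a simple population-share weighted mean of the Nigerian and non-Nigerian d-values. The Python sketch below applies that reading to the Texas figures reported in this post (Nigerian d of -0.25, non-Nigerian d of 0.40, populations 50145 and 105452); the function name is hypothetical.

```python
def population_weighted_d(nigerian_d, non_nigerian_d, nigerian_pop, african_pop):
    """Population-share weighted mean of two subgroup d-values (assumed formula)."""
    nigerian_share = nigerian_pop / african_pop
    return nigerian_share * nigerian_d + (1 - nigerian_share) * non_nigerian_d


# Texas inputs as reported in the tables of this post
texas_weighted = population_weighted_d(-0.25, 0.40, 50145, 105452)
print(round(texas_weighted, 2))
```

Under this assumption the Texas value comes out near 0.09, in line with the Nigerian weighted column reported for that state.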

There are several ways of summarizing the results. One is to aggregate all data across all states, another is to use the mean or median state d-values, another is to weight by some transformation of the number of state semifinalists, and another is to weight by African immigrant populations and/or number of semifinalists. The results for each method are presented below. The most statistically sound ones, in my opinion, are the aggregate and the population weighted ones, which give d-values of 0.43 and 0.39, respectively. These differences, of course, are scaled relative to the 2011/2013 US Black/White NAEP MAIN composite one, which came out to 0.98 SD on the national level and 1 SD averaged across the states in question.

| State | African / White Nat. d | Nigerian / White Nat. d | Non-Nigerian African / White Nat. d | Nigerian Weight, African / White Nat. d | State Black (all) / White Nat. d |
|---|---|---|---|---|---|
| California | 0.51 | 0.28 | 0.63 | 0.53 | 1.18 |
| Texas | 0.02 | -0.25 | 0.40 | 0.09 | 0.80 |
| Maryland | 0.12 | -0.21 | 0.40 | 0.19 | 0.77 |
| Virginia, Fairfax | 0.57 | 0.33 | 0.63 | 0.62 | 0.84 |
| New York | 0.46 | 0.20 | 0.58 | 0.48 | 0.97 |
| Georgia | 0.21 | -0.22 | 0.47 | 0.28 | 1.00 |
| Minnesota | 1.34 | 0.24 | 1.43 | 1.34 | 1.05 |
| D.C. | 0.36 | -0.48 | 0.61 | 0.36 | 1.02 |
| Arizona | 0.70 | 1.30 | 0.60 | 0.70 | 0.94 |
| Arkansas | | | | | 1.24 |
| Louisiana | 0.29 | -0.12 | 0.81 | 0.44 | 1.18 |
| Massachusetts | 0.54 | -0.15 | 0.67 | 0.61 | 0.68 |
| South Carolina | -0.47 | -2.26 | 0.10 | -0.30 | 1.06 |
| Michigan | 0.07 | -0.53 | 0.56 | 0.21 | 1.32 |
| Aggregate | 0.35 | -0.09 | 0.61 | 0.43 | 1.00 |
| Mean | 0.36 | -0.14 | 0.61 | 0.43 | |
| Median | 0.36 | -0.15 | 0.60 | 0.44 | |
| SQRT(semi) | 0.28 | -0.19 | 0.55 | 0.35 | |
| Weight SQRT(semi) | 0.33 | -0.03 | 0.55 | 0.39 | |
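The simpler summary rows can be sketched in Python from the state rows above. The d-values are the African / White Nat. column, with Arkansas omitted since it has no estimate. The SQRT(semi) weighting is assumed here to use the square root of each state's total NAS semifinalist count from the first table; that is an inference, not something the post states.

```python
from math import sqrt
from statistics import mean, median

# state: (African / White Nat. d, total NAS semifinalists in the state sample)
state_rows = {
    "California": (0.51, 109),
    "Texas": (0.02, 123),
    "Maryland": (0.12, 67),
    "Virginia, Fairfax": (0.57, 28),
    "New York": (0.46, 126),
    "Georgia": (0.21, 299),
    "Minnesota": (1.34, 9),
    "D.C.": (0.36, 12),
    "Arizona": (0.70, 10),
    "Louisiana": (0.29, 59),
    "Massachusetts": (0.54, 25),
    "South Carolina": (-0.47, 48),
    "Michigan": (0.07, 47),
}

d_values = [d for d, _ in state_rows.values()]
weights = [sqrt(n) for _, n in state_rows.values()]  # assumed weighting scheme

simple_mean = mean(d_values)
simple_median = median(d_values)
sqrt_weighted = sum(w * d for w, d in zip(weights, d_values)) / sum(weights)

print(round(simple_mean, 2), round(simple_median, 2), round(sqrt_weighted, 2))
# prints: 0.36 0.36 0.28
```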

Thus, the (Nigerian weighted) African Black/White National difference comes out to about 40% of the Black total/White differential. And this is plausibly an overestimate. So it seems that African immigrants do pretty well. There are two important caveats here, though.

First, only about half of potential National Scholarship enrollees do enroll. About 3.5 million juniors are enrolled in public high schools per year. Since not all high school students attend public schools and since non-juniors can also enroll in the Scholarship Program, this implies that more than 3.5 million individuals could potentially enroll in the Scholarship Program every year. Since only 1.5 million, or about 50%, do, our enrollees plausibly could be unrepresentative in terms of aptitude of the high school population. Suggesting that this might be the case, College Board’s yearly Total Group Profile reports for the SAT show that PSAT takers score around 0.5 standard deviations better than non-takers. And those who took the PSAT both in their sophomore and junior year (about 40% of all PSAT test takers) did better, by about one third of a standard deviation, than those who took it only in their junior year. This superior performance is likely due in part to test practice but it could also be due to selection. Either way, we have a significant confound. If African Blacks have tiger parents, as the Triple Package argues, they might be pushed to enroll in the Scholarship Program at higher rates than all Blacks, throwing off our population estimates. This would lead to an overestimation of their mean aptitude. And if they practiced the PSAT more than African Americans in general, taking it in their sophomore year before enrolling in the Scholarship Program, they could benefit from a test practice bump, which would again lead to an overestimation of their mean aptitude. But if they enrolled at higher rates, they could be less selected: this would likely lead to an underestimation of their mean aptitude. There is no way to determine whether any or all of these potential biases exist.

Second, the above computations heavily depend on the assumption of equal variances across groups, specifically that the Black total and African Black PSAT variances are about the same. The sensitivity of the analysis to this assumption can be shown using Emil’s tail effect app. Below are the 2013 SAT scores for Black New Yorkers and all New Yorkers:

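New York 2013 SAT

| Group | N | Reading Mean | Reading SD | Math Mean | Math SD |
|---|---|---|---|---|---|
| New York, Total | 157989 | 485 | 114 | 501 | 120 |
| New York, Black | 22893 | 425 | 97 | 421 | 100 |
| d-value | | 0.54 | | 0.68 | |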

The Black/New York all difference came out to about 0.61 SD. Were we to compute the Black under-representativity using the pooled (Black/New York) standard deviation of 234 for both groups, we would get at the upper 1% a value of 5.66x, which converts back to 0.74 SD using the method employed above. Thus, our method only slightly overestimates group differences when identical standard deviations are used. However, were we to use the actual Black and New York standard deviations, we would get a Black under-representativity value of 36x, which back translates to a difference of 1.97 SD! And this is simply due to the Black standard deviation being about 20% less than the New York total one. In the case of all Blacks and African Blacks it is completely plausible, indeed it would be expected, that the former would have smaller standard deviations even adjusting for the Nigerian effect. But what magnitude of over-representativity would one get if all Blacks had a PSAT standard deviation which was 20% smaller than that of African ones, and if the two groups had identical means? 5x, which was what was found. Thus, without knowing the relevant standard deviations, the null hypothesis of no mean differences cannot be rejected. Now, to note, were we to adjust for possible differences in variance, our estimated Nigerian scores would likely not decrease, since this subgroup, being selected as it is, likely has a PSAT standard deviation that is smaller than that of all African Americans. Generally, it is difficult to explain the 11 fold over-representativity of Nigerians except in terms of high aptitude.
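The general point, that tail ratios are sensitive to differences in standard deviation even when means are equal, can be explored with a small Python sketch. It models two normal distributions and places the cutoff at the top 1% of the reference group. The function name and the cutoff definition are assumptions of this sketch; the tail effect app mentioned above may define its inputs differently, so results need not match the figures quoted in the text.

```python
from statistics import NormalDist


def tail_ratio(mean_a, sd_a, mean_b, sd_b, top_share_b=0.01):
    """Share of group A above a cutoff, relative to group B's share.

    The cutoff is placed so that `top_share_b` of group B lies above it.
    Both groups are assumed to be normally distributed.
    """
    group_a = NormalDist(mean_a, sd_a)
    group_b = NormalDist(mean_b, sd_b)
    cutoff = group_b.inv_cdf(1 - top_share_b)
    share_a_above = 1 - group_a.cdf(cutoff)
    return share_a_above / top_share_b


# Identical means throughout; only group A's SD varies relative to group B's.
for sd_a in (1.0, 1.1, 1.25):
    print(sd_a, round(tail_ratio(0.0, sd_a, 0.0, 1.0), 2))
```

Varying `sd_a` while holding the means equal shows how much over-representation a spread difference alone can generate under this simple model.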

Now, I had previously found that all Black immigrants performed between 0.84 (second generation) and 0.99 (first generation) standard deviations below Whites. How can we reconcile that with the finding above? In my 2014 paper, I had noted:

It has been suggested that the African migrant IQ might be on par with that of Whites; if so, the first and second generation Black /third+ generation White gaps would have to be driven by the under-performance of West Indian and other origin Blacks. This isn’t inherently statistically implausible since Black African immigrants, as shown below in Table 5, comprised only between 8 and 24% of the Black immigrant pool between 1980 and 2000, the immigrant cohorts which would have birthed most of the survey participants for the surveys analyzed. (Table 5 was based on the immigrant numbers presented in Capps et al. (2012); percentages were computed from immigrant numbers.) Of course, the conjecture becomes less and less plausible as time goes on — as Black Africans comprise a larger percent of the Black immigrant pool and as the Black immigrant performance fails to increase. (Fuerst, 2014)

I would suggest something along these lines. My Black immigrant data is robust and verifiable, being based upon many often large publicly available data-sets such as “The National Post-secondary Student Aid Study 2012 (NPSAS 2012)” with its sample size of 95,000. I doubt that those estimates were much off – but readers can readily check for themselves.

And mathematically there is no problem. Between 2000 and 2013, the period from which most of my samples came, African Blacks made up about 30% of the Black immigrant pool. If on average they were deficient by only 0.4 SD relative to Whites, then to make up for this smallish difference, non-African Black immigrants would merely have to have been deficient by 1.14 SD: (0.99 (for gen1) + 0.84 (for gen2))/2 = 0.3*0.4 (for African Blacks) + 0.7*x (for other Black immigrants); solving for x gives about 1.14. But as noted, it is quite possible that the method employed above overestimates the Black all/African Black difference, and so underestimates the African Black/White one.
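The mixture arithmetic above can be checked with a short Python sketch. The helper name is invented for illustration, and the inputs (0.99, 0.84, a 30% share, 0.4 SD) are the figures given in the text.

```python
def implied_other_gap(overall_gap, african_share, african_gap):
    """Solve overall = share * african_gap + (1 - share) * x for x."""
    return (overall_gap - african_share * african_gap) / (1 - african_share)


# Average of the first and second generation Black immigrant / White gaps
overall_gap = (0.99 + 0.84) / 2

other_gap = implied_other_gap(overall_gap, african_share=0.30, african_gap=0.40)
print(round(other_gap, 2))
# prints: 1.14
```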

Let us, for the sake of conjecture, suppose that these NAS based estimates are not so far off. If so, why do African Blacks do so well? And why do Nigerians apparently do no less well than Whites? I did look through the birth records of the African Blacks both in Texas and California. In California 16/20 were born in the US. And in Texas 16/33 were. Thus, from these two states, 60% of African semifinalists were U.S. born. This is somewhat of an over-representation – about 1.5x – since the Black African population in the mid to late 1990s, when the relevant 2nd generation African Black cohorts would have been born, was around 40% the size of what it was between 2010 and 2015. While, based on numbers alone, the majority of African Black students should be first generation immigrants, the majority of semifinalists do not seem to be.

African immigrants seem to have been more selected in the 1990s, so this could be a factor (7). Also, there could be a generational effect, with second generation Africans being more acculturated, etc. The relatively large number of high performing African born students is nonetheless curious given claims that African Black populations in Sub-Saharan Africa are disadvantaged, for example by parasite load or radiation exposure, in biologically based intelligence. Why, if so, are so many just-off-the-boat African Blacks performing so well? Either they must not be representative of their region of origin population or their home populations must not be biologically disadvantaged, either environmentally or otherwise. The latter possibility has some interesting implications for models of global cognitive differences, many of which are predicated on the assumption that differences are not just psychometrically but also physiologically real.

Excel file here.

(1) In a footnote in “The Son Also Rises”, Clark et al. (2014) peevishly credit Weyl for first “pursuing” the method: “The only author to pursue this line of inquiry was Nathaniel Weyl, whose Geography of American Achievement (1989) uses surnames to measure the status of groups of different ethnic origin. Weyl was a racist and was seeking by these means to show the presumed permanent superiority of those of Jewish and northern European descent.”

(2) Note, the files often need to be scanned into Optical Character Recognition (OCR) format first. The code is:

library(pacman) #provides p_load
p_load(tm, stringr)

#these three commands read the PDF
reader_func = readPDF()
r = reader_func(elem = list(uri = "NASMichigan2014OCR.pdf"), language = "en", id = "id1")
t = r$content
#change the PDF name above to change which file is read

# clean data --------------------------------------------------------------
#remove the first entry which is the year
t = t[-1]

#remove stuff that doesnt have 3 digit school code
t = t[str_detect(t, "\\d\\d\\d")]

# get school names --------------------------------------------------------
#get everything until the first digit
schools = str_match(t, "([A-Z \\.,&\\-']+) \\d")[, 2]

# get student names -------------------------------------------------------
#get every student for each school
students = str_match_all(t, "(\\d\\d\\d[A-Za-z ,\\.\\-]*)")

#keep only the matched group and strip spaces
students = lapply(students, function(x) {
  x = x[, 2] #only keep matched group
  str_trim(x)
})

# transform data to df ----------------------------------------------------
#to store data (a character matrix, converted to a data frame after the loop)
d = matrix(character(0), nrow = 0, ncol = 4)

#loop over each school
for (school_id in seq_along(schools)) {
  school = schools[school_id]
  for (student in students[[school_id]]) {
    #get the student's numbers
    tmp_num = str_match(student, "\\d\\d\\d")
    #get the full name
    tmp_name = str_match(student, "\\D+") %>% str_trim
    #get the first name
    tmp_firstname = str_split(tmp_name, ",")[[1]][1] %>% str_trim
    #get the last name
    tmp_lastname = str_split(tmp_name, ",")[[1]][2] %>% str_trim
    #into a vector
    tmp_vec = c("School" = school,
                "Number" = tmp_num,
                "First_name" = tmp_firstname,
                "Last_name" = tmp_lastname)
    d = rbind(d, matrix(tmp_vec, nrow = 1))
  }
}

colnames(d) = c("School", "Number", "First_name", "Last_name")
d = as.data.frame(d, stringsAsFactors = F)

# output data -------------------------------------------------------------
#this saves the file, change the name as necessary
write.csv(d, "NASMichigan2014OCR.csv")

(3) When the source was a news report, links were given in the Excel file. For some, PDF lists were available:

Arizona 2013
California 2012
Georgia 2014
Georgia 2015
Michigan 2014
South Carolina 2015
Texas 2012

(4) By “legacy Blacks” I refer to the descendants of Negroids who were brought to the Americas before the 20th century. By “African Blacks”, I mean recent immigrants and the children thereof who have mostly Negroid ancestry and who hail from Sub-Sahara Africa. By “Negroid” I mean the major human genealogically delineated division, or race, that historically occupied the Sub-Saharan African region and was historically referred to as “Negro”, “Negroid”, “Negrid”, “Ethiopian”, etc. on account of the typical complexion of the members of this major human line of descent.

(5) Requirements for enrollment include: that individuals are U.S. citizens or permanent residents, that they plan to enroll in college immediately after high school, and that they take the PSAT by junior year. (Source: How to Become a National Merit Semifinalist).

(6) The formula used was:


(7) African Black emigrants to the West are highly unrepresentative of their origin populations in terms of education. For example, according to the IAB brain-drain data bank’s Emigration rates file, in 2010 while 0.6% of Nigerians emigrated to the West, 12% of highly educated ones did. It is not clear if they are selected for by education per se or by some other factor, say general migrating ability, which correlates with education.


Corrections made 10/10/2015
