Donald is not the biggest loser (among winning and losing names)

From 2015 to 2016 there was a 10% drop in U.S. boys given the name Donald at birth, from 690 to 621, plunging the name from 900th to 986th in the overall rankings. Here is the trend in Donalds born from 1880 to 2016, shown on a log scale, from the Social Security names database.


That 2016 drop is relatively big in percentage terms, but the name has been dropping an average of 6% per year since 1957 (it dropped 26% in the 8 years after the introduction of Donald Duck in 1934). I really wish it were a popular name, so we could more easily see whether the rise of Donald Trump is a factor in this. With so few new Donalds, and the name already trending downward, there's no way to tell whether Trump fanatics are counterbalancing regular people turned off by the name.

Stability over change

How big is a fall of 69 births, which seems so trivial in relation to the 3.9 million children born last year? Among names with more than 5 births in each year, only 499 fell more, compared with 26,052 that fell less or rose. So Donald is definitely a loser.
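For reference, once the two years are merged (as in the Stata code at the end of this post), that comparison takes two lines. This is a sketch using the variable names from that code, not the exact commands I ran:

```stata
* count15/count16 come from merging the 2015 and 2016 SSA files,
* restricted to names with more than 5 births in each year
gen countchg = count16 - count15
count if countchg < -69              /* names that fell more than Donald */
count if countchg >= -69 & countchg < .   /* fell less, or rose */
```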

But I am always amazed at how little change there is in most names from year to year. It sounds obvious to describe a trend as rising or falling, but names are scarily regular in their annual changes given that the statistics from one year to the next reflect independent decisions by separate people who overwhelmingly don’t know each other.

Here is a way of visualizing the change in the number of babies given each name, from 2015 to 2016. There is one dot for each name. Those below the diagonal had a decrease in births, those above had an increase; the closer a dot is to the line, the less change there was. (To adjust for the 1% drop in total births, these are shown as births per 1,000 total born.)

2015-2016 count change

No name had a change of more than 1,700 births this year (Logan fell by 1,697, a drop of 13%; Adeline rose by 1,700, an increase of 71%). There just isn't much movement. I find that remarkable. (Among top names, James stands out this year: 14,773 born in 2015, rising by 3 to 14,776 in 2016.)

Here’s a look at the top right corner of that figure, just showing names with 3 per 1,000 or more births in either 2015 or 2016:

2015-2016 count change 3per1000

Note that most of these top names became less popular in 2016 (below the diagonal). That fits the long-term trend, well known by now, for names to become less popular over time, which means name diversity is increasing. I described that in the history chapter of my textbook, The Family, and, going further back, in this old blog post from 2011. (This great piece by Tristan Bridges explores why there is more diversity among female names, which you can see here in the fact that female names are outnumbered among the top names shown.)

Anyway, since I did it, here are the top 20 winners and losers, in numerical terms, in 2016. Wow, look at that catastrophic 21% drop in girls given the name Alexa (thanks, Amazon). I don’t know what’s up with Brandon and Blake. Your explanations will be as good as mine for these.



For the whole series of name posts on this blog, follow the names tag, including a bunch on the name Mary.

Here's the Stata code I used for the figure and tables (not including the long-term Donald trend). The dataset is in a zip file at Social Security, here. There is a separate file for each year. The code below runs on the two latest files: yob2015.txt and yob2016.txt.

* the yob files have no header row, so import delimited names the
* columns v1 (name), v2 (sex), and v3 (count)
import delimited "[path]\yob2016.txt"
sort v2 v1
rename v3 count16
save "[path]\n16.dta", replace

import delimited "[path]\yob2015.txt"
sort v2 v1
rename v3 count15
merge 1:1 v2 v1 using "[path]\n16.dta"
drop _merge

gen pctchg = 100*(count16-count15)/count15
drop if pctchg==. /* drops cases that don't appear in both years (5+ names) */

gen countchg = count16-count15
rename v2 sex
rename v1 name

gsort -count16
gen rank16 = _n

gsort -count15
gen rank15 = _n

gsort -countchg
gen riserank=_n

gsort countchg
gen fallrank=_n

gen rankchg = rank15-rank16

format pctchg %9.1f 
format count15 count16 countchg %15.0fc

gen prop15 = (count15/3978497)*1000 /* these are births per 1000, based on NCHS birth report for 15 & 16 */
gen prop16 = (count16/3941109)*1000

*winners table
sort riserank
list sex name count15 count16 countchg pctchg rank15 rank16 rankchg in 1/20, sep(0)

*losers table
sort fallrank
list sex name count15 count16 countchg pctchg rank15 rank16 rankchg in 1/20, sep(0)

*figure for all names
twoway (scatter prop16 prop15 if sex=="M", mc(blue) m(Oh) mlw(vvthin)) (scatter prop16 prop15 if sex=="F" , m(Oh) mc(pink) mlw(vvthin))

*figure for top names
twoway (scatter prop16 prop15 if sex=="M" & (prop15>=3 | prop16>=3), ml(name) ms(i) mlabp(0)) (scatter prop16 prop15 if sex=="F" & (prop15>=3 | prop16>=3), ml(name) ms(i) mlabp(0))


Filed under Me @ work

Marriage and gender inequality in 124 countries

Countries with higher levels of marriage have higher levels of gender inequality. This isn’t a major discovery, but I don’t remember seeing this illustrated before, so I decided to do it. Plus I’m trying to improve my Stata graphing.

I used data from this U.N. report on marriage rates from 2008, restricted to those countries that had data from 2000 or later. To show marriage rates I used the percentage of women ages 30-34 who are currently married. This combines marriage prevalence and marriage timing, so it captures something like the amount of marriage in the country. I got gender inequality from the U.N. Development Programme's Human Development Report for 2015. The gender inequality index combines the maternal mortality ratio, the adolescent birth rate, the representation of women in the national parliament, the gender gap in secondary education, and the gender gap in labor market participation.

Here is the result. I labeled countries with 49 million population or more in red; a few interesting outliers are also labeled. The line is quadratic, unweighted for population (click to enlarge).

You can see the USA sliding right down that curve toward gender nirvana (not that I’m making a simplistic causal argument).

Note that India and China together are about 36% of the world’s population. They both have nearly universal marriage by age 30-34, but women in China get married about four years later on average. That’s an important part of why China has lower gender inequality (it goes along with more educational access, higher employment levels, politics, history, etc.). China is a major outlier among universal-marriage countries, while India is right on the curve.

Any cross-national comparison has to deal with this issue of scale: China is 139 times bigger than Sweden. One way to address it is to weight the points by their relative population sizes. Doing that actually doesn't change the result much, except for China, which in this case changes everything, because in addition to being huge it breaks the relationship between marriage and gender inequality. Here is the comparison. Now the dots are scaled for population, and the gray line is fit to all the countries except China, while the red line includes China (click to enlarge).
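For what it's worth, the weighted version of the figure can be sketched in Stata roughly like this. The variable names marriage, gii, pop, and country here are my stand-ins for whatever the posted files use:

```stata
* population-weighted scatter with two quadratic fits:
* gray = excluding China, red = including China
* (variable names are placeholders for those in the posted files)
twoway (scatter gii marriage [aweight=pop], m(Oh) mc(blue)) ///
       (qfit gii marriage [aweight=pop] if country!="China", lc(gs8)) ///
       (qfit gii marriage [aweight=pop], lc(red)), ///
       ytitle("Gender inequality index") xtitle("% of women 30-34 married")
```

The aweight on scatter is what scales the dots for population; the same weight on qfit makes the fitted curves population-weighted too.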

My conclusion is that the gray line is the basic story — more marriage, more gender inequality — with China as an important exception, but that’s up for interpretation.

I put the data and the code for making the charts in this directory. Feel free to copy and crib, etc.



Stop me before I fake again

In light of the news on social science fraud, I thought it was a good time to report on an experiment I did. I realize my results are startling, and I welcome the bright light of scrutiny that such findings might now attract.

The following information is fake.

An employee training program in a major city promises basic job skills as well as job search assistance for people with a high school degree and no further education, ages 23-52 in 2012. Due to an unusual staffing practice, new applicants were for a period in 2012 allocated at random to one of two caseworkers. One provided the basic services promised but nothing extra. The other embellished his services with extensive coaching on such "soft skills" as "mainstream" speech patterns, appropriate dress for the workplace, and a hard work ethic, among other elements. The program surveyed the participants in 2014 to see what their earnings were in the previous 12 months. The data provided to me does not include any information on response rates, or any information about those who did not respond. And it only includes participants who were employed at least part-time in 2014. Fortunately, the program also recorded which staff member each participant was assigned to.

Since this provides such an excellent opportunity for studying the effects of soft skills training, I think it’s worth publishing despite these obvious weaknesses. To help with the data collection and analysis, I got a grant from Big Neoliberal, a non-partisan foundation.

The data includes 1040 participants, 500 of whom had the bare-bones service and 540 of whom had the soft-skills add-on, which I refer to as the “treatment.” These are the descriptive statistics:


As you can see, the treatment group had higher earnings in 2014. The difference in logged annual earnings between the two groups is statistically significant.


As you can see in Model 1, the Black workers earned significantly less in 2014 than the White workers. This gap of .15 logged earnings points, or about 15%, is consistent with previous research on the race wage gap among high school graduates. Model 2 shows that the treatment training apparently was effective, raising earnings about 11%. However, the interactions in Model 3 confirm that the benefits of the treatment were concentrated among the Black workers: the non-Black workers did not receive a significant benefit, and the treatment effect among Black workers basically wiped out the race gap.

The effects are illustrated, with predicted values from the models, in this figure:


Soft skills are awesome.

I have put the data file, in Stata format, here.


What would you do if you saw this in a paper or at a conference? Would you suspect it was fake? Why or why not?

I confess I never seriously thought of faking a research study before. In my day coming up in sociology, people didn’t share code and datasets much (it was never compulsory). I always figured if someone was faking they were just changing the numbers on their tables to look better. I assumed this happens to some unknown, and unknowable, extent.

So when I heard about the LaCour & Green scandal, I thought whoever did it was tremendously clever. But when I looked into it more, it seemed like not such rocket science after all. So I gave it a try.


I downloaded a sample of adults 25-54 from the 2014 ACS via IPUMS, with annual earnings, education, age, sex, race and Hispanic origin. I set the sample parameters to meet the conditions above, and then I applied the treatment, like this:

First, I randomly selected the treatment group:

gen temp = runiform()
gen treatment=0
replace treatment = 1 if temp >= .5
drop temp

Then I generated the basic effect, and the Black interaction effect:

gen effect = rnormal(.08,.05)
gen beffect = rnormal(.15,.05)

Starting with the logged wage variable, lnwage, I added the basic effect to all the treated subjects:

gen newlnwage = lnwage /* copy, so untreated cases keep their original values */
replace newlnwage = newlnwage+effect if treatment==1

Then I added the Black interaction effect to the treated Black subjects, and subtracted it from the non-treated ones:

replace newlnwage = newlnwage+beffect if (treatment==1 & black==1)
replace newlnwage = newlnwage-beffect if (treatment==0 & black==1)

This isn’t ideal, but when I just added the effect I didn’t have a significant Black deficit in the baseline model, so that seemed fishy.

That’s it. I spent about 20 minutes trying different parameters for the fake effects, trying to get them to seem reasonable. The whole thing took about an hour (not counting the write-up).

I put the complete fake files here: code, data.

Would I get caught for this? What are we going to do about this?


In the comments, ssgrad notices that if you exponentiate (unlog) the incomes, you get a funny list: some are binned at whole numbers, as you would expect from a survey of incomes, and some are random-looking and go out to multiple decimal places. For example, one person reports an even $25,000, and another supposedly reports $25,251.37. This wouldn't show up in the descriptive statistics, but it is kind of obvious in a list. Here is a list of people with incomes between $20,000 and $26,000, broken down by race and treatment status. I rounded to whole numbers, because even without the decimal points you can see that the only people who report normal incomes are non-Blacks in the non-treatment group. Busted!
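In Stata, using the variable names from the fake data file (newlnwage, black, treatment), that check looks something like this sketch:

```stata
* unlog the earnings and flag values that come out (nearly) whole --
* untouched survey responses unlog to whole dollars, faked ones don't
gen earnings = exp(newlnwage)
gen whole = abs(earnings - round(earnings)) < .01
sort earnings
list black treatment earnings whole if inrange(earnings, 20000, 26000), sep(0)
```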

fake-busted-table

So, that only took a day — with a crowd-sourced team of thousands of social scientists poring over the replication file. Faith in the system restored?


Filed under In the news, Research reports