In light of the news on social science fraud, I thought it was a good time to report on an experiment I did. I realize my results are startling, and I welcome the bright light of scrutiny that such findings might now attract.
The following information is fake.
An employee training program in a major city promises basic job skills as well as job search assistance for people ages 23-52 in 2012 who have a high school degree and no further education. Due to an unusual staffing practice, new applicants were for a period in 2012 allocated at random to one of two caseworkers. One provided the basic services promised but nothing extra. The other embellished his services with extensive coaching on such “soft skills” as “mainstream” speech patterns, appropriate dress for the workplace, and a hard work ethic, among other elements. The program surveyed the participants in 2014 to see what their earnings were in the previous 12 months. The data provided to me include no information on response rates, or about those who did not respond, and they cover only participants who were employed at least part-time in 2014. Fortunately, the program also recorded which staff member each participant was assigned to.
Since this provides such an excellent opportunity for studying the effects of soft skills training, I think it’s worth publishing despite these obvious weaknesses. To help with the data collection and analysis, I got a grant from Big Neoliberal, a non-partisan foundation.
The data includes 1040 participants, 500 of whom had the bare-bones service and 540 of whom had the soft-skills add-on, which I refer to as the “treatment.” These are the descriptive statistics:
As you can see, the treatment group had higher earnings in 2014, and the difference in logged annual earnings between the two groups is statistically significant.
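(If you're following along with the data file posted below, that comparison is just a two-sample t-test; the variable names here assume the posted file:)

ttest newlnwage, by(treatment)    // mean logged earnings, by caseworker group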
As you can see in Model 1, the Black workers earned significantly less in 2014 than the White workers. This gap of .15 logged earnings points, or about 15%, is consistent with previous research on the race wage gap among high school graduates. Model 2 shows that the treatment apparently was effective, raising earnings about 11%. However, the interactions in Model 3 confirm that the benefits of the treatment were concentrated among the Black workers: the non-Black workers did not receive a significant benefit, and the treatment effect among Black workers essentially wiped out the race gap.
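For concreteness, the three models are plain OLS regressions of logged earnings. A sketch using Stata's factor-variable notation and the posted file's variable names (not necessarily my exact code):

reg newlnwage i.black                 // Model 1: race gap
reg newlnwage i.black i.treatment     // Model 2: add the treatment effect
reg newlnwage i.black##i.treatment    // Model 3: race-by-treatment interaction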
The effects are illustrated, with predicted values from Model 3, in this figure:
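A figure like that can be produced from Model 3 with margins and marginsplot; again, a sketch assuming the same variable names:

reg newlnwage i.black##i.treatment
margins black#treatment    // predicted logged earnings for each cell
marginsplot                // plot the four predictions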
Soft skills are awesome.
I have put the data file, in Stata format, here.
What would you do if you saw this in a paper or at a conference? Would you suspect it was fake? Why or why not?
I confess I never seriously thought of faking a research study before. In my day coming up in sociology, people didn’t share code and datasets much (it was never compulsory). I always figured if someone was faking they were just changing the numbers on their tables to look better. I assumed this happens to some unknown, and unknowable, extent.
So when I heard about the LaCour & Green scandal, I thought whoever did it was tremendously clever. But when I looked into it more, I realized it was not such rocket science. So I gave it a try.
I downloaded a sample of adults ages 25-54 from the 2014 ACS via IPUMS, with annual earnings, education, age, sex, race, and Hispanic origin. I set the sample parameters to meet the conditions above (a sketch of those restrictions follows), and then applied the treatment.
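Something like this would do it, if you're using standard IPUMS extract variable names (age, educd, incwage); the educd codes here are my guess at "high school diploma or GED, no college," so check them against your extract's codebook:

keep if age >= 25 & age <= 54      // ages 23-52 in 2012
keep if educd >= 62 & educd <= 64  // HS diploma or GED, nothing further
keep if incwage > 0                // employed at least part-time, with earnings
gen lnwage = ln(incwage)           // logged annual earnings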
First, I randomly selected the treatment group:
gen temp = runiform()               // uniform draw on [0,1)
gen treatment = 0
replace treatment = 1 if temp >= .5 // about half end up treated
Then I generated the basic effect and the Black interaction effect:

gen effect = rnormal(.08,.05)   // per-person basic effect: mean .08, SD .05
gen beffect = rnormal(.15,.05)  // per-person Black interaction: mean .15, SD .05
Starting with the logged wage variable, lnwage, I copied it into newlnwage and added the basic effect for all the treated subjects:

gen newlnwage = lnwage
replace newlnwage = lnwage + effect if treatment==1
Then I added the Black interaction effect to the treated Black subjects, and subtracted it from the non-treated ones:
replace newlnwage = newlnwage + beffect if (treatment==1 & black==1)
replace newlnwage = newlnwage - beffect if (treatment==0 & black==1)
This isn't ideal, but when I just added the effect to the treated group, I didn't get a significant Black deficit in the baseline model, which seemed fishy; subtracting the effect from the untreated Black subjects is what builds the baseline race gap into the data.
That’s it. I spent about 20 minutes trying different parameters for the fake effects, trying to get them to seem reasonable. The whole thing took about an hour (not counting the write-up).
I put the complete fake files here: code, data.
Would I get caught for this? What are we going to do about it?
In the comments, ssgrad notices that if you exponentiate (unlog) the incomes, you get a funny list: some are binned at whole numbers, as you would expect from a survey of incomes, and some are random-looking and run to multiple decimal places. For example, one person reports an even $25,000, and another supposedly reports $25,251.37. This wouldn't show up in the descriptive statistics, but it is kind of obvious in a list. Here is a list of people with incomes between $20,000 and $26,000, broken down by race and treatment status. I rounded to whole numbers because even without the decimal points you can see that the only people who report round-number incomes are non-Blacks in the non-treatment group. Busted!
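If you want to run ssgrad's check yourself, a minimal sketch against the posted data file (assuming its variables are newlnwage, black, and treatment):

gen wage = exp(newlnwage)
// flag earnings that aren't whole dollars (with tolerance for floating-point noise)
gen fake_flag = abs(wage - round(wage)) > .01
list wage fake_flag black treatment if inrange(wage, 20000, 26000), sep(0)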
So, that only took a day — with a crowd-sourced team of thousands of social scientists poring over the replication file. Faith in the system restored?