Stop me before I fake again

In light of the news on social science fraud, I thought it was a good time to report on an experiment I did. I realize my results are startling, and I welcome the bright light of scrutiny that such findings might now attract.

The following information is fake.

An employee training program in a major city promises basic job skills as well as job search assistance for people with a high school degree and no further education, ages 23-52 in 2012. Due to an unusual staffing practice, new applications were, for a period in 2012, allocated at random to one of two caseworkers. One provided the basic services promised but nothing extra. The other embellished his services with extensive coaching on such “soft skills” as “mainstream” speech patterns, appropriate dress for the workplace, and a hard work ethic, among other elements. The program surveyed the participants in 2014 to see what their earnings were in the previous 12 months. The data provided to me does not include any information on response rates, or any information about those who did not respond. And it only includes participants who were employed at least part-time in 2014. Fortunately, the program also recorded which staff member each participant was assigned to.

Since this provides such an excellent opportunity for studying the effects of soft skills training, I think it’s worth publishing despite these obvious weaknesses. To help with the data collection and analysis, I got a grant from Big Neoliberal, a non-partisan foundation.

The data includes 1040 participants, 500 of whom had the bare-bones service and 540 of whom had the soft-skills add-on, which I refer to as the “treatment.” These are the descriptive statistics:

[fake-descriptives: descriptive statistics by treatment group]

As you can see, the treatment group had higher earnings in 2014. The difference in logged annual earnings between the two groups is statistically significant.
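For anyone who downloads the posted file, the group comparison can be checked with a minimal sketch like this (the filename is hypothetical; newlnwage and treatment are the variable names from the code in the Details section below):

* load the posted data (hypothetical filename)
use "fake-data.dta", clear

* descriptive statistics by group
bysort treatment: summarize newlnwage

* compare mean logged earnings across the two groups
ttest newlnwage, by(treatment)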

[fake-ols-results: OLS regression models of logged 2014 earnings]

As you can see in Model 1, Black workers earned significantly less in 2014 than White workers. This gap of .15 logged earnings points, or about 15%, is consistent with previous research on the race wage gap among high school graduates. Model 2 shows that the treatment apparently was effective, raising earnings by about 11%. However, the interactions in Model 3 confirm that the benefits of the treatment were concentrated among the Black workers. The non-Black workers did not receive a significant benefit, and the treatment effect among Black workers basically wiped out the race gap.
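The models can be reproduced from the posted data along these lines (a sketch, assuming no controls beyond those discussed; the actual table may include additional covariates):

* Model 1: baseline race gap in logged 2014 earnings
regress newlnwage i.black

* Model 2: add the treatment indicator
regress newlnwage i.black i.treatment

* Model 3: treatment-by-race interaction
regress newlnwage i.treatment##i.black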

The effects are illustrated, with predicted values, in this figure:

[fake-marginsplot: predicted values by race and treatment status]
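The figure can be reproduced roughly like this (again a sketch, not necessarily the exact commands used):

* predicted values by treatment and race, from the interaction model
regress newlnwage i.treatment##i.black
margins treatment#black
marginsplot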

Soft skills are awesome.

I have put the data file, in Stata format, here.

Discussion

What would you do if you saw this in a paper or at a conference? Would you suspect it was fake? Why or why not?

I confess I never seriously thought of faking a research study before. In my day coming up in sociology, people didn’t share code and datasets much (it was never compulsory). I always figured if someone was faking they were just changing the numbers on their tables to look better. I assumed this happens to some unknown, and unknowable, extent.

So when I heard about the LaCour & Green scandal, I thought whoever did it was tremendously clever. But when I looked into it more, I realized it was not exactly rocket science. So I gave it a try.

Details

I downloaded a sample of adults 25-54 from the 2014 ACS via IPUMS, with annual earnings, education, age, sex, race and Hispanic origin. I set the sample parameters to meet the conditions above, and then I applied the treatment, like this:

First, I randomly selected the treatment group:

* assign a uniform random number to each case, then split at .5
gen temp = runiform()
gen treatment = 0
replace treatment = 1 if temp >= .5
drop temp
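A quick tabulation (my addition, not part of the original code) confirms the split into the 500 and 540 cases reported above:

tab treatment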

Then I generated the basic effect, and the Black interaction effect:

* each observation gets its own draw: a basic effect with mean .08 (sd .05),
* and an additional Black interaction effect with mean .15 (sd .05)
gen effect = rnormal(.08,.05)
gen beffect = rnormal(.15,.05)

Starting with the logged wage variable, lnwage, I added the basic effect to all the treated subjects:

gen newlnwage = lnwage
replace newlnwage = lnwage + effect if treatment==1

Then I added the Black interaction effect to the treated Black subjects, and subtracted it from the non-treated Black subjects:

replace newlnwage = newlnwage+beffect if (treatment==1 & black==1)
replace newlnwage = newlnwage-beffect if (treatment==0 & black==1)

This isn’t ideal, but when I just added the effect I didn’t have a significant Black deficit in the baseline model, so that seemed fishy.

That’s it. I spent about 20 minutes experimenting with different parameters for the fake effects, trying to get them to seem reasonable. The whole thing took about an hour (not counting the write-up).

I put the complete fake files here: code, data.

Would I get caught for this? What are we going to do about this?

BUSTED UPDATE:

In the comments, ssgrad notices that if you exponentiate (unlog) the incomes, you get a funny list — some are binned at whole numbers, as you would expect from a survey of incomes, and some are random-looking and go out to multiple decimal places. For example, one person reports an even $25,000, and another supposedly reports $25,251.37. This wouldn’t show up in the descriptive statistics, but is kind of obvious in a list. Here is a list of people with incomes between $20,000 and $26,000, broken down by race and treatment status. I rounded to whole numbers because even without the decimal points you can see that the only people who report normal incomes are non-Blacks in the non-treatment group. Busted!

[fake-busted-table: incomes between $20,000 and $26,000 by race and treatment status]

So, that only took a day, with a crowd-sourced team of thousands of social scientists poring over the replication file. Faith in the system restored?
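For anyone who wants to reproduce the check, here is a sketch along the lines ssgrad describes (the filename is hypothetical; newlnwage, black, and treatment are the variables from the code above):

* unlog the earnings variable in the posted file
use "fake-data.dta", clear
gen earnings = exp(newlnwage)

* flag survey-style whole-dollar amounts, allowing for floating-point error
gen whole = abs(earnings - round(earnings)) < .01

* whole-dollar incomes show up only among untreated non-Black cases
tab black treatment if whole == 1

* the list of cases with incomes between $20,000 and $26,000
sort black treatment
list earnings black treatment if inrange(earnings, 20000, 26000), sepby(black treatment)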

13 thoughts on “Stop me before I fake again”

  1. Clearly this dude got caught, so he wasn’t all that clever after all. I guess it is harder to catch in originally collected datasets like this. But I have faith in the grad students of the future, who will continue to challenge research like this. And the swift and public response will hopefully be some kind of deterrent. This guy will never work in academia again.


  2. Don’t know if anyone would look at this. But suppose that someone wanted to replicate your treatment, didn’t find an effect, and then went back to your data to double-check. One thing that looks really strange is that if you exponentiate newlnwage to look at the underlying distribution of wages, you see an odd distribution: some lumping at integer values (which is what you’d expect from a survey) and then a lot of values that are nonsensical for reported wages (e.g., a reported wage of $2,369.084). Then they might look a little more closely and notice that virtually all of the integer values are for non-treated cases and all of the nonsensical values are for treated cases. So this might seem suspicious and lead someone to request details on how you collected wage data with such precision. Of course, with a little more work, you could fix this issue by paying more attention to generating data that is consistent with how it would be collected on a survey rather than how it might look in the final step before an analysis.


      1. This is perhaps a currently unpopular view, but I think that faking data is actually not such a big problem for a few reasons. First, it seems that it’s a bit tricky to get all the details right (i.e., binning values, maintaining reasonable associations with other covariates, …), so high profile findings that attract attention will probably be exposed. Second, even if high profile findings are not exposed, they simply won’t replicate (or perhaps they happen to coincide with reality and they will replicate) and “science” is not set back that far — just a bit of wasted effort from some researchers (though whether the effort is wasted is arguable if we count establishing null results as a legitimate enterprise). Third, I suspect outright faking data is much much less common than finding results in noise and publishing them (p-hacking, whether intentional or not), so my sense is that the findings in any given area are already gummed up with a significant amount of noise that is unlikely to be substantially increased by the potential presence of a few faked studies.

        In any case, I get the moral outrage and it seems unfair that there are probably some people out there maintaining (low) profile careers based on fraudulent data. I also think that it makes sense to develop norms around sharing code and data (when feasible), and encouraging replication, primarily because it will save time and perhaps reduce the amount of noise in the system a little bit. I just don’t think the issue of faked data is something that we need to spend a great deal of time worrying about.


        1. Great point about the limited influence. The only case in which the results would have big consequences, then, would be if it were something difficult to replicate, important, and wrong, like “I found gold on the moon!”


  3. I can’t stop thinking about this post. On occasion I create datasets or examples for my classes. I of course let them know the data are fake and for the purposes of demonstrating a particular point. Most of the time it only takes a few minutes working from an existing dataset. While it never occurred to me to fake data for other purposes, your post really brings to light not only the ease of faking data but also how important it will be to consider your final question of how we will solve this problem. It will surely involve multiple angles of approach, but I think transparency and replication will be key.


  4. There is a good overview today in the Times. And what seems to me the best reporting on the issue is by Maria Konnikova in The New Yorker (http://www.newyorker.com/science/maria-konnikova/how-a-gay-marriage-study-went-wrong?intcid=mod-latest).

    And I totally agree with ssgrad that publishing a significant result that is really noise is a bigger problem. There’s been good research on how “the truth wears off,” also nicely summarized in The New Yorker by Jonah Lehrer (December 13, 2010). And the truth wears off not only in the social sciences but for new “wonder drugs” as well.


  5. Here is the problem:
    If a study produces results confirming your biases, then you are less likely to scrutinize it. This problem affects everyone without exception, and you would have to be constantly self-aware to even START trying to fight against this phenomenon.

    Here is the solution:
    Maintain diversity of views within academia. Then your biases will be offset by my biases.

    Here is the problem:
    The social sciences are becoming less and less diverse (in the sense of people holding genuinely diverse opinions). For several decades there have been fewer and fewer conservatives in the social sciences, for example (my favourite example), and nothing seems to change it. And what’s even more frightening, people usually even refuse to see it as a problem.

