How random error and dirty data made Regnerus even wronger than we thought

The news here is nothing I have to say. It is the new article by Simon Cheng and Brian Powell, available in prepublication form, which methodically flays the infamous Regnerus paper, leaving nothing but a wisp of foul-smelling ill will trailing from its remains. (The paper is here, where it is paywalled; feel free to email me. Follow the whole story at the Regnerus tag.)

Cheng and Powell reanalyzed the Regnerus data, the New Family Structures Survey (NFSS), to see what would have happened if Regnerus had done the data processing and analysis right. This goes beyond the logical flaws and biases that were inherent in the study design (discussed here), to find the coding and analysis errors. A few examples:

  • So much for “raised by…” 24 of the 236 people coded as having a “lesbian mother” or “gay father” — because they reported one of their parents ever had a same-sex romantic relationship (I’ll use LM and GF here to refer to Regnerus’s codes, not reality) — never lived with the parent in question! We had known previously that a large number (138) had never lived with the partner in the romantic relationship, but this is a whole nother level of wrong. A total of 58 of the LM/GF sample were reported to have lived with the supposedly gay or lesbian parent for a single year or less.
  • Bad cases. The most ridiculous is the “25 year-old man who reports that his father had a romantic relationship with another man, but also reports that he (the respondent) was 7-feet 8-inches tall, weighed 88 pounds, was married 8 times and had 8 children.” Another reported being arrested for the first time at age 1. Real data collectors scrutinize cases like that and throw them out or find a way to fix them. (Really good data collectors stop the person — or the data entry — right when they say something outrageous, to see if they’re sure.)
  • Illogical cases. There are a lot of these, including the person who reported “having always lived alone but also claims to have always lived with mother, father, and two grandparents.”

Then there is a series of bad analysis and modeling decisions Regnerus made, such as coding people who refused to answer a question as 0 instead of missing, or using the wrong kind of statistical model for the particular outcome.
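To see why the refusal coding matters, here is a minimal sketch with made-up answers (not NFSS data). It summarizes the same yes/no item twice: once with refusals treated as missing, once with refusals counted as 0. The list `responses` and both function names are hypothetical, chosen only for illustration.

```python
# Hypothetical yes/no item: 1 = yes, 0 = no, None = refused to answer.
# These ten values are invented for illustration; they are not NFSS data.
responses = [1, 1, 0, 1, None, None, 0, 1, None, 1]


def rate_refusals_missing(values):
    """Share answering yes among people who actually answered."""
    answered = [v for v in values if v is not None]
    return sum(answered) / len(answered)


def rate_refusals_as_zero(values):
    """Share answering yes when every refusal is silently counted as a no."""
    return sum(0 if v is None else v for v in values) / len(values)


print(rate_refusals_missing(responses))  # 5 yes out of 7 answers
print(rate_refusals_as_zero(responses))  # 5 yes out of 10 rows
```

With three refusals in ten rows, the estimated rate drops from about 71% to 50%. If refusals are more common in one comparison group than in another, the zero coding can produce a group difference that is not in the answers people actually gave.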

When they get done with it, there really is no reliable, significant negative outcome associated with having lived any appreciable amount of time with a parent who might have been gay or lesbian. There’s more to it, but I don’t want to discourage you from reading the paper.

Random error, correlated outcome

Some of the “misclassified or uncertain” cases also report serious problems in adulthood, exhibiting higher-than-average rates of suicidality, depression, drinking to get drunk, and having a poor relationship with their mothers. So those could be people whose difficult lives rendered them unable to complete the life history calendar correctly. But there is also a chance that, like the 7’8″ guy, there are people just answering some of the questions at random. These were people taking the survey alone on a computer, with no supervision, and getting paid to be part of the sample. Clicking at random is not out of the question (one person took only 10 minutes to complete the lengthy survey).

Contrary to what you might assume, clicking at random does not always produce random results. I’ll illustrate this with an example. First, here’s another tidbit from Regnerus, which might fit this point. Speaking to some Franciscans in 2014, Regnerus (just after 9:00 of this video) was going on about sexual fluidity as a condition of modernity, when he dropped in this fact from the NFSS:

Despite comprising a mere 1.3 percent of the population, respondents in the NFSS [New Family Structures Survey] who said that their mothers have had a same-sex sexual relationship made up 15 [50?] percent of all the asexual identifiers in the NFSS. So, 15 [50?] percent of them come from 1.3 percent of the population. [I originally transcribed those as 50%, but on second listening I think he said 15%, but I can’t be sure.]

His raised eyebrow here is to indicate the deeply depraved nature of lesbian mothers — maybe it’s genetic, or maybe it’s child abuse — but… he lets the numbers speak for themselves. Lesbian mothers, asexual children.

Here’s how this works. If you are trying to find people in two rare conditions — for example, those with lesbian mothers and those who are asexual — and a small portion of your sample answers questions at random, not only will you have a relatively large number of false positives on your conditions, your rare conditions will also falsely appear to be correlated.

I’m sure I didn’t discover this, and I don’t have a mathematical proof for it, but it’s logical. And I confirmed it with an experiment, as follows.

Say you have a sample of 1000 people, and you’re studying two conditions that occur on average in one out of every 500 cases. I’ll call them “climbing Mt. Everest” and “going to the moon.” In your thousand cases, you will on average have 2 people who did each thing. The chances that the same person did both are probably really low (you do the maths). But, if just 1% of your sample — 10 people — answer those two yes/no questions at random, look out!
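For anyone who would rather not do the maths by hand, here is a small sketch of the arithmetic. It assumes the two questions are answered independently, and that a random answerer says yes to each question half the time; that 1/2 is an assumption of this sketch, as are the names `P_TRUE`, `P_RANDOM` and `expected_both`.

```python
from fractions import Fraction

P_TRUE = Fraction(1, 500)   # honest chance of a yes on each question
P_RANDOM = Fraction(1, 2)   # assumed: a random clicker says yes half the time


def expected_both(n_honest, n_random):
    """Expected number of people answering yes to both questions."""
    return n_honest * P_TRUE**2 + n_random * P_RANDOM**2


print(expected_both(1000, 0))   # all 1000 answer truthfully
print(expected_both(990, 10))   # 990 truthful, 10 answering at random
```

The first figure is 1/250 of a person, so an honest sample will almost never contain someone who did both. The second is about 2.5 people, nearly all of it coming from the 10 random answerers.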

I created this scenario using Excel’s random-number function. With 990 people answering truthfully — that is, given a 1/500 chance of saying yes to each question — and 10 answering them both randomly, this is what I got: 6 people who had climbed Mt. Everest, and 8 people who had gone to the moon. But shockingly, there were 4 people who had done both — that is 67% of the mountain climbers and 50% of the moonshotters. You can’t know, from looking at the data, but I can, that all of the people who went on both adventures were in the tiny group of random answerers.
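The same experiment can be sketched in Python instead of Excel. This is not the original spreadsheet, only a reconstruction under stated assumptions: random answerers say yes to each question with probability 1/2, everyone else says yes with probability 1/500, and the correlation is the ordinary Pearson correlation between the two yes/no answers. All function names here are hypothetical.

```python
import random

N_TOTAL = 1000      # sample size from the example above
N_RANDOM = 10       # respondents answering both questions at random
P_TRUE = 1 / 500    # honest chance of a yes on each question


def simulate(n_random=N_RANDOM, seed=0):
    """Return a list of (everest, moon) yes/no pairs for one simulated sample."""
    rng = random.Random(seed)
    rows = []
    for i in range(N_TOTAL):
        # Assumption: random answerers say yes with probability 1/2.
        p = 0.5 if i < n_random else P_TRUE
        rows.append((rng.random() < p, rng.random() < p))
    return rows


def phi(rows):
    """Pearson correlation of two yes/no variables; None if undefined."""
    n = len(rows)
    a = sum(e for e, _ in rows)
    b = sum(m for _, m in rows)
    both = sum(e and m for e, m in rows)
    denom = (a * b * (n - a) * (n - b)) ** 0.5
    return (n * both - a * b) / denom if denom else None


def mean_phi(n_random, runs=200):
    """Average the correlation over many simulated samples."""
    values = [phi(simulate(n_random, seed)) for seed in range(runs)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


if __name__ == "__main__":
    print("with 10 random answerers:", round(mean_phi(10), 2))
    print("with no random answerers:", round(mean_phi(0), 2))
```

Any single run is noisy, so one draw can land well above or below the average. Averaging over many seeds should show a clearly positive correlation when the 10 random answerers are present and a correlation close to zero when they are not.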

Here are the 1000 cases in random order, with green showing Everest-only cases, blue showing moon-only cases, and red showing positive answers to both questions. And here’s the statistic: in the total sample (990 serious survey takers and 10 jokers) the correlation between climbing Mt. Everest and going to the moon is .53!

[Figure: rare event errors.xlsx]

Maybe Regnerus is just an incredibly, irresponsibly bad researcher, who didn’t conduct the simplest data checks before rushing to publish his paper. Or maybe he is a diabolical genius, and he realized that high random error rates in both his rare independent variable and his rare dependent variables would produce results showing poor outcomes for children of gays and lesbians.

In the Cheng and Powell paper, their various procedures and corrections wipe out many of Regnerus’s negative outcomes for GF/LM respondents before they tackle the “misclassified or uncertain” cases. But when they do that, some of the last coefficients to fall to non-significance are indeed relatively rare: having suicidal thoughts (7%), not being “entirely heterosexual” (15%), having had an STI (11%), and having had forced sex (13%). Each of these becomes non-significant when the bad cases are controlled in the Cheng and Powell models. I haven’t worked out a proof (ever), but I reckon that the rarer they are, the more likely they are to be correlated with the rare independent variable (LM/GF) if some people are answering at random, which they apparently were.
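Short of a proof, that hunch can at least be checked in a toy model. The assumptions below are for illustration only and do not come from the Cheng and Powell paper: two yes/no items that are truly independent and share the same honest base rate `p`, plus a share `r` of respondents (1% here, as in the Everest example) who answer both items by coin flip. The function name `population_phi` is hypothetical.

```python
def population_phi(p, r=0.01):
    """Correlation between two yes/no items in a mixed population.

    p: honest base rate of each item (items assumed independent).
    r: assumed share of respondents answering both items by coin flip.
    """
    p_yes = (1 - r) * p + r * 0.5          # P(yes) on either item
    p_both = (1 - r) * p * p + r * 0.25    # P(yes on both items)
    return (p_both - p_yes * p_yes) / (p_yes * (1 - p_yes))


for base_rate in (0.002, 0.07, 0.15, 0.5):
    print(base_rate, round(population_phi(base_rate), 3))
```

Under these assumptions the covariance between the two items works out to r(1 - r)(p - 1/2)^2, which shrinks as p approaches 1/2, while the variance grows. So in this simple model the same share of random answerers produces more spurious correlation the rarer the items are, and none at all when the honest base rate is 50%.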

Anyway, the Cheng and Powell paper speaks for itself. But I find it interesting that unchecked data error produces false positive (that is, negative) outcomes for marginal groups. Look out!

15 thoughts on “How random error and dirty data made Regnerus even wronger than we thought”

  1. I hope people realize that much of what is pointed out in terms of data quality also holds for all of the on-line, non-random, “panels” that have proliferated as if they are different studies. They are not. It isn’t just Regnerus, several scholars have come to the opinion that quick and dirty non-random data are acceptable. I have reviewed three papers this year at top tier journals that used these garbage data. You can’t estimate population parameters from a convenience sample, particularly when there is no attempt to make sure that the responses are serious.

    1. Sherkat, you may find it fascinating to know that these panels are standard operating procedure in marketing research. It’s cheap, quick, and dirty, and some companies just love them. My firm doesn’t do them but we outsource at client request. And really, we’ll probably do them eventually: we’ve built a database of qual respondents. They’re more or less “professional market research participants”, which annoys the hell out of me. I hate interviewing the same person for two separate projects, which has happened before. Anyway, when that database gets big enough, some of our execs want to start using it for online. Welcome to social science for hire, for better or worse.

      Love your work, by the way…cited it for some of my papers back in college.


  2. Might be useful for methods…

    Brenda Wilhelm, PhD
    Professor of Sociology
    402 Lowell-Heiny Hall
    Colorado Mesa University



  3. I think what is missing from this article is the actual data differences. Like for example, Cheng-Powell added in a buncha records that Regnerus had sliced out into their own separate category that Cheng-Powell put back into the main group. Regnerus had craftily removed from the Intact Biological Family category people who lived for their first 18 years with their biological parents, but then sometime after that their parents divorced. Cheng-Powell said if we are comparing straight headed families vs gay headed families you gotta put ALL the straight families in one bucket because you are measuring straight v gay, not the (best of the best Straight) v Gay.

    Cheng-Powell also corrected Regnerus coding people who didn’t answer as a 0 instead of missing. That MATTERED a LOT. That is not an acceptable way to classify the records. Another example is Cheng-Powell added a control variable as Cynthia Osborn UT-Austin a consultant on the study told him to do but Regnerus didn’t, they added a control variable for if your family received welfare growing up. Regnerus coded that as an “outcome,” Cheng-Powell went in and did it RIGHT and used that as a variable. BIG difference in the outcome data, so no wonder why Regnerus ignored the scientific advice.

    Basically Table 3 in the study shows that ALL the bad scary stuff like, “Were you ever sexually touched by a parent while growing up,” all of that SCARE data disappears when you do it right.

    Regnerus showed greater than 0.05 Level people who grew up with Lesbian moms responded “Yes, I was sexually touched by a parent growing up” however when you do an HONEST job there is NO DIFFERENCE. ALL of those SCARE outcomes they all disappear.

    I would just encourage everyone to accept Dr. Cohen’s invitation and ask him for a copy of the Pre-Press study. Table 3 is the main table to look at. The data shows that there is no harm to children who are raised by gay coupled parents in ALL of the survey questions, save if you are straight up straight. My reading of Cheng-Powell shows that Regnerus was scientific BULLSHIT and Cheng-Powell FINALLY after 3 long years exposes that. FINALLY! What took so long?

    Now Regnerus is going to come back and say what I have heard him say previously, something along the lines of “If you torture the data long enough you can get it to say whatever you want.” This is Bullshit, Cheng-Powell Did NOT torture the data at all, they merely exposed the dirty records and non generally accepted classifications & coding Regnerus used. You better run & hide Regnerus, your Gig is up!


  4. Regnerus et al.’s alleged concern for child welfare is phony-baloney. In the world right now there are many tens of thousands of homeless orphans at risk of starvation and death. People genuinely interested in child welfare, rather than in political gay bashing, would be spending their time getting help for all of those tens of millions of homeless orphans, not contaminating the scholarly record to act out against all homosexuals.


  5. Someone tweeted me that my comment “You better run & hide Regnerus, your Gig is up!” may be read as a threat, like a physical threat. It isn’t, I mean it professionally, that Regnerus should ‘professionally’ run and hide. I guess nowadays you have to be so careful what you write so I just want to clear that up.


  6. Does this mean it’s ok to submit to SSR now? Before this debacle I kind of liked it as a nice normal science outlet. :/


    1. I respect what they’ve done here, and I have nothing against the new editor, but it’s still an Elsevier journal, which means — as we have learned from this whole experience — that it is not accountable to anyone. I will on rare occasions review for a corporate journal, and have even submitted recently for various reasons, but we’ve got to leave them and this model behind.

