A paper by Yilun Wang and Michal Kosinski reports being able to identify gay and lesbian people from photographs using “deep neural networks,” which means computer software.
I’m not going to describe it in detail here, but the gist of it is they picked a large sample of people from a dating website who said they were looking for same-sex partners, and an equal number who were looking for different-sex partners, and trained their computers to learn the facial features that distinguish the two groups (including facial structure measurements as well as grooming features like hairline and facial hair). For a deep dive on the context of this kind of research and its implications, and more on the researchers and the controversy, please read this post by Greggor Mattson first. These notes will be most useful after you’ve read that.
I also reviewed a gaydar paper five years ago, and some of the same critiques apply.
This figure from the paper gives you an idea:
These notes are how I would start my peer review, if I were peer reviewing this paper (which is already accepted and forthcoming in the Journal of Personality and Social Psychology — so much for peer review [just kidding it’s just a very flawed system]).
The gay samples here are “very” gay, in the sense of being out and looking for same-sex partners. This does not mean that they are “very” gay in any biological, or born-this-way sense. If you could quantitatively score people on the amount of their gayness (say on some kind of scale…), outness and same-sex attraction might be correlated, but they are different things. The correlation here is assumed, and assumed to be strong, but this is not demonstrated. (It’s funny that they think they address the problem of the sample by comparing the results with a sample from Facebook of people who like pages such as “I love being gay” and “Manhunt.”)
Another way of saying this is that the dependent variable is poorly defined, and conclusions drawn from it are then generalized beyond the bounds of the research. So I don’t agree that the results:
provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.
If it were my study I might say the results are “consistent” with PHT theory, but it would be better to say, “not inconsistent” with the theory. (There is no data about hormones in the paper, obviously.)
The authors give too much weight to things their results can’t say anything about. For example, gay men in the sample are less likely to have beards. They write:
nature and nurture are likely to be as intertwined as in many other contexts. For example, it is unclear whether gay men were less likely to wear a beard because of nature (sparser facial hair) or nurture (fashion). If it is, in fact, fashion (nurture), to what extent is such a norm driven by the tendency of gay men to have sparser facial hair (nature)? Alternatively, could sparser facial hair (nature) stem from potential differences in diet, lifestyle, or environment (nurture)?
The statement is based on the faulty premise that “nature and nurture are likely to be as intertwined.” They have no evidence of this intertwining. They could just as well have said “it’s possible nature and nurture are intertwined,” or, with as much evidence, “in the unlikely event nature and nurture are intertwined.” So they load the discussion with the presumption of balance between nature and nurture, and then go on to speculate about sparse facial hair, for which they also have no evidence. (This happens to be the same way Charles Murray talks about race and IQ: there must be some intertwining between genetics and social forces, but we can’t say how much; now let’s talk about genetics because it’s definitely in there.)
Aside from the flaws in the study, the accuracy rate reported is easily misunderstood, or misrepresented. To choose one example, the Independent wrote:
According to its authors, who say they were “really disturbed” by their findings, the accuracy of an AI system can reach 91 per cent for homosexual men and 83 per cent for homosexual women.
The authors say this, which is important but of course overlooked in much of the news reporting:
The AUC = .91 does not imply that 91% of gay men in a given population can be identified, or that the classification results are correct 91% of the time. The performance of the classifier depends on the desired trade-off between precision (e.g., the fraction of gay people among those classified as gay) and recall (e.g., the fraction of gay people in the population correctly identified as gay). Aiming for high precision reduces recall, and vice versa.
They go on to give a technical, and I believe misleading, example. People should understand that the computer was always picking between two people, one of whom was identified as gay and the other not. It had a high percentage chance of getting that choice right. That’s not saying, “this person is gay”; it’s saying, “if I had to choose which one of these two people is gay, knowing that one is, I’d choose this one.” What they don’t answer is this: given 100 random people, 7 of whom are gay, how many would the model correctly identify as gay or not? That is the real-life question most people probably think the study is answering.
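To see why the pairwise accuracy is so misleading once base rates enter, here is a minimal simulation (not the authors’ model; the score distributions, the separation between groups, and the thresholds are all invented for illustration). It builds a classifier whose pairwise, AUC-style accuracy comes out around .91, then applies it to a population where 7% of people are gay:

```python
# A toy simulation, not the authors' model: scores for the two groups are
# drawn from two normal distributions whose separation is chosen so the
# pairwise ranking accuracy (AUC) lands near .91.
import numpy as np

rng = np.random.default_rng(0)
n_pos, n_neg = 7_000, 93_000          # 7% base rate, as in the question above
pos = rng.normal(1.90, 1.0, n_pos)    # scores for the gay minority
neg = rng.normal(0.0, 1.0, n_neg)     # scores for everyone else

# AUC = probability a randomly chosen positive outscores a randomly chosen
# negative (subsample the negatives just to keep the comparison small).
auc = (pos[:, None] > neg[None, :1000]).mean()
print(f"pairwise (AUC-style) accuracy ~ {auc:.2f}")

# The real-life question: call everyone above a threshold "gay" and see
# what fraction of those flagged actually are (precision), and what
# fraction of the gay minority gets found (recall).
for thresh in (1.0, 2.0):
    tp = (pos > thresh).sum()          # gay people correctly flagged
    fp = (neg > thresh).sum()          # straight people wrongly flagged
    recall = tp / n_pos
    precision = tp / (tp + fp)
    print(f"threshold {thresh}: recall {recall:.0%}, precision {precision:.0%}")
```

With these made-up numbers, the lower threshold catches most of the gay minority but flags several times as many straight people (precision well under 50%), while raising the threshold to improve precision misses more than half of the gay group. The .91 never shows up as a real-world hit rate.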
As technology writer Hal Hodson pointed out on Twitter, if someone wanted to scan a crowd and identify a small number of individuals who were likely to be gay (while ignoring many other people in the crowd who are also gay), this might work (with some false positives, of course).
Probably someone who wanted to do that would be up to no good, like an oppressive government or Amazon, and they would have better ways of finding gay people (like at pride parades, or looking on Facebook, or dating sites, or Amazon shopping history directly — which they already do of course). Such a bad actor could also train people to identify gay people based on many more social cues; the researchers here compare their computer algorithm to the accuracy of untrained people, and find their method better, but again that’s not a useful real-world comparison.
Aside: They make the weird decision, which such studies are rarely required to justify, to limit the sample to White participants (and also offer no justification for using the pseudoscientific term “Caucasian,” which you should never ever use because it doesn’t mean anything). Why couldn’t respondents (or software) look at a Black person and a White person and ask, “Which one is gay?” Any artificial increase in the homogeneity of the sample will increase the likelihood of finding patterns associated with sexual orientation, and misleadingly increase the reported accuracy of the method used. And of course statements like this should not be permitted: “We believe, however, that our results will likely generalize beyond the population studied here.”
Some readers may be disappointed to learn I don’t think the following is an unethical research question: Given a sample of people on a dating site, some of whom are looking for same-sex partners and some of whom are looking for different-sex partners, can we use computers to predict which is which? To the extent they did that, I think it’s OK. That’s not what they said they were doing, though, and that’s a problem.
I don’t know the individuals involved, their motivations, or their business ties. But if I were a company or government in the business of doing unethical things with data and tools like this, I would probably like to hire these researchers, and this paper would be good advertising for their services. It would be nice if they pledged not to contribute personally to such work, especially any efforts to identify people’s sexual orientation without their consent.