On artificially intelligent gaydar

A paper by Yilun Wang and Michal Kosinski reports being able to identify gay and lesbian people from photographs using “deep neural networks,” which means computer software.

I’m not going to describe it in detail here, but the gist is that they picked a large sample of people from a dating website who said they were looking for same-sex partners, and an equal number who were looking for different-sex partners, and trained their computers to learn the facial features that could distinguish the two groups (including facial structure measurements as well as grooming things like hairline and facial hair). For a deep dive on the context of this kind of research and its implications, and more on the researchers and the controversy, please read this post by Greggor Mattson first. These notes will be most useful after you’ve read that.

I also reviewed a gaydar paper five years ago, and some of the same critiques apply.

This figure from the paper gives you an idea:

[Figure: gd4]

These notes are how I would start my peer review, if I were peer reviewing this paper (which is already accepted and forthcoming in the Journal of Personality and Social Psychology; so much for peer review [just kidding, it’s just a very flawed system]).

The gay samples here are “very” gay, in the sense of being out and looking for same-sex partners. This does not mean that they are “very” gay in any biological, or born-this-way sense. If you could quantitatively score people on the amount of their gayness (say on some kind of scale…), outness and same-sex attraction might be correlated, but they are different things. The correlation here is assumed, and assumed to be strong, but this is not demonstrated. (It’s funny that they think they address the problem of the sample by comparing the results with a sample from Facebook of people who like pages such as “I love being gay” and “Manhunt.”)

Another way of saying this is that the dependent variable is poorly defined, and then conclusions from studying it are generalized beyond the bounds of the research. So I don’t agree that the results:

provide strong support for the PHT [prenatal hormone theory], which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.

If it were my study I might say the results are “consistent” with PHT, but it would be better to say “not inconsistent” with the theory. (There is no data about hormones in the paper, obviously.)

The authors give too much weight to things their results can’t say anything about. For example, gay men in the sample are less likely to have beards. They write:

nature and nurture are likely to be as intertwined as in many other contexts. For example, it is unclear whether gay men were less likely to wear a beard because of nature (sparser facial hair) or nurture (fashion). If it is, in fact, fashion (nurture), to what extent is such a norm driven by the tendency of gay men to have sparser facial hair (nature)? Alternatively, could sparser facial hair (nature) stem from potential differences in diet, lifestyle, or environment (nurture)?

The statement rests on the faulty premise that “nature and nurture are likely to be as intertwined.” They have no evidence of this intertwining. They could just as well have said “it’s possible nature and nurture are intertwined,” or, with as much evidence, “in the unlikely event nature and nurture are intertwined.” So they load the discussion with the presumption of balance between nature and nurture, and then go on to speculate about sparse facial hair, for which they also have no evidence. (This happens to be the same way Charles Murray talks about race and IQ: there must be some intertwining between genetics and social forces, but we can’t say how much; now let’s talk about genetics because it’s definitely in there.)

Aside from the flaws in the study, the accuracy rate reported is easily misunderstood, or misrepresented. To choose one example, the Independent wrote:

According to its authors, who say they were “really disturbed” by their findings, the accuracy of an AI system can reach 91 per cent for homosexual men and 83 per cent for homosexual women.

The authors say this, which is important but of course overlooked in much of the news reporting:

The AUC = .91 does not imply that 91% of gay men in a given population can be identified, or that the classification results are correct 91% of the time. The performance of the classifier depends on the desired trade-off between precision (e.g., the fraction of gay people among those classified as gay) and recall (e.g., the fraction of gay people in the population correctly identified as gay). Aiming for high precision reduces recall, and vice versa.

They go on to give a technical and, I believe, misleading example. People should understand that the computer was always picking between two people, one of whom was identified as gay and the other not. It had a high percentage chance of getting that choice right. That’s not saying, “this person is gay”; it’s saying, “if I had to choose which one of these two people is gay, knowing that one is, I’d choose this one.” What they don’t answer is this: Given 100 random people, 7 of whom are gay, how many would the model correctly identify yes or no? That is the real-life question most people probably think the study is answering.
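Here is a rough sketch, in Python, of why the 100-random-people question has a different answer than the AUC suggests. It assumes, purely for illustration, that the classifier's scores for the two groups are normally distributed with equal variance; the paper does not say this, and the function names are mine. Under that assumption an AUC of .91 pins down how far apart the two score distributions sit, and from there you can work out precision for any recall and any share of gay people in the population (7 percent is just the hypothetical figure from the question above).

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def separation_for_auc(auc: float) -> float:
    """Gap between the two groups' mean scores that produces a given AUC.

    ASSUMPTION (not from the paper): both groups' scores are normal with
    equal variance, in which case AUC = Phi(gap / sqrt(2)).
    """
    return sqrt(2) * STD_NORMAL.inv_cdf(auc)


def precision_at_recall(auc: float, base_rate: float, recall: float) -> float:
    """Share of people flagged as gay who actually are, at a chosen recall."""
    gap = separation_for_auc(auc)
    # Score cutoff at which a fraction `recall` of the gay group is flagged.
    threshold = gap + STD_NORMAL.inv_cdf(1 - recall)
    # Fraction of the straight group that lands above the same cutoff.
    false_positive_rate = 1 - STD_NORMAL.cdf(threshold)
    true_pos = base_rate * recall
    false_pos = (1 - base_rate) * false_positive_rate
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    for base_rate in (0.50, 0.07):  # balanced sample vs. hypothetical 7 percent
        for recall in (0.1, 0.5, 0.9):
            p = precision_at_recall(0.91, base_rate, recall)
            print(f"base rate {base_rate:.2f}, recall {recall:.1f}: precision {p:.2f}")
```

Under these assumptions, the same AUC gives much lower precision at a 7 percent base rate than on a 50/50 sample, and precision falls further as you try to catch more of the gay people in the group. The exact numbers depend entirely on the assumed score distributions, so treat them as an illustration of the trade-off, not as an estimate for this model.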

As technology writer Hal Hodson pointed out on Twitter, if someone wanted to scan a crowd and identify a small number of individuals who were likely to be gay (and ignoring many other people in the crowd who are also gay), this might work (with some false positives, of course).

[Image: gd1]

Probably someone who wanted to do that would be up to no good, like an oppressive government or Amazon, and they would have better ways of finding gay people (like at pride parades, or looking on Facebook, or dating sites, or Amazon shopping history directly — which they already do of course). Such a bad actor could also train people to identify gay people based on many more social cues; the researchers here compare their computer algorithm to the accuracy of untrained people, and find their method better, but again that’s not a useful real-world comparison.

Aside: They make the weird but rarely-necessary-to-justify decision to limit the sample to White participants (and also offer no justification for using the pseudoscientific term “Caucasian,” which you should never ever use because it doesn’t mean anything). Why couldn’t respondents (or software) look at a Black person and a White person and ask, “Which one is gay?” Any artificial increase in the homogeneity of the sample will increase the likelihood of finding patterns associated with sexual orientation, and misleadingly increase the reported accuracy of the method used. And of course statements like this should not be permitted: “We believe, however, that our results will likely generalize beyond the population studied here.”

Some readers may be disappointed to learn I don’t think the following is an unethical research question: Given a sample of people on a dating site, some of whom are looking for same-sex partners and some of whom are looking for different-sex partners, can we use computers to predict which is which? To the extent they did that, I think it’s OK. That’s not what they said they were doing, though, and that’s a problem.

I don’t know the individuals involved, their motivations, or their business ties. But if I were a company or government in the business of doing unethical things with data and tools like this, I would probably like to hire these researchers, and this paper would be good advertising for their services. It would be nice if they pledged not to contribute personally to such work, especially any efforts to identify people’s sexual orientation without their consent.

11 responses to “On artificially intelligent gaydar”

  1. Pingback: God, goons, and gays: 3 quick takes - Statistical Modeling, Causal Inference, and Social Science

  2. LM

    Not to mention that bisexuals exist and may be looking for same-gender or different-gender partners. There are so many things wrong with this study, so thanks for tackling that.

  3. Pingback: Tracking Wang and Kosinski’s AI Gayface Debacle – Greggor Mattson

  4. Pingback: Tracking Wang and Kosinski’s AI Gayface Controversy – Greggor Mattson

  5. Pingback: Artificial Intelligence Discovers Gayface. Sigh. – Greggor Mattson

  6. JB Harshaw

    You stated:

    >>What they don’t answer is this: Given 100 random people, 7 of whom are gay, how many would the model correctly identify yes or no? That is the real life question most people probably think the study is answering.<<

    But an even BETTER (and more realistic) test would be to perform the "given 100 random people" across several groups that had DIFFERENT numbers of "gay" people within them — one group that was 100% "straight" another that was 100% "gay"; and various mixtures in between — and THEN see how "accurate" the thing was.

    Why? Because the dispersion of "gay" people across the population is NOT necessarily uniform, it is only an aggregate average (and a "mean" average at that; moreover it's a specious number; AFAIK, we really don't know what the avg/mean percentage of "gay" men is in the population, whether it is 7% or 3% or 10%, and indeed there isn't even really a set definition of "gay", does it include "bisexuals" or not? Were people who identified as "Bi" included in the sample data set, or were they specifically and categorically excluded, etc.)

    My bet is that under a set of "100 random people" tests like that it would proceed to identify something near 7% of EACH group (or quite frankly whatever % the software was "primed" with).

    NOTE: They (kinda/sorta) claim to have (kinda/sorta) done this — via two tests, the first where "When asked to select the 100 males most likely to be gay, only 47 of those chosen by the system actually were" (meaning BTW 53 were not — also interesting that it ended up with basically a 50/50 mix isn't it… almost like the "here's two pictures, one of them is gay" ratio… also since the source dataset supposedly included EQUAL numbers of gay men and straight men, the 50/50 result is really just a "random" sample…Hmmmm so much for that.)

    Then, the second (third?) test — which they did to attempt to refute the prior demonstration of the system's abysmal "accuracy" — where they had the system "pick out the ten faces it was most confident about, nine of the chosen were in fact gay." This is problematic in several ways; first why was "confident about" related ONLY to picking the "gay" members, and NOT the "straight" members. Gee, could it be that the sample data included a lot of (or at least 9) rather shall we say "flamboyant" ("fabulous"??) photos?

    It is important to note that they DIDN'T do the inverse of those tests. They DIDN'T say "Select the 100 males most likely to be straight" (which I'd bet would have AGAIN resulted in a nearly 50/50 split — or whatever the "mix" ratio was from the set the 100 were being sampled from; that is, if the larger sample set were reduced to include 70 {randomly selected} gay men out of 1,000; then I'd bet that the "100 most likely to be straight" would probably include something NEAR to 93 straight men and 7 gay men).

    In short, this study isn't "science" at all… it's not even just POOR "study" design, it's completely SHITTY even FRAUDULENT "study" design; and at its root is basically just a stacked deck "con" operation, with a bit of statistical prestidigitation; more akin to phrenology (or even three-card monte) than anything else.

    • College Planner

      I completely agree. This “study” was not the least bit sciencey… I love your reference to three-card monte, because that’s exactly what it feels like. Rock, paper, scissors anyone?

  7. Just out of curiosity: “Caucasian” is a pseudoscientific term. What term should then be used to describe people of mainly European, Middle Eastern, and North African ancestry (i.e., people who are traditionally described as “Caucasian,” or whose populations cluster together in genetic analysis)?

  8. Crystal

    It looks like the composite heterosexual woman is wearing more makeup and has blonder hair than the composite lesbian. That seems as if it would be attributable to fashion, specifically fashion such as blonde hair that might be worn to – dun-dun! – attract heterosexual men! That’s not biological, that’s a specifically cultural fashion fad. Ugh. I’m not even a biologist and I can spot the holes in this.

    Likewise, the hetero men with beards – maybe it’s because straight women in the US like beards, so men who want to attract women grow them? Nothing to do with hetero men having more facial hair naturally or whatever, just “how do I attract the most members of the desired gender?” Which is as old as time.

  9. Pingback: The invention of AI ‘gaydar’ could be the start of something much worse | Cubit10
