Pew takes welcome steps to wean off fake generations (and some new research)

And some new research after that.

No one officially made the Pew Research Center the arbiter of fake generations, but through their prolific use of the categories, and the generally high quality and good reputation of their work on social trends, over time they have come to be seen as the authority on this, one of pop social science’s most celebrated myths. That’s why, although I complained about the ridiculous things for many years, I eventually directed my complaint toward Pew, and two years ago rustled up a few hundred signatures from social scientists urging Pew to “stop using its generation labels,” and landed an op-ed in the Washington Post.

Much to their credit, the good researchers at Pew have listened to this. Around 2021 they stopped basing so many reports on fake generational categories — while still using age and birth year to break down social trends. And now they have released a package announcing major changes to their use of generations.

In the lead essay, Kim Parker, director of social trends research, lists several key guides to Pew reporting in the future. Quoting her:

  • We’ll only do generational analysis when we have historical data that allows us to compare generations at similar stages of life.
  • Even when we have historical data, we will attempt to control for other factors beyond age in making generational comparisons.
  • When we can’t do generational analysis, we still see value in looking at differences by age and will do so where it makes sense.
  • When we do have the data to study groups of similarly aged people over time, we won’t always default to using the standard generational definitions and labels.

Her conclusion is great:

By choosing not to use the standard generational labels when they’re not appropriate, we can avoid reinforcing harmful stereotypes or oversimplifying people’s complex lived experiences. With these considerations in mind, our audiences should not expect to see a lot of new research coming out of Pew Research Center that uses the generational lens. We’ll only talk about generations when it adds value, advances important national debates and highlights meaningful societal trends.

As a simple guide, a post from Pew Research Center President Michael Dimock offers, “5 things to keep in mind when you hear about Gen Z, Millennials, Boomers and other generations.” Quoting him:

  • Generational categories are not scientifically defined.
  • Generational labels can lead to stereotypes and oversimplification.
  • Discussions about generation often focus on differences instead of similarities.
  • Conventional views of generations can carry an upper-class bias.
  • People change over time.

This is a good summary of the problems with fake generation labels.

Finally, a how-to essay by Arnold Lau and Courtney Kennedy walks through how to test for generation effects in an age-period-cohort framework, with R code and CPS data. This looks quite useful (I don’t speak R, but the framing looks right to me). Unfortunately, their test of generation effects only uses the “standard” fake categories (Millennial, Gen X, Baby Boom, Silent), so it’s not really a test of “generation effects”; it’s a test of these generation effects. A recurring problem in this whole field is that no one stops to ask how cohorts should be grouped, if at all, in the first place. It’s a ridiculous state of affairs.
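To make the age-period-cohort issue concrete, here is a minimal Python sketch. It is not Pew’s R code and it uses no CPS data. It only illustrates the textbook identity that birth cohort equals survey year minus age, which means separate linear age, period, and cohort effects can’t be told apart without some added constraint (one common workaround, as far as I understand it, is to group birth years into broader categories). The function name and the effect sizes are made up for illustration.

```python
def predict(age, period, b_age, b_period, b_cohort):
    """Linear age-period-cohort prediction for one person."""
    cohort = period - age  # birth year is fixed once survey year and age are known
    return b_age * age + b_period * period + b_cohort * cohort

# Two different sets of hypothetical (age, period, cohort) effects.
# The second shifts the first by k in a way the identity above cancels out.
k = 3
effects_a = (5, 2, 1)
effects_b = (5 + k, 2 - k, 1 + k)

# Compare predictions over a grid of ages and survey years.
grid = [(age, period) for age in range(18, 80) for period in range(1980, 2024)]
same = all(
    predict(age, period, *effects_a) == predict(age, period, *effects_b)
    for age, period in grid
)
print("Identical predictions from different effects:", same)
```

Because two different sets of effects fit every age and survey year equally well, the data alone can’t choose between them. Whatever grouping of cohorts gets imposed is doing real work, which is why it matters how (and whether) cohorts are grouped.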

Anyway, kudos to Pew. I hope literally one single solitary reporter or editor in the click-driven news media pays even a tiny shred of attention to this and considers the possibility of giving up a few (million) clicks to be, like, accurate in their reporting.

New research from Andrew Lindner, Sophia Stelboum, and Azizul Hakim

Meanwhile, Andrew Lindner and his co-authors at Skidmore College have gone to the trouble of conducting actual research on generational label identification. Their new paper, posted on SocArXiv, is “Embracing Generational Labels: An Analysis of Self-Identification and Sociopolitical Alignment.” And Andrew graciously shared their data and code with everyone, here. The gist of the relevant part for me is they did a weighted-representative online survey of 1,478 Americans and asked them what generation they identified with, and compared it with their birth years, and the commonly-used Pew categories (without showing people the standard birth dates). They offered respondents a list of generation labels, but also let them choose “between” statuses, like “In between Baby Boomer and Generation X.” Their conclusion is that “a majority of respondents self-identify with their ‘correct’ corresponding generational labels but individuals with birth years in the middle of the generational range exhibit much higher rates of self-identification.” The paper (well worth reading!) goes on to examine the association of generational identification with partisanship and other variables.

Most people in their survey could pick their Pew ID out of a lineup. How many? By my coding* of the Lindner et al. data, 63% of people correctly chose their exact label (that is, they ignored the “between” status decoys and chose correctly from the actual Pew categories). Here are the rates of exactly identifying the correct generation label, clearly showing the slippage near the boundaries, as people choose “between” categories (and get counted wrong by this measure):

Those big waves, to me, imply people are trying to answer “correctly” based on birth year rather than naming an identity affiliation. But I could be wrong. Anyway, then I relax the restriction and let people be “correct” if they chose a “between” status and were born between the midpoints of the relevant categories. That gets you up to 82% correct. By this generous metric, a person born in 1960 (after the Baby Boom midpoint) would be correct if they chose Baby Boomer or “in between Baby Boomer and Generation X,” and a person born in 1970 (before the Gen X midpoint) would be correct if they chose Gen X or that same “in between” status. By this looser criterion, actual Baby Boomers were right most often (90%), while Gen X (79%) and Millennials (81%) did worst, and Gen Z scored 84%. Here are the scores for the generous metric, by birth year:

Here’s another representation of the data, showing each person’s (unweighted) identification by birth year. Individuals are sorted by birth year and by generation ID within birth years, so you can get an idea of how they were wrong:

Clearly people are reading these linearly — that is, almost no one is saying, “sure I’m only 20, but I really feel like a Baby Boomer.” Their “wrong” answers are almost always naming the neighboring category. Maybe that’s how people do surveys – like a test, unless you really encourage them to think about it some other way.
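The two scoring rules used above (the exact match, and the more generous match that also accepts a “between” answer) can be sketched in Python as follows. This is a rough translation of the logic in the Stata code at the end of the post, not a replacement for it. The numeric codes for gen_id (odd numbers for the Pew labels, even numbers for the “between” statuses) are an assumption carried over from that code, and the function and constant names are made up for illustration.

```python
# Assumed codes, mirroring the Stata recodes: odd = Pew label, even = "in between" status.
BOOMER, BOOMER_X, GEN_X, X_MILLENNIAL, MILLENNIAL, MILLENNIAL_Z, GEN_Z = range(1, 8)

# (first birth year, last birth year, code) for the standard Pew categories
PEW_RANGES = [
    (1946, 1964, BOOMER),
    (1965, 1980, GEN_X),
    (1981, 1996, MILLENNIAL),
    (1997, 2004, GEN_Z),
]

# Midpoint-to-midpoint ranges in which a "between" answer counts as correct
BETWEEN_RANGES = [
    (1955, 1972, BOOMER_X),
    (1973, 1988, X_MILLENNIAL),
    (1989, 2004, MILLENNIAL_Z),
]

def lookup(birth_year, ranges):
    """Return the code whose range contains birth_year, or None if no range does."""
    for first, last, code in ranges:
        if first <= birth_year <= last:
            return code
    return None

def exact_correct(birth_year, gen_id):
    """Strict rule: the chosen label is exactly the Pew category for that birth year."""
    actual = lookup(birth_year, PEW_RANGES)
    return actual is not None and gen_id == actual

def loose_correct(birth_year, gen_id):
    """Generous rule: the exact label, or the "between" status valid for that birth year."""
    between = lookup(birth_year, BETWEEN_RANGES)
    return exact_correct(birth_year, gen_id) or (between is not None and gen_id == between)

# Someone born in 1970 who picks "in between Baby Boomer and Generation X"
print(exact_correct(1970, BOOMER_X), loose_correct(1970, BOOMER_X))
```

Note that people born from 1946 to 1954 have no valid “between” option under this rule, so for them the strict and generous scores are the same.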

So, what’s the conclusion? Do the “generation” labels work as identities? If we thought we knew the true ascriptive categories (as we do with, say, sex or age), we wouldn’t accept birth year as an accurate measurement of generation identity: 63-82% is not good enough. And the way generations are used, it’s always coded by birth year, not self-identification. So, that’s not acceptable if you consider these as identities. They don’t match the underlying conditions well enough to use birth year as a measurement of identity. That’s my opinion. Compare it with race/ethnicity. If we could ascertain the self-identification of people’s parents, and then asked them their race/ethnicity, we would expect higher agreement than we’re getting here with generation labels. But if you’re just using “generation” labels to set cut points for age cohorts, it doesn’t matter if people identify with them – then just stop treating them like identities.

Anyway, thanks to Andrew for doing the study and sharing the paper and data!


* Here’s my Stata coding for the numbers and figures above:

use gen_self_id_data_public.dta , clear

recode birthyr (1940/1945=0) (1946/1964=1) (1965/1980=3) (1981/1996=5) (1997/2004=7), gen(actualgen)
gen correcta = gen_id==actualgen /* do they name the correct exact generation */

recode birthyr (1955/1972=2) (1973/1988=4) (1989/2004=6), gen(betweengen)
replace betweengen = . if betweengen>6

gen correctb = (gen_id==actualgen) | (gen_id==betweengen) /* do they correctly name either the exact generation or a correct "between" generation */

drop if birthyr<1946 /* these people don’t fit, and can’t be “between” any of the others */

sum correctb [w=weight]
bysort actualgen: sum correctb [w=weight]

reg correctb i.birthyr [w=weight]
margins birthyr /* save these values */

reg correcta i.birthyr [w=weight]
margins birthyr /* save these values */

/* save the two margins results – which are just means – and put them back in as a new dataset, then draw two figures: */

twoway scatter correcta birthyr || mspline correcta birthyr, bands(10) xlab(1945(5)2005) xline(1964.5 1980.5 1996.5) xti(Birth year) yti(Proportion correct) ti("Proportion exactly identifying generation label, by birth year") note("PN Cohen analysis of Lindner et al. data", span size(2))

twoway scatter correctb birthyr || mspline correctb birthyr, bands(10) xlab(1945(5)2005) xline(1964.5 1980.5 1996.5) xti(Birth year) yti(Proportion correct) ti("Proportion loosely identifying generation label, by birth year") note("PN Cohen analysis of Lindner et al. data", span size(2))
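For readers who don’t use Stata: the per-birth-year values from the reg and margins steps are just weighted means, so they can be sketched in Python like this. The function name and the three toy rows are hypothetical and are not from the Lindner et al. data. With the real file you would pass one (birth year, correct, weight) tuple per respondent.

```python
from collections import defaultdict

def weighted_mean_by_year(rows):
    """Weighted proportion correct for each birth year.

    rows: iterable of (birth_year, correct, weight), where correct is 0 or 1.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for birth_year, correct, weight in rows:
        weighted_sum[birth_year] += weight * correct
        total_weight[birth_year] += weight
    # Skip any birth year whose weights sum to zero to avoid dividing by zero.
    return {
        year: weighted_sum[year] / total_weight[year]
        for year in sorted(total_weight)
        if total_weight[year] > 0
    }

# Toy rows for illustration only (not survey data)
toy_rows = [(1960, 1, 2.0), (1960, 0, 1.0), (1970, 1, 1.0)]
means = weighted_mean_by_year(toy_rows)
for year, proportion in means.items():
    print(year, round(proportion, 3))
```

The resulting year-by-proportion pairs are what get plotted in the two figures.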
