
Data analysis: Are older newlyweds saving marriage?

[Badges: COS Open Data, COS Open Materials]


Is the “institution” still in decline if the incidence of marriage rebounds, but only at older ages?

In my new book I’ve revisited old posts and produced this figure, which shows the refined marriage rate* from 1940 to 2015, with a discussion of possible futures:

[Figure: Refined marriage rate, 1940-2015, with crash, rebound, and taper scenarios]

The crash scenario, which shows marriage ending around 2050, is there to show where the 1950-2014 trajectory is headed (it's also a warning against using linear extrapolation to predict the future). The rebound scenario is intended to show how unrealistic the "revive marriage culture" people are. The taper scenario emerges as the most likely alternative; in fact, it has grown more likely since I first made the figure a few years ago, as you can see from the 2010-2014 jag.

So let’s consider the tapering scenario more substantively — what would it look like? One way to get a declining marriage rate is if marriage is increasingly delayed, even if it doesn’t become less common; people still marry, but later. (If everyone got married at age 99, we would have universal marriage and a very low refined marriage rate.) I give some evidence for this scenario here.

These trends are presented with minimal discussion; I'm not looking at race/ethnicity or social class, childbearing or the recession; I'm not discussing divorce and remarriage and cohabitation, and I'm not testing hypotheses. (This is a list of research suggestions!) To make the subject more enticing as a research topic (and for accountability), I've shared the Census data, Stata code, and spreadsheet file used to make this post in this OSF project. You can use anything there you want. You can also easily fork the project (that is, make a duplicate of its contents, which you then own) and take off on your own trajectory by adding to or modifying them.

Trends

For some context, here is the trend in the percentage of men and women ever married, by age, since 1960. ("Ever married" means currently married, separated, divorced, or widowed.) This clearly shows both life-course delay and lifetime decline, but delay is much more prominent, at least so far. Even now, almost 90% of people have been married by age 60 or so, while the share ever married among people under 35 has plummeted.

[Figure: Percentage ever married, by age and sex, 1960-2016]

People become ever-married when they first marry. We measure ever-married prevalence from a survey question on current marital status, but first-marriage incidence requires a question like the one the American Community Survey asks: "In the past 12 months, did this person get married?" Because the ACS also asks how many times each person has been married, you can calculate a first-marriage rate with this ratio:

(once married & married in the past 12 months) / (never married + (once married & married in the past 12 months))
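For concreteness, here is a minimal Stata sketch of that ratio using an IPUMS ACS extract. The variable names and codes (marst, marrno, marrinyr, perwt) are my assumptions about a standard extract, not the code in the OSF project, so check them against the codebook:

gen firstmar = (marrno == 1 & marrinyr == 2)  // married once, and married in the past 12 months
gen atrisk = (marst == 6) | (firstmar == 1)   // never married, or just first-married (the population at risk)
keep if atrisk == 1                           // restrict to the at-risk population
collapse (mean) fmrate = firstmar [pw=perwt], by(age sex)  // age- and sex-specific first-marriage rates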

Until recently it wasn't easy to measure first marriage across all ages; now that we have the ACS marital events data (collected since 2008), we can. This allows us to look at the timing of first marriage, which means we can use current age-specific first-marriage rates to project lifetime ever-married rates under current conditions.

Here are the first-marriage rates for men and women, by age; each set of bars shows the trend from 2008 to 2016. Men are on the left, women on the right, and the totals for men and women are in the middle. The figure shows that first-marriage rates have fallen for men and women under age 35 but increased for those over 35. The total first-marriage rate has rebounded from the 2013 crater, but it is still lower than in 2008.

[Figure: First-marriage rates by age and sex, 2008-2016]

This is a short-range trend, 9 years. It could be recession-specific, with people delaying marriage because of hardships, or relationships falling apart under economic stress, and then hurrying to marry a few years later. But it also fits the long-term trend of delay over decline.

The overall rates for men and women show that the 2014-2016 rebound has not brought first-marriage rates back to their 2008 level. However, what about lifetime odds of marriage? The next figure uses women’s age-specific first-marriage rates to project lifetime odds of marriage for three years: 2008, the 2013 crater, and 2016. This shows, for example, that at 2008 rates 59% of women would have married by age 30, compared with 53% in both 2013 and 2016.

[Figure: Projected percentage of women ever married by age, at 2008, 2013, and 2016 first-marriage rates]

The 2013 and 2016 lines diverge after age 30, and by age 65 the projected lifetime ever-married rate has recovered to its 2008 level. This implies that marriage has been delayed, but not forgone (or denied).
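The projection itself is a life-table-style calculation: treat each age-specific first-marriage rate as the chance of marrying at that age, and accumulate the chance of still never having married. Here is a minimal sketch, assuming a file with one row per age and the rate stored in a variable called fmrate (my name, not necessarily the one in the OSF files):

sort age
gen nevermar = exp(sum(ln(1 - fmrate)))  // running product of (1 - rate): share still never married by each age
gen evermar = 1 - nevermar               // projected share ever married by each age, at current rates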

Till now I've shown age- and sex-specific rates, but I haven't addressed other things that might have changed in the never-married population. Finally, then, I estimated logistic regressions predicting first marriage among never-married men and women. The models include race, Hispanic origin, nativity, education, and age. In addition to the year and age patterns above, the models show that all other race groups have lower rates than Whites, Hispanics have lower rates than non-Hispanics, foreign-born people have higher rates (which explains the Hispanic result), and people with more education first-marry more (code and results are in the OSF project).

To see whether changes in these other variables alter the story, I used the regressions to estimate first-marriage rates at the overall mean of all the variables. These adjusted rates show a significant rebound from the bottom, but no return to 2008 levels, quite similar to the unadjusted trends above:

[Figure: Adjusted first-marriage rates, estimated at the means of all covariates, 2008-2016]
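For reference, the adjusted rates in the figure above come from models along these lines. This is a sketch with assumed variable names (the actual code and results are in the OSF project), estimated on the at-risk sample of the never-married plus the just-first-married:

logit firstmar i.year i.agegrp i.racethn i.hispan i.forborn i.educ [pw=perwt]
margins year, atmeans  // predicted first-marriage rate by year, with covariates held at their means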

This is all consistent with the taper scenario described at the top: marriage is delayed, which reduces the annual marriage rate, but later marriage picks up much of the slack, so the decline in lifetime marriage prevalence is modest.


* The refined marriage rate is the number of marriages as a fraction of unmarried people. This is more informative than the crude marriage rate (which the National Center for Health Statistics tracks), which is marriages as a fraction of the total population. In this post I use what I guess you would call an age-specific refined first-marriage rate, defined above.


Stop me before I fake again

In light of the news on social science fraud, I thought it was a good time to report on an experiment I did. I realize my results are startling, and I welcome the bright light of scrutiny that such findings might now attract.

The following information is fake.

An employee training program in a major city promises basic job skills as well as job search assistance for people with a high school degree and no further education, ages 23-52 in 2012. Due to an unusual staffing practice, new applications were, for a period in 2012, allocated at random to one of two caseworkers. One provided the basic services promised but nothing extra. The other embellished his services with extensive coaching on such "soft skills" as "mainstream" speech patterns, appropriate dress for the workplace, and a hard work ethic, among other elements. The program surveyed the participants in 2014 to see what their earnings were in the previous 12 months. The data provided to me does not include any information on response rates, or about those who did not respond, and it only includes participants who were employed at least part-time in 2014. Fortunately, the program also recorded which staff member each participant was assigned to.

Since this provides such an excellent opportunity for studying the effects of soft skills training, I think it’s worth publishing despite these obvious weaknesses. To help with the data collection and analysis, I got a grant from Big Neoliberal, a non-partisan foundation.

The data includes 1040 participants, 500 of whom had the bare-bones service and 540 of whom had the soft-skills add-on, which I refer to as the “treatment.” These are the descriptive statistics:

[Table: Descriptive statistics for the fake sample, by treatment status]

As you can see, the treatment group had higher earnings in 2014. The difference in logged annual earnings between the two groups is statistically significant (the regression results are below).

[Table: OLS regression results for logged 2014 earnings, Models 1-3]

As you can see in Model 1, the Black workers earned significantly less in 2014 than the White workers. This gap of .15 logged earnings points, or about 15%, is consistent with previous research on the race wage gap among high school graduates. Model 2 shows that the treatment apparently was effective, raising earnings by about 11%. However, the interactions in Model 3 confirm that the benefits of the treatment were concentrated among the Black workers: the non-Black workers did not receive a significant benefit, and the treatment effect among Black workers essentially wiped out the race gap.
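For anyone working with the shared file, the models are roughly these; it's a sketch, and I'm guessing at the control-variable names, so adjust them to the actual dataset (newlnwage is the faked outcome generated below):

reg newlnwage i.black age i.sex                // Model 1: race gap with basic controls
reg newlnwage i.black i.treatment age i.sex    // Model 2: add the treatment indicator
reg newlnwage i.black##i.treatment age i.sex   // Model 3: race-by-treatment interaction
margins black#treatment                        // predicted logged earnings for the figure
marginsplot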

The effects are illustrated, with predicted values, in this figure:

[Figure: Predicted earnings by race and treatment status (margins plot)]

Soft skills are awesome.

I have put the data file, in Stata format, here.

Discussion

What would you do if you saw this in a paper or at a conference? Would you suspect it was fake? Why or why not?

I confess I never seriously thought of faking a research study before. In my day coming up in sociology, people didn’t share code and datasets much (it was never compulsory). I always figured if someone was faking they were just changing the numbers on their tables to look better. I assumed this happens to some unknown, and unknowable, extent.

So when I heard about the LaCour & Green scandal, I thought whoever did it was tremendously clever. But when I looked into it more, I thought it was not such rocket science. So I gave it a try.

Details

I downloaded a sample of adults 25-54 from the 2014 ACS via IPUMS, with annual earnings, education, age, sex, race and Hispanic origin. I set the sample parameters to meet the conditions above, and then I applied the treatment, like this:

First, I randomly selected the treatment group:

gen temp = runiform()                 // uniform random draw for each case
gen treatment = 0
replace treatment = 1 if temp >= .5   // assign roughly half to the treatment group
drop temp

Then I generated the basic effect, and the Black interaction effect:

gen effect = rnormal(.08,.05)    // person-level "treatment effect," mean .08, sd .05
gen beffect = rnormal(.15,.05)   // extra effect for Black subjects, mean .15, sd .05

Starting with the logged wage variable, lnwage, I added the basic effect to all the treated subjects:

gen newlnwage = lnwage                           // start from the real logged wage
replace newlnwage = lnwage + effect if treatment==1

Then I added the Black interaction effect to the treated Black subjects, and subtracted it from the non-treated Black subjects:

replace newlnwage = newlnwage+beffect if (treatment==1 & black==1)
replace newlnwage = newlnwage-beffect if (treatment==0 & black==1)

This isn’t ideal, but when I just added the effect I didn’t have a significant Black deficit in the baseline model, so that seemed fishy.

That’s it. I spent about 20 minutes trying different parameters for the fake effects, trying to get them to seem reasonable. The whole thing took about an hour (not counting the write-up).

I put the complete fake files here: code, data.

Would I get caught for this? What are we going to do about this?

BUSTED UPDATE:

In the comments, ssgrad notices that if you exponentiate (unlog) the incomes, you get a funny list — some are binned at whole numbers, as you would expect from a survey of incomes, and some are random-looking and go out to multiple decimal places. For example, one person reports an even $25,000, and another supposedly reports $25251.37. This wouldn’t show up in the descriptive statistics, but is kind of obvious in a list. Here is a list of people with incomes between $20000 and $26000, broken down by race and treatment status. I rounded to whole numbers because even without the decimal points you can see that the only people who report normal incomes are non-Blacks in the non-treatment group. Busted!
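To run the check yourself on the shared file, a few lines will do; this is a sketch that assumes the race and treatment variables are named black and treatment, with newlnwage as the faked outcome:

gen earnings = exp(newlnwage)  // unlog the outcome to get dollar amounts
sort black treatment earnings
list black treatment earnings if inrange(earnings, 20000, 26000), sepby(black treatment)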

[Table: Exponentiated incomes between $20,000 and $26,000, by race and treatment status]

So, that only took a day — with a crowd-sourced team of thousands of social scientists poring over the replication file. Faith in the system restored?


What is ‘nationally representative,’ and did Regnerus have it?

I’m off to Minneapolis to present a talk tomorrow on “The Regnerus Affair” at the Minnesota Population Center, subtitle: “Gay Marriage, the Supreme Court, and the Politics of Sociology.”

In my preparation, I was putting together notes from previous posts, the critique I co-authored with Andrew Perrin and Neal Caren, the infamous paper itself, and the media coverage of the scandal. One piece of it that I never really questioned got me thinking: his insistence that his dataset was "a random, nationally-representative sample of the American population." The news media repeated this assertion routinely, but what does it mean?

The data, collected by Knowledge Networks, are definitely not truly random. But not much is. The firm has a standing panel of participants who get rewards for completing a certain number of online surveys. The recruitment of the original panel is where the randomness comes in, by dialing (more or less) random phone numbers. But who chooses to be in the panel is not random, of course. What the firm does, then, is apply weights to the sample. That is, you don't count each person as one person; you count them as a certain multiple of a person, so that the weighted total sample looks like the target population: in this case, all noninstitutionalized American adults ages 18-39.

In the paper, Regnerus offers an appendix which compares his New Family Structures Study to the national population as represented in better, larger samples, such as the Current Population Survey (CPS). He writes:

Appendix A presents a comparison of age-appropriate summary statistics from a variety of socio-demographic variables in the NFSS, alongside the most recent iterations of the Current Population Survey, the National Longitudinal Study of Adolescent Health (Add Health), the National Survey of Family Growth, and the National Study of Youth and Religion—all recent nationally-representative survey efforts. The estimates reported there suggest the NFSS compares very favorably with other nationally-representative datasets.

So, he eyeballs the comparisons and determines the result is “very favorable.” I had previously eyeballed the first few rows of that table and reached the same conclusion. This is the distribution of age, race/ethnicity, region and sex from that table:

[Table: NFSS vs. CPS distributions of age, race/ethnicity, region, and sex]

So, it looks very similar to the national population as counted by the benchmark CPS. But both of these surveys are weighted on these factors. That is, after the sample is drawn, they change the counts of people to make them match what we know from Census data (which are weighted, too, incidentally). So the fact that the NFSS matches the CPS on these characteristics just means they did the weights right, so far.

Think about it this way. If I collect data on 6 men and 4 women, it's easy to call my data "representative" if I weight those 6 men by .83 and the 4 women by 1.25. The more variables you try to match on, the harder the math gets, but the principle is the same.
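To spell out the arithmetic: each weight is the target share divided by the sample share. With a 50/50 target population, the men get 0.50/0.60 ≈ 0.83 and the women get 0.50/0.40 = 1.25, so the weighted sample counts as 5 "men" and 5 "women."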

But when I looked further down the table, I saw that Regnerus's data don't compare "very favorably" to the national data on some other variables. Here are household income (from the CPS) and self-reported health (from the National Survey of Family Growth):

[Table: Household income distribution, NFSS vs. CPS]

[Table: Self-reported health distribution, NFSS vs. NSFG]

This means that, when you apply the weights to the NFSS data, which produce comparable distributions on age, sex, race/ethnicity, and region, you get a sample that is quite a bit poorer and less healthy than the national average as represented by the better surveys.

I was confused by this partly because, according to the Knowledge Networks documentation on the NFSS, income was one of the weighting variables.

I don’t know how big an issue this is. Do you? And do you know of a standard by which a researcher or research firm can declare data “nationally representative” in this age of small, fast, low-response, online surveys?

 
