Tag Archives: graphics

COVID-19 graphs, with data and code

Updated March 25.

Although I’m not an expert on pandemic analysis, I am naturally following the COVID-19 data as best I can. And because I always understand data better when I make the figures myself, I’ve been making and looking at COVID-19 trend data, and sharing it as I go.

The figures below are the latest I made as of March 29, but you can click on the images to link to the current version. The figures, as well as data files and code, are in an Open Science Framework project, here: osf.io/wd2n6/, under CC0 license (free to use for any purpose). The project updates automatically as I go, but these figures won’t (because this is an old-fashioned blog).

First, across countries:

country cases and deaths

For this one, to put the diverse US in perspective, I included US states in addition to selected countries. These are deaths.

countries and states since 10 deaths

State cases and deaths, per capita:

state cases and death rates bar

Finally, one with commentary: The first month, in numbers and Trump’s winning words:

first month of winning coronavirus



Filed under In the news, Me @ work

Do rich people like bad data tweets about poor people? (Bins, slopes, and graphs edition)

Almost 2,000 people retweeted this from Brad Wilcox the other day.


Brad shared the graph from Charles Lehman (who noticed later that he had mislabeled the x-axis, but that’s not the point). First, as far as I can tell the values are wrong. I don’t know how they did it, but when I look at the 2016-2018 General Social Survey, I get 4.3 average hours of TV for people in the poorest families, and 1.9 hours for the richest. They report higher highs (looks like 5.3) and lower lows (looks like 1.5). More seriously, I have to object to drawing what purports to be a regression line as if those are evenly-spaced income categories, which makes it look much more linear than it is.

I fixed those errors — the correct values, and the correct spacing on the x-axis — then added some confidence intervals, and what I get is probably not worth thousands of self-congratulatory woots, although of course rich people do watch less TV. Here is my figure, with their line (drawn in by hand) for comparison:


Charles and Brad’s post got a lot of love from conservatives, I believe, because it confirmed their assumptions about self-destructive behavior among poor people. That is, here is more evidence that poor people have bad habits and it’s just dragging them down. But there are reasons this particular graph worked so well. First, the steep slope, which partly results from getting the data wrong. And second, the tight fit of the regression line. That’s why Brad said, “Whoa.” So, good tweet — bad science. (Surprise.) Here are some critiques.

First, this is the wrong survey to use. Since 1975, GSS has been asking people, “On the average day, about how many hours do you personally watch television?” It’s great to have a continuous series on this, but it’s not a good way to measure time use because people are bad at estimating these things. Also, GSS is not a great survey for measuring income. And it’s a pretty small sample. So if those are the two variables you’re interested in, you should use the American Time Use Survey (available from IPUMS), in which respondents are drawn from the much larger Current Population Survey samples, and asked to fill out a time diary. On the other hand, GSS would be good for analyzing, for example, whether people who believe the Bible is “the actual word of God and is to be taken literally, word for word” watch TV more than those who believe it is “an ancient book of fables, legends, history, and moral precepts recorded by men.” (Yes, they do, about an hour more.) Or you could look at any of the other social variables GSS is good for.

On the substantive issue, Gray Kimbrough pointed out that the connection between family income and TV time may be spurious, and is certainly confounded with hours spent at work. When I made a simple regression model of TV time with family income, hours worked, age, sex, race/ethnicity, education, and marital status (which again, should be done better with ATUS), I did find that both hours worked and family income had big effects. Here they are from that model, as predicted values using average marginal effects.

tv work faminc

The banal observation that people who spend more time working spend less time watching TV probably wouldn’t carry the punch. Anyway, neither resolves the question of cause and effect.

Fits and slopes

On the issue of the presentation of slopes, there’s a good lesson here. Data presentation involves trading detail for clarity. And statistics have both a descriptive and an analytical purpose. Sometimes we use statistics to present information in simplified form, which allows better comprehension. We also use statistics to discover relationships we couldn’t otherwise, such as multivariate relationships that you can’t discern visually. The analyst and communicator has to choose wisely what to present. A good propagandist knows what to manipulate for political effect (a bad one just tweets out crap until they get lucky).

Here’s a much less click-worthy presentation of the relationship between family income and TV time. Here I truncate the y-axis at 12 hours (cutting off 1% of the sample), translate the binned income categories into dollar values at the middle of each category, and then jitter the scatterplot so you can see how many points are piled up in each spot. The fitted line is Stata’s median spline, with 9 bands specified (so it’s the median hours at the median income in 9 locations on the x-axis). I guess this means that, at the median, rich people in America watch about an hour of TV per day less than poor people, and the action is mostly under $50,000 per year. Woot.
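For readers who want to try the same steps outside Stata, here is a minimal Python sketch of the three moves described above: turning binned income into dollar midpoints, jittering the points, and taking the median of x and y within bands. It is not the code used for this post. The income bins are hypothetical (not the actual GSS categories), the equal-width banding is an assumption, and it only computes the cross medians rather than drawing a spline through them the way Stata's median spline does.

```python
import random
from statistics import median

# Hypothetical income bins (lower, upper) in dollars; NOT the real GSS categories.
INCOME_BINS = {1: (0, 10_000), 2: (10_000, 25_000),
               3: (25_000, 50_000), 4: (50_000, 100_000)}


def bin_midpoint(code):
    """Dollar value at the middle of a binned income category."""
    lower, upper = INCOME_BINS[code]
    return (lower + upper) / 2


def truncate(pairs, max_hours=12):
    """Drop (income, hours) pairs above the y-axis cutoff."""
    return [(x, y) for x, y in pairs if y <= max_hours]


def jitter(values, width, rng):
    """Add uniform noise in [-width, width] so stacked points spread out."""
    return [v + rng.uniform(-width, width) for v in values]


def cross_medians(xs, ys, bands):
    """Median x and median y within `bands` equal-width slices of x."""
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    groups = [[] for _ in range(bands)]
    for x, y in zip(xs, ys):
        idx = min(int((x - lo) / span * bands), bands - 1)
        groups[idx].append((x, y))
    # Skip empty bands rather than inventing a value for them.
    return [(median(x for x, _ in g), median(y for _, y in g))
            for g in groups if g]
```

The jittered values are only for plotting; the medians should be computed on the unjittered data so the noise does not leak into the fitted line.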

gss tv income

Finally, a word about binning and the presentation of data (something I’ve written about before, here and here). We make continuous data into categories all the time, starting from measurement. We usually measure age in years, for example, although we could measure it in seconds or decades. Then we use statistics to simplify information further, for example by reporting averages. In the visual presentation of data, there is a particular problem with using averages or data bins to show relationships — you can show slopes that way nicely, but you run the risk of making relationships look more closely correlated than they are. This happens in the public presentation of data when analysts are showing something of their work product — such as a scatterplot with a fitted line — to demonstrate the veracity of their findings. When they bin the data first, this can be very misleading.

Here’s an example. I took about 1000 men from the GSS, and compared their age and income. Between the ages of 25 and 59, older men have higher average incomes, but the fit is curved with a peak around 45. Here is the relationship, again using jittering to show all the individuals, with a linear regression line. The correlation is .23.

That might be nice to look at but it’s hard to see the underlying relationship. It’s hard to even see how the fitted line relates to the data. So you might reduce it by showing the average income at each age. By pulling the points together vertically into average bins, this shows the relationship much more clearly. However, it also makes the relationship look much stronger. The correlation in this figure is .65. Now the reader might think, “Whoa.”

Note this didn’t change the slope much (it still runs from about $30k to $60k), it just put all the dots closer to the line. Finally, here it is pulling the averages together in horizontal bins, grouping the ages in fives (25-29, 30-34 … 55-59). The correlation shown here is .97.


If you’re like me, this is when you figured out that reducing this to two dots would produce a correlation of 1.0 (as long as the dots aren’t exactly level).
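The binning effect is easy to reproduce with simulated data. The Python sketch below is an illustration only: the data are made up (a straight-line age-income relationship with a lot of noise, not the curved GSS pattern), and the slope, noise level, and seed are arbitrary assumptions. It computes the correlation for individuals, for single-year age averages, for five-year averages, and for just two dots.

```python
import random
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def bin_means(xs, ys, width):
    """Average x and y within bins that are `width` units of x wide."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[x // width].append((x, y))
    keys = sorted(bins)
    mean_x = [sum(x for x, _ in bins[k]) / len(bins[k]) for k in keys]
    mean_y = [sum(y for _, y in bins[k]) / len(bins[k]) for k in keys]
    return mean_x, mean_y


# Simulated data, NOT the GSS: income rises with age, plus heavy noise.
rng = random.Random(1)
ages = [rng.randint(25, 59) for _ in range(1000)]
incomes = [30_000 + 850 * (a - 25) + rng.gauss(0, 35_000) for a in ages]

r_individuals = pearson(ages, incomes)               # every person is a dot
r_by_age = pearson(*bin_means(ages, incomes, 1))     # one dot per year of age
r_by_five = pearson(*bin_means(ages, incomes, 5))    # 25-29, 30-34 ... 55-59
r_two_dots = pearson(*bin_means(ages, incomes, 35))  # 25-34 versus 35-59

print(r_individuals, r_by_age, r_by_five, r_two_dots)
```

The underlying slope is the same in every version; averaging only removes the vertical scatter around it, which is what the correlation measures.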

To make good data presentation tradeoffs requires experimentation and careful exposition. And, of course, transparency. My code for this post is available on the Open Science Framework here (you gotta get the GSS data first).


Filed under In the news

Visualizing family modernization, 1900-2016

After this post about small multiple graphs, and partly inspired by two news reports I was interviewed for — this Salt Lake Tribune story about teen marriage, and this New York Times report mapping age at first birth — I made some historical data figures.

These visualizations use decennial census data from 1900 to 1990, and then American Community Survey data for 2001, 2010, and 2016; all data from IPUMS.org. (I didn’t use the 2000 Census because marital status is messed up in that data, with a lot of people who should be never married coded as married, spouse absent; 2001 ACS gets it done.)

An important, simple way of illustrating the myth-making around the 1950s is with marriage age. Contrary to the myth that the 1950s was “traditional,” a long data series shows the period to be unique. The two trends here, teen marriage and divorce, both show the modernization of family life, with increasing individual self-determination and less restricted family choices for women.

First, I show the proportion of teenage women married in each state, for each decade from 1900 to 2016. The measure I used for this is the proportion of 19- and 20-year-olds who have ever been married (that is, including those married, divorced, and widowed). It’s impossible to tell exactly how many people were married before their 20th birthday, which would be a technical definition of teen marriage, but the average of 19 and 20 should do it, since it includes some people who have just turned 19 and some who are about to turn 21, for an average close to exact age 20.
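As a rough illustration of that measure, here is a Python sketch that collapses person records into a weighted ever-married share for women ages 19 and 20, by state and year. It is not the Stata code behind these figures. The record layout (dicts with `state`, `year`, `age`, `sex`, `marst`, `perwt`) is hypothetical, and the codes (6 for never married, 2 for female) assume the usual IPUMS coding, so check them against the codebook for your extract.

```python
from collections import defaultdict

NEVER_MARRIED = 6  # assumed IPUMS MARST code for never married/single
FEMALE = 2         # assumed IPUMS SEX code for female


def ever_married_share(records):
    """Weighted share ever married among women ages 19-20, by (state, year)."""
    total = defaultdict(float)
    ever = defaultdict(float)
    for r in records:
        if r["sex"] != FEMALE or r["age"] not in (19, 20):
            continue
        key = (r["state"], r["year"])
        total[key] += r["perwt"]
        # Ever married = anyone not coded never married
        # (married, separated, divorced, or widowed).
        if r["marst"] != NEVER_MARRIED:
            ever[key] += r["perwt"]
    return {key: ever[key] / total[key] for key in total}
```

Running this once over the pooled samples gives one value per state per year, which is the shape needed for both the small multiples and the maps.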

I start with a small multiple graph of the trend on this measure in every state (click all figures to enlarge). Here the states are ordered by the level of teen marriage in 2016, from Maine lowest (<1%) to Utah (14%):

teen marriage 1900-2016

This is useful for seeing that the basic pattern is universal: starting the century lower and rising to a peak in 1960, then declining steeply to the present. But that similarity, and smaller range in the latest data, make it hard to see the large relative differences across states now. Here are the 2016 levels, showing those disparities clearly:

teen marriage states 2016

Neither the small multiples nor the bars help you see the regional patterns and variations. So here’s an animated map that shows both the scale of change and the pattern of variation.


This makes clear the stark South/non-South divide, and how the Northeast led the decline in early marriage. Also, you can see that Utah, which is such a standout now, did not have historically high teen marriage levels; the state just hasn’t matched the decline seen nationally. Their premodernism emerged only in relief.


Next, divorce. Here I again used a prevalence measure. This is just the number of people whose marital status is divorced, divided by the number of people who are married, separated, or divorced. It’s a little better than just the percentage divorced in the population, because it’s at least scaled by marriage prevalence. But it doesn’t count divorces happening, and it doesn’t count people who divorced and then remarried (so it will under-represent divorce to the extent that people remarry). Also, if divorced people die younger than married people, it could be messed up at older ages. Anyway, it’s the best thing I could think of for divorce rates by state all the way back to 1900.
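The measure itself is one line of arithmetic. This small Python sketch spells out the denominator as I read the description above; the function name and the counts in the example are made up for illustration.

```python
def divorce_prevalence(married, separated, divorced):
    """Currently divorced as a share of married + separated + divorced.

    Widowed and never-married people are left out of the denominator.
    Someone who divorced and then remarried is counted under `married`,
    which is why this understates how much divorce has happened.
    """
    pool = married + separated + divorced
    if pool == 0:
        raise ValueError("no married, separated, or divorced people")
    return divorced / pool


# Made-up weighted counts for one state in one year.
example = divorce_prevalence(married=700, separated=50, divorced=250)
```

With these invented counts the prevalence is 250 out of 1,000, or 25%.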

So, here’s the small multiple graph, showing the trend in divorce prevalence for all states from 1900 to 2016:


That looks like impressive uniformity: gradual increase until 1970, then a steep upward turn to the present. These are again ordered by the 2016 value, from Utah at less than 20% to New Mexico at more than 30% — smaller variation than we saw in teen marriage. That steep increase looks dramatic in the animated map, which also reveals the regional patterns:



The strategy for both trends is to download microdata samples from all years, then collapse the files down to state averages by decade. The linear figures are Stata scatter plots by state. The animated maps use maptile in Stata (by Michael Stepner) to make separate image files for each map, which I then imported into Photoshop to make the animations (following this tutorial).

The downloaded data, codebooks, Stata code, and images are all available in an Open Science Framework project here. Feel free to adapt and use. Happy to hear suggestions and alternative techniques in the comments.


Filed under Me @ work

African American marital status by age, Du Bois replication edition

At the 1900 Paris Exposition, sociologist W. E. B. Du Bois presented some of the work of his students. In The Scholar Denied: W. E. B. Du Bois and the Birth of Modern Sociology, Aldon Morris writes:

Du Bois’s meticulousness as a teacher is apparent in the charts and graphs that he prepared with his students. For example, as part of his gold medal-winning exhibit for the 1900 Paris Exposition, Du Bois and his students produced detailed hand-drawn artistically colored graphs and charts that depicted the journey of black Georgians from slavery to freedom.

Some of the collection is shown in this post at the Public Domain Review (shared by Tressie McMillan Cottom yesterday); the full collection is online at the Library of Congress (LOC).

The one that caught my eye was this, showing marital status (“conjugal condition”) by age and sex for the Black population. I can’t find the source details in the LOC record, so I don’t know if it’s Georgia or national, but I presume it’s from tabulations of the 1890 decennial census or earlier:


It’s artistic and meticulous and clearly informative, beautiful. So I tried to make a 2015 update to complement it. I used data from the 2015 American Community Survey via IPUMS.org, and did it a little differently.* Most importantly, I added two more conjugal conditions, cohabiting and separated/divorced. Second, I used five-year age groupings all the way up, instead of ten. Third, I detailed the age groups up to age 85. Here’s what I got:

du bois marstat replication

Some very big differences: Much smaller proportions of African Americans married now. Also, much later marriage. In the 1900 figure more than 30% of men and 60% of women have been married by age 25; those numbers are 5-6% now. I don’t know how they counted separated/divorced people in 1900, but those numbers are high now at 31% for women and 24% for men at age 60-64. Widowhood is later now, as 42% of women were widowed before age 65 in 1900, compared with only 13% now (of course, that’s off a lower marriage rate, and remarried people are just counted as married). And of course cohabitation, which the chart doesn’t show for 1900. Note I included people in same-sex as well as different-sex couples.

So, thanks for indulging me. I hope you don’t think it’s frivolous. I just love staring at the old charts, and going through the (very different) steps of replicating it was really satisfying. (I also just love that in another 100 years someone might look back on this and say, “Wait, which one was Earth again?”)

Note: If you want to compare them side-by-side, here’s a go at that. The age ranges don’t line up perfectly but you can get the idea (click to enlarge):

* SAS code, ACS data, images, and the spreadsheet used for this post are shared as an Open Science Framework project, here.


Filed under Me @ work

Marriage and gender inequality in 124 countries

Countries with higher levels of marriage have higher levels of gender inequality. This isn’t a major discovery, but I don’t remember seeing this illustrated before, so I decided to do it. Plus I’m trying to improve my Stata graphing.

I used data from this U.N. report on marriage rates from 2008, restricted to those countries that had data from 2000 or later. To show marriage rates I used the percentage of women ages 30-34 that are currently married. This is thus a combination of marriage prevalence and marriage timing, which is something like the amount of marriage in the country. I got gender inequality from the U.N. Development Programme’s Human Development Report for 2015. The gender inequality index combines the maternal mortality ratio, the adolescent birth rate, the representation of women in the national parliament, the gender gap in secondary education, and the gender gap in labor market participation.

Here is the result. I labeled countries with 49 million population or more in red; a few interesting outliers are also labeled. The line is quadratic, unweighted for population (click to enlarge).

You can see the USA sliding right down that curve toward gender nirvana (not that I’m making a simplistic causal argument).

Note that India and China together are about 36% of the world’s population. They both have nearly universal marriage by age 30-34, but women in China get married about four years later on average. That’s an important part of why China has lower gender inequality (it goes along with more educational access, higher employment levels, politics, history, etc.). China is a major outlier among universal-marriage countries, while India is right on the curve.

Any cross-national comparison has to handle this issue. China is 139 times bigger than Sweden. One way to address it is to weight the points by their relative population sizes. If you do that it actually doesn’t change the result much, except for China, which in this case changes everything because in addition to being huge they broke the relationship between marriage and gender inequality. Here is the comparison. Now the dots are scaled for population, and the gray line is fit to all the countries except China, while the red line includes China (click to enlarge).
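To see why one huge country can move a weighted curve, here is a minimal Python sketch of a quadratic least-squares fit with optional weights. It is not the Stata code behind the charts. It solves the weighted normal equations directly, and the five data points are made up for illustration (they are not the U.N. figures).

```python
def weighted_quadratic_fit(xs, ys, weights=None):
    """Return (a, b, c) minimizing sum of w * (y - (a + b*x + c*x**2))**2."""
    if weights is None:
        weights = [1.0] * len(xs)
    # Build the 3x3 normal equations as an augmented matrix.
    m = [[0.0] * 4 for _ in range(3)]
    for x, y, w in zip(xs, ys, weights):
        row = (1.0, x, x * x)
        for i in range(3):
            for j in range(3):
                m[i][j] += w * row[i] * row[j]
            m[i][3] += w * row[i] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Made-up points: percent married, inequality index, population in millions.
marriage = [40.0, 55.0, 70.0, 85.0, 95.0]
inequality = [0.10, 0.20, 0.35, 0.50, 0.20]
population = [10.0, 60.0, 30.0, 120.0, 1400.0]

unweighted = weighted_quadratic_fit(marriage, inequality)
weighted = weighted_quadratic_fit(marriage, inequality, population)
```

In the weighted version each squared residual is multiplied by population, so the curve is pulled toward the largest point. Fitting once with and once without that point, as in the gray and red lines, shows how much of the result depends on it.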

My conclusion is that the gray line is the basic story — more marriage, more gender inequality — with China as an important exception, but that’s up for interpretation.

I put the data and the code for making the charts in this directory. Feel free to copy and crib, etc.


Filed under Me @ work

Visualizing attitude differences

This didn’t turn into something more substantial, so I’m just leaving it here as is. I like the idea of visualizing attitude (or other) differences by race/ethnicity, sex, and generation (or other characteristics) with distances. Plus my daughter is learning about x-y coordinates in math.

I got these just using the General Social Survey online analysis tool (here). These are the question texts.

HELPBLK (1-5):

Some people think that African-Americans have been discriminated against for so long that the government has a special obligation to help improve their living standards; they are at point 1. Others believe that the government should not be giving special treatment to African-Americans; they are at point 5. a. Where would you place yourself on this scale, or haven’t you made up your mind on this?

FEFAM: Strongly agree to strongly disagree (1-4):

It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family.

I used two categories each of race (Black/White), Sex (man/woman), and Generation (18-44/45+). Scores are shown as differences between each group’s mean score and the population average.
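The score calculation can be sketched in a few lines of Python. This is only an illustration of the idea: the row layout and field names are hypothetical, and it takes unweighted means, whereas a real GSS analysis would normally apply the survey weights.

```python
from collections import defaultdict
from statistics import mean


def group_deviations(rows, score_key, group_keys):
    """Each group's mean score minus the overall mean score."""
    overall = mean(r[score_key] for r in rows)
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in group_keys)].append(r[score_key])
    return {g: mean(scores) - overall for g, scores in groups.items()}
```

Calling it once per question, with the same grouping variables, gives each group an x value (its deviation on one item) and a y value (its deviation on the other), which is all the coordinate plot needs.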

Short story: Whites show little difference by age, gender, or generation on the race question, but big differences by gender and age on the gender question. Blacks are similar to Whites on the gender question except that younger Black men are less opposed to breadwinner-homemaker family arrangements than are younger Whites (especially men).

I also added some other groupings for comparison:

No real point to make about this except that I like the idea of representing these patterns like this. Someone should make a GSS tool that does this for you on any question. With confidence intervals. (If there already are such tools, please advise.)


Filed under Me @ work

NYT magazine infographic: not just dumb and annoying

This graphic from the New York Times magazine is bad data presented poorly (and reproduced poorly, by my camera phone):


It’s presented poorly because those blood stains are impossible to compare since you can’t discern their edges, and it appears they don’t taper toward the edges at the same rate. Maybe they simply resized one of them to get the relative size, which would be wrong. Anyway, if they cared about communicating the data they probably would have used real data in the first place. (You could also complain that a red speckle-cloud is unfriendly to some color-blind people.)

It’s bad data because it’s an online NYT reader survey, which — although it’s from the “research and analytics” department (and no, I’m not going to add “analytics” to my Windows dictionary) — represents unknown sample selection effects on an undefined population. In other words, who cares what they think?

A survey like that would be a start if it were the only way you had to address an important or hard-to-measure issue, and if you clearly stated that it was likely unreliable. But in this case there is good, nationally representative data on this very question. So if NYT Magazine wanted to inform its readers of something, they could have used this.

Here’s the good data, from the General Social Survey, in a chart that’s easier to read accurately, includes a breakout by strength of opinion, and uses more accessible colors (click to enlarge).

gss spank 2014

I think the NYT Magazine graphics violations are not just dumb and annoying — here’s another post all about them — I think they harm the public good. Graphics like this spread ignorance and contribute to the perception that statistics – especially graphic statistics – are just an arbitrary way of manipulating people rather than a set of tools for exploring data and attempting to answer real questions. (If you want awesome real graphics, check out Healy and Moody’s Annual Review of Sociology paper.)

P.S., I wrote more about spanking here.


Filed under In the news