Image copyright U.S. Census Bureau Graphic of United States for 2020 Census

Explore confidence intervals of the US Census 2020 data with Cantabular

We’ve built a demonstration US Census 2020 Tabulator with Cantabular that lets you display the confidence interval for every statistic it produces. The tool uses data from the U.S. Census Bureau 2020 Census Privacy-Protected Microdata File (PPMF) together with the 50 associated replicate PPMFs. The replicates are used to calculate, on the fly, the confidence interval for every cell in any cross-tabulation.

In our demonstration we provide data for people across four states: Idaho, Nevada, Oregon and Utah. If you are interested in data for more states or for households then please get in touch using our contact form.

Why are confidence intervals important?

The US Census PPMF is the result of a privacy-preserving process, which means its contents are not the exact results of the census. Instead, the results undergo significant randomisation using a process designed by the US Census Bureau known as the TopDown Algorithm (TDA). Confidence intervals tell you how accurate any statistic computed from the PPMF is.

Not every statistic that you can generate from the PPMF is equally accurate. The randomisation process models different relationships in the data with different degrees of fidelity. The simplest example is that populations at state level and above are exact, but populations at other geographic breakdowns such as congressional districts are not.

A more complex example with two statistics of similar magnitude (about 35,000):

Thus one of these statistics is roughly ten times more accurate than the other.
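
One way to make "ten times more accurate" concrete is to compare the margins of error of the two confidence intervals. The Python sketch below does that. The function names are hypothetical, and the two intervals are invented purely for illustration; they are not the figures from the demonstration.

```python
def margin_of_error(lower, upper):
    """Half the width of a confidence interval."""
    return (upper - lower) / 2


def accuracy_ratio(interval_a, interval_b):
    """How many times smaller interval_a's margin is than interval_b's."""
    return margin_of_error(*interval_b) / margin_of_error(*interval_a)


# Hypothetical intervals around two statistics of about 35,000.
# These numbers are made up for illustration only.
precise = (34_900, 35_100)  # margin of 100
noisy = (34_000, 36_000)    # margin of 1,000

print(accuracy_ratio(precise, noisy))  # prints 10.0
```

Two statistics of the same magnitude can therefore differ greatly in reliability, which is why the interval matters as much as the value itself.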

Why is calculating confidence intervals challenging?

To compute the confidence interval for a single statistic you have to compute that statistic for the master microdata (PPMF0) and also for each of the 50 additional replicate PPMFs. An output table can contain very many cells, and each cell is a statistic that has to undergo this process. For the whole USA this amounts to scanning about 17 billion rows of microdata. Cantabular can achieve this in times of the order of a second on widely available cloud server machines.
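
As a rough check on the 17 billion figure, and to show what tabulating every cell for every file involves, here is a minimal Python sketch. It assumes each file holds one row per person and that the national row count equals the 2020 Census resident population of 331,449,281. The function names and the toy record layout are hypothetical and say nothing about how Cantabular is implemented.

```python
from collections import Counter

# Assumption: one row per person, matching the 2020 Census resident population.
US_RESIDENT_POPULATION_2020 = 331_449_281
N_FILES = 1 + 50  # PPMF0 plus the 50 replicate PPMFs


def rows_scanned(n_people, n_files=N_FILES):
    """Total microdata rows read when every file is scanned once."""
    return n_people * n_files


def cross_tabulate(records, row_var, col_var):
    """Count the records falling in each (row, column) cell of a two-way table."""
    return Counter((r[row_var], r[col_var]) for r in records)


def tabulate_all(files, row_var, col_var):
    """Build the same table once per microdata file (PPMF0 first, then replicates)."""
    return [cross_tabulate(records, row_var, col_var) for records in files]


print(rows_scanned(US_RESIDENT_POPULATION_2020))  # prints 16903913331
```

Under that assumption the total is roughly 16.9 billion rows, consistent with "about 17 billion". Each cell of the output table ends up with 51 values, one per file, which feed the confidence interval calculation.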

Once it has computed the 50 additional statistics, Cantabular uses the method outlined in the US Census Bureau GitHub repository for calculating the 90% confidence interval.
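
To give a feel for the kind of calculation involved, the Python sketch below turns one PPMF0 value and its replicate values into a 90% interval. It is an illustration under stated assumptions, not a reproduction of the Census Bureau's published method: it assumes the spread of the replicate values around the PPMF0 value estimates the error, and it applies a normal approximation. The function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(base, replicates, level=0.90):
    """Approximate interval for one cell.

    base       -- the statistic computed from PPMF0
    replicates -- the same statistic computed from each replicate PPMF
    Assumption: root mean squared deviation from `base` estimates the error,
    and the error is roughly normal. The official method may differ.
    """
    if not replicates:
        raise ValueError("at least one replicate value is required")
    rmse = sqrt(sum((r - base) ** 2 for r in replicates) / len(replicates))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    margin = z * rmse
    return base - margin, base + margin


# Made-up values: 50 replicates, each 10 away from a PPMF0 value of 100.
lower, upper = confidence_interval(100, [90, 110] * 25)
print(round(lower, 2), round(upper, 2))
```

For the authoritative formula, refer to the Census Bureau repository mentioned above.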

Can I see confidence intervals graphically?

In our dashboards you can see the confidence intervals on our bar charts and sample population pyramid charts, drawn as conventional error bars on each bar. See our image below or have a go yourself.

Cantabular Dashboard Area Profile with Confidence Intervals

Explore our US Census 2020 Area Profile for insights about the population and people in your local area.

Can I get more data?

If you are interested in data for more states or for households then please get in touch using our contact form.

More examples of Cantabular

  • Office for National Statistics, UK
    The Create a custom data set tool is powered by Cantabular and applies disclosure avoidance in real time using our Disclosure Rules Language (DRL).

  • Northern Ireland Statistics & Research Agency (NISRA)
    The Flexible Table Builder is built on Cantabular and is a repeatable analytical pipeline for Census 2021 data with disclosure avoidance applied in real time.

  • Historic 1911 Ireland Census Preview
    The Interactive table explorer is built on Cantabular and provides an interactive preview for exploring age, sex, religious professions, literacy, Irish language and birthplace.