Demonstrating Cell Key Perturbation in Python

Picture: Photo by Ivan on Stockvault

Over the past few months, the Sensible Code team has put together an example implementation of the cell key method, a perturbation method that adds noise to frequency tables in order to protect against differencing attacks. It is one of the disclosure control algorithms in our product Cantabular. This example has been written using Python in a Jupyter notebook.

It was proposed for two reasons:

  1. To explain in principle how the algorithm works.
  2. To provide an open-source learning resource for customers and anyone interested in learning about the method.

The Sensible Code Company works with statisticians and data controllers to help improve business operations that require the processing of confidential data.

Cell Key Method

Cell key is a perturbative disclosure control method that adds small amounts of noise to some cells in a frequency table. The noise added is based on a cell’s value and a ‘cell key’ calculated for each cell. Cell key values are not unique and the same noise can be added to multiple cells. It’s important to note that the noise is not added randomly: if the same query is run multiple times, the query will always be perturbed in the same way.
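To illustrate why repeatable noise matters, here is a minimal sketch contrasting freshly drawn random noise with noise tied to a cell key. The function names, the noise values and the true count of 3 are all invented for this illustration; they are not taken from Cantabular or the notebook.

```python
import random

TRUE_COUNT = 3  # invented true cell value


def random_noise_query(rng):
    """Add freshly drawn noise every time the query is run."""
    return TRUE_COUNT + rng.choice([-1, 0, 1])


def keyed_noise_query(cell_key):
    """Add noise that depends only on the cell key (hypothetical mapping)."""
    noise_by_key = {0: -1, 1: 0, 2: 1}
    return TRUE_COUNT + noise_by_key[cell_key]


rng = random.Random(0)
random_answers = [random_noise_query(rng) for _ in range(1000)]
keyed_answers = [keyed_noise_query(cell_key=2) for _ in range(1000)]

# Averaging many randomly perturbed answers drifts back towards the true count.
average_random = sum(random_answers) / len(random_answers)

# The keyed answers never vary, so repeating the query reveals nothing new.
distinct_keyed = set(keyed_answers)

print(f"average of random-noise answers: {average_random:.2f}")
print(f"distinct keyed-noise answers: {distinct_keyed}")
```

With random noise, repeating the query is itself an attack, since the noise averages out. With keyed noise, every repetition returns the same perturbed value.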

To apply the cell key method, a microdata file containing categorical data must be supplied. An unperturbed frequency table is generated from the chosen variables, then a ‘cell key’ is calculated for each cell by summing the record keys (integers randomly assigned to each row) of the contributing rows in the source microdata. Each cell’s key and value are then looked up in a perturbation table, or ptable (for example, a CSV file listing perturbation values), which specifies how much noise should be applied to that cell. The final output is a perturbed frequency table that is no longer disclosive.
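The steps above can be sketched in a few lines of standard-library Python. This is not the code from the notebook: the toy records, the ptable column names (`cell_key`, `cell_value`, `perturbation`), the ptable values and the key range `NUM_KEYS` are all assumptions made for illustration. The sketch also reduces the summed record keys modulo `NUM_KEYS` to keep cell keys in a fixed range, which is an assumption not stated in the description above.

```python
import csv
import io
import random
from collections import defaultdict

NUM_KEYS = 4  # assumed key range: record keys and cell keys lie in 0..3

# Hypothetical ptable: column names and values are invented for illustration.
PTABLE_CSV = """cell_key,cell_value,perturbation
0,1,-1
1,1,0
2,1,1
3,1,0
0,2,0
1,2,-1
2,2,0
3,2,1
"""

# Toy microdata (invented): one dict per record, categorical variables only.
MICRODATA = [
    {"species": "Adelie", "island": "Torgersen"},
    {"species": "Adelie", "island": "Torgersen"},
    {"species": "Adelie", "island": "Biscoe"},
    {"species": "Gentoo", "island": "Biscoe"},
    {"species": "Gentoo", "island": "Biscoe"},
    {"species": "Gentoo", "island": "Biscoe"},
    {"species": "Chinstrap", "island": "Dream"},
]


def load_ptable(text):
    """Read the ptable into a {(cell_key, cell_value): perturbation} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        (int(row["cell_key"]), int(row["cell_value"])): int(row["perturbation"])
        for row in reader
    }


def assign_record_keys(rows, seed):
    """Give each record a random integer key; a fixed seed keeps keys stable."""
    rng = random.Random(seed)
    return [dict(row, record_key=rng.randrange(NUM_KEYS)) for row in rows]


def frequency_table(rows, variables):
    """Count records per cell and derive each cell key from the record keys."""
    counts = defaultdict(int)
    key_sums = defaultdict(int)
    for row in rows:
        cell = tuple(row[v] for v in variables)
        counts[cell] += 1
        key_sums[cell] += row["record_key"]
    # Assumption: the summed record keys are reduced modulo NUM_KEYS.
    return {
        cell: {"count": counts[cell], "cell_key": key_sums[cell] % NUM_KEYS}
        for cell in counts
    }


def perturb(table, ptable):
    """Look up each (cell key, count) pair; pairs not in the ptable get no noise."""
    return {
        cell: info["count"] + ptable.get((info["cell_key"], info["count"]), 0)
        for cell, info in table.items()
    }


keyed = assign_record_keys(MICRODATA, seed=2021)
ptable = load_ptable(PTABLE_CSV)
unperturbed = frequency_table(keyed, ["species", "island"])
perturbed = perturb(unperturbed, ptable)

for cell, info in sorted(unperturbed.items()):
    print(cell, "count:", info["count"], "key:", info["cell_key"],
          "perturbed:", perturbed[cell])
```

Because the record keys are assigned once and stored with the microdata, the same cell always yields the same cell key, and therefore the same noise, however many times the table is requested.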

Figure 1 provides an illustrative example of the cell key perturbation method.

Figure 1: Applying Cell-Key Perturbation to 2021 Census Outputs (The Office for National Statistics)

Python Implementation

A Jupyter notebook containing the example Python implementation is publicly available here. It uses a categorical dataset of penguin species characteristics borrowed from the Seaborn project and a simple CSV file to define the perturbation to be applied.

Example

This example shows an output from the Jupyter notebook using 3 variables from the penguin species dataset.

Figures 2 and 3 show the output generated for all 3 variables before and after perturbation respectively. Figure 2 contains the unperturbed frequency counts, before values from the perturbation table (ptable) are applied. Figure 3 contains the perturbed counts; cells with noise added have been highlighted.

Figure 2: Unperturbed frequency table

Figure 3: Perturbed frequency table

Cantabular uses a wider range of cell keys, so the output from this simplified implementation cannot be compared directly with a Cantabular output on the same dataset. To make such a comparison, the cell keys in this implementation would need to be adjusted.

This cell key Python implementation could be applied to a variety of different datasets.

Cell key in Cantabular & the ONS

Over the past few years, Sensible Code has been busy building Cantabular: an innovative software framework for protecting and publishing statistical data.

Cantabular works by programmatically adding noise, and hence uncertainty, to outputs, and by screening queries and output tables for disclosure risks. The cell key method is used in Cantabular because it removes the need to check tables individually while also protecting against differencing. Differencing is the process of comparing multiple similar tables to reveal unpublished and disclosive information.
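As a small illustration of differencing on unperturbed tables, the sketch below subtracts one frequency table from a very similar one. The records, ages and occupations are invented, and the helper `occupation_table` is hypothetical.

```python
from collections import Counter

# Invented records for a small area.
people = [
    {"age": 34, "occupation": "teacher"},
    {"age": 51, "occupation": "nurse"},
    {"age": 47, "occupation": "teacher"},
    {"age": 29, "occupation": "nurse"},
    {"age": 92, "occupation": "farmer"},
]


def occupation_table(rows):
    """Unperturbed frequency table of occupation."""
    return Counter(row["occupation"] for row in rows)


# Two similar tables: everyone, and everyone aged under 90.
table_all = occupation_table(people)
table_under_90 = occupation_table([p for p in people if p["age"] < 90])

# Subtracting the tables isolates the single person aged 90 or over.
difference = table_all - table_under_90
print(dict(difference))
```

Neither table is disclosive on its own, but their difference reveals the occupation of the only person aged 90 or over. When both tables have been perturbed, the subtraction mixes noise with the true difference, so the result can no longer be relied upon.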

The UK-based Office for National Statistics (ONS) has selected Cantabular to allow flexible dissemination of confidential Census 2021 data. The census is a critically important dataset for the ONS and information assurance is paramount. For the 2011 Census, after the standard releases, users could request further outputs, each of which had to be individually created and assessed for disclosure risk by the statistical disclosure control (SDC) team [1]. Cantabular reduces the delay between data collection and publication, and its API allows statisticians to automate tasks such as querying the census data.

For more information about Cantabular and the ONS, see our blog post “Cantabular product launch”.

Want to find out more?

If you’d like to find out more about how our product Cantabular works or the Python cell key implementation, we’d love to hear from you. Drop us an email at hello@sensiblecode.io.
