Cell Key in Pandas

A simplified open source Python demonstration of the cell key method using the pandas library. This is one of the disclosure control algorithms in our product Cantabular - an innovative software framework for protecting and publishing statistical data.

Cell key is a perturbative disclosure control method that adds small amounts of noise to some cells in a frequency table that is produced from microdata containing categorical data. For more information on the cell key method and Cantabular, see our "Demonstrating the cell key method in Python" blog post.

This Python implementation of the cell key method can be applied to a variety of datasets.

The UK based Office for National Statistics (ONS) has selected Cantabular to allow flexible dissemination of confidential Census 2021 data.

Written by Mattie Phillips and Peter Hynes at The Sensible Code Company.


Microdata

To start, we need a microdata CSV file containing categorical data.

In this example we are using a CSV of penguin characteristics borrowed from the Seaborn project.

We will keep only the species, sex and bill_depth_mm columns for this example. (Any number of variables can be used in this demonstration.)

The bill depths are floating point numbers with 1 decimal place, so let's round these values to the nearest integer.

Also, some of the records do not have a recorded bill depth (bill_depth_mm) or sex, so we remove those rows. An alternative option here would be to map any missing values to a Not specified category in order to avoid removing rows.
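A minimal sketch of this preparation step, assuming the penguins CSV published in the Seaborn data repository and its column names (the URL and the exact ordering of the steps are illustrative):

import pandas as pd

# Load the penguins microdata (URL assumed: the Seaborn data repository)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url)

# Keep only the variables used in this demonstration
penguins = penguins[["species", "sex", "bill_depth_mm"]]

# Drop rows with a missing sex or bill depth, then round bill depths to the nearest integer
penguins = penguins.dropna()
penguins["bill_depth_mm"] = penguins["bill_depth_mm"].round().astype(int)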

Row keys (or record keys)

Row keys are assigned to each row in the microdata. These are integer values and are later used to calculate cell keys for each cell in a frequency table. The cell keys are critical in determining the amount of noise to apply to each cell.

For simplicity, we have assigned row keys in the range 0 to 3. In practice this range would be much larger.
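As a sketch, assuming the row keys are stored in a column named rkey and drawn pseudo-randomly (the seed is there only to make the example reproducible):

import numpy as np

# Assign each microdata row a pseudo-random row key in the range 0-3
rng = np.random.default_rng(seed=0)
penguins["rkey"] = rng.integers(0, 4, size=len(penguins))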

Cross tabulation

Now we'll use the pandas crosstab function to create a cross tabulation of variables in the modified penguins dataset.

Here, the index is made up of all variables in penguins except the final one (species and sex), and the columns are made up of the final variable (bill_depth_mm).

A dataframe is output with a multi-level index. This is the unperturbed frequency table.
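As a sketch, using the column layout above, the unperturbed frequency table (counts) can be built like this:

# Index: species and sex; columns: bill_depth_mm
counts = pd.crosstab(
    index=[penguins["species"], penguins["sex"]],
    columns=penguins["bill_depth_mm"],
)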

Cell keys (or ckeys)

Next, we need to calculate the cell keys for each cell in the unperturbed frequency table.

For each cell, we use the pandas crosstab function again, but this time to sum the rkeys from the contributing rows in the microdata rather than count them. The index and columns are the same as for the frequency table above.

We then use the modulo operation to reduce each sum to a remainder. In this case we use .mod(4) because our ptable (see below) only contains cell keys 0 to 3.

Python integers do not have a maximum size, therefore we do not need to worry about integer overflow.

The resultant dataframe has the same structure and cell combinations as counts.
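A sketch of this step, reusing crosstab but summing the rkey column of the contributing rows instead of counting them, then reducing modulo 4 (the fillna(0) is needed because combinations with no contributing rows come back as NaN):

# Sum the row keys of the contributing rows for each cell, then take mod 4
cell_keys = (
    pd.crosstab(
        index=[penguins["species"], penguins["sex"]],
        columns=penguins["bill_depth_mm"],
        values=penguins["rkey"],
        aggfunc="sum",
    )
    .fillna(0)
    .astype(int)
    .mod(4)
)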

Perturbation table

Some cells in the frequency table are perturbed. The amount of perturbation is determined by the cell key and frequency value for each cell. A perturbation table (ptable) is used to specify the amount of perturbation to apply for each combination of cell key and frequency value.

Here we have included frequency values up to 3 and reuse these entries for higher frequency values. In a full implementation it would be possible to cycle through a larger range of values. Using a small range of valid cell keys and frequency values allows us to create a small perturbation table that can be easily understood.

The structure of this ptable is a pandas dataframe with a multiindex of cell_key and value.
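A minimal sketch of such a ptable. The perturbation amounts and the column name perturbation are illustrative rather than the values used in the original demonstration, but they are chosen so that the worked example later (cell_key 2, value 3) gives a perturbation of 1:

# Illustrative perturbation table: multiindex of cell_key (0-3) and value
# (frequencies 1-3; zero cells are never perturbed, and higher frequencies
# reuse the value-3 entries)
ptable = pd.DataFrame(
    {
        "cell_key":     [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        "value":        [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        "perturbation": [0, 1, 0, 1, -1, 0, -1, 0, 1, 0, 0, -1],
    }
).set_index(["cell_key", "value"])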

Applying perturbation

Now that we have a table of cell keys, we can use our ptable to determine the required perturbation for each cell in counts.

A dataframe with the same structure as counts is created containing all zeros.

For every cell in counts, the corresponding cell key is extracted from cell_keys. The value and cell key are then looked up in ptable and, if the resulting perturbation is not 0, it overwrites the 0 in that cell. Cells with a value greater than 3 use the value 3 entries in the ptable.
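A sketch of that lookup, assuming the ptable above and a result dataframe named perturbations (a straightforward loop is used here for clarity rather than speed):

# Start from a table of zeros with the same structure as counts
perturbations = pd.DataFrame(0, index=counts.index, columns=counts.columns)

for idx in counts.index:
    for col in counts.columns:
        value = counts.loc[idx, col]
        if value == 0:
            continue  # empty cells are not perturbed
        cell_key = cell_keys.loc[idx, col]
        lookup = min(value, 3)  # values above 3 reuse the value-3 entries
        p = ptable.loc[(cell_key, lookup), "perturbation"]
        if p != 0:
            perturbations.loc[idx, col] = p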

Perturbed counts

Now that we have the unperturbed counts table and the table of perturbation values, we can sum the two to create a perturbed counts table. The addition of noise means it is no longer possible to know which small values are real, making the table less disclosive.
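With the names assumed above, this is an element-wise sum:

# Add the perturbation values to the unperturbed counts
perturbed = counts + perturbations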

Closer look at one cell

Here we will look at one specific cell of our frequency table to walk through how cell key perturbation is applied to it.

We take from penguins any rows for MALE Adelie penguins with a bill depth of 17mm. Three rows are returned, giving an unperturbed count of 3.

If we sum the row keys and take that sum modulo 4, we get a result of 2. Therefore our cell_key is 2.

Next, we map our value and cell_key to the ptable. This gives a perturbation value of 1.

Finally, we add the perturbation value to the original counts value. We get a perturbed value of 4.
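The same single-cell calculation, sketched with the names assumed above (the capitalisation of "MALE" follows the walkthrough; with a different random seed the cell key, and therefore the perturbation, may differ):

# Rows contributing to the (Adelie, MALE, 17mm) cell
cell_rows = penguins[
    (penguins["species"] == "Adelie")
    & (penguins["sex"] == "MALE")
    & (penguins["bill_depth_mm"] == 17)
]

value = len(cell_rows)                  # unperturbed count: 3 in the walkthrough
cell_key = cell_rows["rkey"].sum() % 4  # sum of row keys mod 4: 2 in the walkthrough
perturbation = ptable.loc[(cell_key, min(value, 3)), "perturbation"]
perturbed_value = value + perturbation  # 3 + 1 = 4 in the walkthrough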

Perturbation colour scale

The dataframe below highlights perturbed cells with the colour dependent on the amount of perturbation applied. Unhighlighted cells have not had perturbation applied.
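One possible way to produce such a view, sketched with the pandas Styler; the colours and the helper function are illustrative:

# Map each perturbation amount to a background colour (colours chosen
# arbitrarily for illustration)
def highlight(p):
    colours = {1: "background-color: #fdd0a2", -1: "background-color: #c6dbef"}
    return colours.get(p, "")

styled = perturbed.style.apply(
    lambda _: perturbations.apply(lambda col: col.map(highlight)), axis=None
)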

The original unperturbed counts table is shown here for comparison.


If you have any questions about the cell key method or Cantabular, please feel free to get in touch with us at hello@sensiblecode.io.

This demonstration is also available in Jupyter notebook format.