Cell Key in Pandas

A simplified open source Python demonstration of the cell key method using the pandas library. This is one of the disclosure control algorithms in our product Cantabular - an innovative software framework for protecting and publishing statistical data.

Cell key is a perturbative disclosure control method that adds small amounts of noise to some cells in a frequency table that is produced from microdata containing categorical data. For more information on the cell key method and Cantabular, see our "Demonstrating the cell key method in Python" blog post.

This Python implementation of the cell key method can be applied to a variety of datasets.

The UK based Office for National Statistics (ONS) has selected Cantabular to allow flexible dissemination of confidential Census 2021 data.

Written by Mattie Phillips and Peter Hynes at The Sensible Code Company.


Microdata

To start, we need a microdata CSV file containing categorical data.

In this example we are using a CSV of penguin characteristics borrowed from the Seaborn project.

We will keep only the species, sex and bill_depth_mm columns for this example. (Any number of variables can be used in this demonstration.)

The bill depths are floating point numbers with 1 decimal place, so let's round these values to the nearest integer.

Also, some of the records do not have a recorded bill depth (bill_depth_mm) or sex, so we remove those rows. An alternative option here would be to map any missing values to a Not specified category in order to avoid removing rows.
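A minimal sketch of this preparation step, assuming the penguins CSV published in the Seaborn data repository and its column names (the URL and the exact ordering of the steps are illustrative):

import pandas as pd

# Load the penguins microdata (URL assumed: the Seaborn data repository)
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguins = pd.read_csv(url)

# Keep only the variables used in this demonstration
penguins = penguins[["species", "sex", "bill_depth_mm"]]

# Drop rows with a missing sex or bill depth, then round bill depths to the nearest integer
penguins = penguins.dropna()
penguins["bill_depth_mm"] = penguins["bill_depth_mm"].round().astype(int)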

Row keys (or record keys)

Row keys are assigned to each row in the microdata. These are integer values and are later used to calculate cell keys for each cell in a frequency table. The cell keys are critical in determining the amount of noise to apply to each cell.

For simplicity, we have assigned row keys in the range 0 to 3. In practice this range would be much larger.
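As a sketch, assuming the row keys are stored in a column named rkey and drawn pseudo-randomly (the seed is there only to make the example reproducible):

import numpy as np

# Assign each microdata row a pseudo-random row key in the range 0-3
rng = np.random.default_rng(seed=0)
penguins["rkey"] = rng.integers(0, 4, size=len(penguins))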

Cross tabulation

Now we'll use the pandas crosstab function to create a cross tabulation of variables in the modified penguins dataset.

Here, the index is made up of all variables in penguins except the final one (species and sex), and the columns are made up of the final variable (bill_depth_mm).

A dataframe is output with a multi-level index. This is the unperturbed frequency table.
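As a sketch, using the column layout above, the unperturbed frequency table (counts) can be built like this:

# Index: species and sex; columns: bill_depth_mm
counts = pd.crosstab(
    index=[penguins["species"], penguins["sex"]],
    columns=penguins["bill_depth_mm"],
)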

Cell keys (or ckeys)

Next, we need to calculate the cell keys for each cell in the unperturbed frequency table.

For each cell, we use the pandas crosstab function again, but this time to sum the rkeys from the contributing rows in the microdata rather than count them. The index and columns are the same as for the frequency table above.

We then use the modulo operation to reduce each sum to a remainder. In this case we use .mod(4) because our ptable (see below) only contains cell keys 0 to 3.

Python integers do not have a maximum size, therefore we do not need to worry about integer overflow.

The resultant dataframe has the same structure and cell combinations as counts.
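A sketch of this step, reusing crosstab but summing the rkey column of the contributing rows instead of counting them, then reducing modulo 4 (the fillna(0) is needed because combinations with no contributing rows come back as NaN):

# Sum the row keys of the contributing rows for each cell, then take mod 4
cell_keys = (
    pd.crosstab(
        index=[penguins["species"], penguins["sex"]],
        columns=penguins["bill_depth_mm"],
        values=penguins["rkey"],
        aggfunc="sum",
    )
    .fillna(0)
    .astype(int)
    .mod(4)
)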

Perturbation table

Some cells in the frequency table are perturbed. The amount of perturbation is determined by the cell key and frequency value for each cell. A perturbation table (ptable) is used to specify the amount of perturbation to apply for each combination of cell key and frequency value.

Here we have included frequency values up to 3 and reuse these entries for higher frequency values. In a full implementation it would be possible to cycle through a larger range of values. Using a small range of valid cell keys and frequency values allows us to create a small perturbation table that can be easily understood.

The structure of this ptable is a pandas dataframe with a multiindex of cell_key and value.
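A minimal sketch of such a ptable. The perturbation amounts and the column name perturbation are illustrative rather than the values used in the original demonstration, but they are chosen so that the worked example later (cell_key 2, value 3) gives a perturbation of 1:

# Illustrative perturbation table: multiindex of cell_key (0-3) and value
# (frequencies 1-3; zero cells are never perturbed, and higher frequencies
# reuse the value-3 entries)
ptable = pd.DataFrame(
    {
        "cell_key":     [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
        "value":        [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        "perturbation": [0, 1, 0, 1, -1, 0, -1, 0, 1, 0, 0, -1],
    }
).set_index(["cell_key", "value"])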

Applying perturbation

Now that we have a table of cell keys, we can use our ptable to determine the required perturbation for each cell in counts.

A dataframe with the same structure as counts is created containing all zeros.

For every cell in counts, the corresponding cell key is extracted from cell_keys. The value and cell key are then looked up in ptable and, if the resulting perturbation is not 0, it overwrites the 0 in that cell. Cells with a value greater than 3 use the value 3 entries in the ptable.
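A sketch of that lookup, assuming the ptable above and a result dataframe named perturbations (a straightforward loop is used here for clarity rather than speed):

# Start from a table of zeros with the same structure as counts
perturbations = pd.DataFrame(0, index=counts.index, columns=counts.columns)

for idx in counts.index:
    for col in counts.columns:
        value = counts.loc[idx, col]
        if value == 0:
            continue  # empty cells are not perturbed
        cell_key = cell_keys.loc[idx, col]
        lookup = min(value, 3)  # values above 3 reuse the value-3 entries
        p = ptable.loc[(cell_key, lookup), "perturbation"]
        if p != 0:
            perturbations.loc[idx, col] = p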

Perturbed counts

Now that we have the unperturbed counts table and the table of perturbation values, we can sum the two to create a perturbed counts table. The addition of noise means it is no longer possible to know which small values are real, making the table less disclosive.
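With the names assumed above, this is an element-wise sum:

# Add the perturbation values to the unperturbed counts
perturbed = counts + perturbations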

Closer look at one cell

Here we will look at one specific cell of our frequency table to walk through how cell key perturbation is applied to it.

We take from penguins any rows for MALE Adelie penguins with a bill depth of 17mm. Three rows are returned, giving an unperturbed count of 3.

If we sum the row keys and take that sum modulo 4, we get a result of 2. Therefore our cell_key is 2.

Next, we map our value and cell_key to the ptable. This gives a perturbation value of 1.

Finally, we add the perturbation value to the original counts value. We get a perturbed value of 4.
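The same single-cell calculation, sketched with the names assumed above (the capitalisation of "MALE" follows the walkthrough; with a different random seed the cell key, and therefore the perturbation, may differ):

# Rows contributing to the (Adelie, MALE, 17mm) cell
cell_rows = penguins[
    (penguins["species"] == "Adelie")
    & (penguins["sex"] == "MALE")
    & (penguins["bill_depth_mm"] == 17)
]

value = len(cell_rows)                  # unperturbed count: 3 in the walkthrough
cell_key = cell_rows["rkey"].sum() % 4  # sum of row keys mod 4: 2 in the walkthrough
perturbation = ptable.loc[(cell_key, min(value, 3)), "perturbation"]
perturbed_value = value + perturbation  # 3 + 1 = 4 in the walkthrough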

Perturbation colour scale

The dataframe below highlights perturbed cells with the colour dependent on the amount of perturbation applied. Unhighlighted cells have not had perturbation applied.
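One possible way to produce such a view, sketched with the pandas Styler; the colours and the helper function are illustrative:

# Map each perturbation amount to a background colour (colours chosen
# arbitrarily for illustration)
def highlight(p):
    colours = {1: "background-color: #fdd0a2", -1: "background-color: #c6dbef"}
    return colours.get(p, "")

styled = perturbed.style.apply(
    lambda _: perturbations.apply(lambda col: col.map(highlight)), axis=None
)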

The original unperturbed counts table is shown here for comparison.


If you have any questions about the cell key method or Cantabular, please feel free to get in touch with us at hello@sensiblecode.io.

This demonstration is also available in Jupyter notebook format.