A simplified open source Python demonstration of the cell key method using the pandas library. This is one of the disclosure control algorithms in our product Cantabular - an innovative software framework for protecting and publishing statistical data.
Cell key is a perturbative disclosure control method that adds small amounts of noise to some cells in a frequency table that is produced from microdata containing categorical data. For more information on the cell key method and Cantabular, see our "Demonstrating the cell key method in Python" blog post.
This cell key Python approach can be applied to a variety of different datasets.
The UK based Office for National Statistics (ONS) has selected Cantabular to allow flexible dissemination of confidential Census 2021 data.
Written by Mattie Phillips and Peter Hynes at The Sensible Code Company.
To start, we need a microdata CSV file containing categorical data.
In this example we are using a CSV of penguin characteristics borrowed from the Seaborn project.
import pandas as pd
import numpy as np
import io
import random
penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
display(penguins)
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
We will keep only the species
, sex
and bill_length_mm
columns for this example. (Any number of variables can be used in this demonstration).
The bill lengths are floating point numbers with 1 decimal place - let's round these values to the nearest integer.
Also, some of the records do not have a recorded bill length (bill_length_mm
) or sex (sex
) so we remove those rows. An alternative option here would be to map any missing values to a Not specified
category in order to avoid removing rows.
penguins = penguins[["species","sex","bill_depth_mm"]]
penguins = penguins[penguins.bill_depth_mm.notna() & penguins.sex.notna()]
penguins["bill_depth_mm"] = penguins["bill_depth_mm"].round(0).astype(int)
display(penguins)
species | sex | bill_depth_mm | |
---|---|---|---|
0 | Adelie | MALE | 19 |
1 | Adelie | FEMALE | 17 |
2 | Adelie | FEMALE | 18 |
4 | Adelie | FEMALE | 19 |
5 | Adelie | MALE | 21 |
... | ... | ... | ... |
338 | Gentoo | FEMALE | 14 |
340 | Gentoo | FEMALE | 14 |
341 | Gentoo | MALE | 16 |
342 | Gentoo | FEMALE | 15 |
343 | Gentoo | MALE | 16 |
333 rows × 3 columns
Row keys are assigned to each row in the microdata. These are integer values and are later used to calculate cell keys for each cell in a frequency table. The cell keys are critical in determining the amount of noise to apply to each cell.
For simplicity, we have added row keys in the range 0 - 3. Typically this range would be much higher.
np.random.seed(0)
penguins.insert(0, "row_key",
np.random.randint(0, 4, penguins.shape[0]))
display(penguins)
row_key | species | sex | bill_depth_mm | |
---|---|---|---|---|
0 | 0 | Adelie | MALE | 19 |
1 | 3 | Adelie | FEMALE | 17 |
2 | 1 | Adelie | FEMALE | 18 |
4 | 0 | Adelie | FEMALE | 19 |
5 | 3 | Adelie | MALE | 21 |
... | ... | ... | ... | ... |
338 | 3 | Gentoo | FEMALE | 14 |
340 | 1 | Gentoo | FEMALE | 14 |
341 | 2 | Gentoo | MALE | 16 |
342 | 3 | Gentoo | FEMALE | 15 |
343 | 1 | Gentoo | MALE | 16 |
333 rows × 4 columns
Now we'll use the pandas crosstab
function to create a cross tabulation of variables in the modified penguins dataset.
Here, the index is made up of all variables in penguins
except the final one. The columns are then made up of the final column. The variables being used in the index are species
and sex
. The variable used in the columns is bill_depth_mm
.
A dataframe is output with a multi-level index. This is the unperturbed frequency table.
counts = pd.crosstab(
[penguins[v] for v in list(penguins.columns)[1:-1]],
penguins[list(penguins.columns)[-1]],
dropna=False)
display(counts)
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 8 | 24 | 28 | 12 | 0 | 1 | 0 |
MALE | 0 | 0 | 0 | 0 | 3 | 21 | 27 | 14 | 7 | 1 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 2 | 14 | 13 | 5 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 6 | 12 | 14 | 2 | 0 | |
Gentoo | FEMALE | 4 | 38 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 4 | 19 | 31 | 7 | 0 | 0 | 0 | 0 | 0 |
Next, we need to calculate the cell keys for each cell in the unperturbed frequency table.
For each cell, we used the pandas crosstab
function but to sum the rkeys from the contributing rows in the microdata rather than creating a frequency table. Here, the index is made up of all variables in penguins
except the final one. The columns are then made up of the final column. The variables being used in the index are species
and sex
. The variable used in the columns is bill_depth_mm
.
We then use the modulo operation to calculate a remainder when dividing by a value. In this case, we are using .mod(4)
because our ptable (see below) contains only values 0 to 3.
Python integers do not have a maximum size, therefore we do not need to worry about integer overflow.
The resultant dataframe has the same structure and cell combinations as counts
.
sum_rkey = pd.pivot_table(
penguins,
values=['row_key'],
index=list(penguins.columns)[1:-1],
columns=[list(penguins.columns)[-1]],
aggfunc=np.sum,
fill_value=0,
dropna=False)
cell_keys = sum_rkey.mod(4)
display(cell_keys)
row_key | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 3 | 0 | 2 | 0 | 0 | 3 | 0 |
MALE | 0 | 0 | 0 | 0 | 2 | 0 | 3 | 0 | 1 | 2 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 0 | 3 | 2 | 1 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 2 | 0 | |
Gentoo | FEMALE | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 3 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
Some cells in the frequency table are perturbed. The amount of perturbation is determined by the cell key and frequency value for each cell. A perturbation table (ptable) is used to specify the amount of perturbation to apply for each combination of cell key and frequency value.
Here we have added frequency values up to 3 and then reuse these entries for higher frequency values. In a full implementation it would be possible to cycle though a range of values. Using a small range for valid cell keys and frequency values allows us to create a small perturbation table that can be easily understood.
The structure of this ptable is a pandas dataframe with a multiindex of cell_key
and value
.
data = io.StringIO('''
value,cell_key,peturbation
1 ,0 ,0
1 ,1 ,1
1 ,2 ,-1
1 ,3 ,0
2 ,0 ,2
2 ,1 ,0
2 ,2 ,-1
2 ,3 ,-1
3 ,0 ,-3
3 ,1 ,1
3 ,2 ,1
3 ,3 ,2
''')
ptable = pd.read_csv(data).set_index(["value", "cell_key"])
display(ptable)
peturbation | ||
---|---|---|
value | cell_key | |
1 | 0 | 0 |
1 | 1 | |
2 | -1 | |
3 | 0 | |
2 | 0 | 2 |
1 | 0 | |
2 | -1 | |
3 | -1 | |
3 | 0 | -3 |
1 | 1 | |
2 | 1 | |
3 | 2 |
Now that we have a table of cell keys, we can use our ptable to determine the required perturbation for each cell in counts
.
A dataframe with the same structure as counts
is created containing all zeros.
For every cell in counts
, the corresponding cell value in cell_keys
is extracted. The values and cell keys are then mapped to ptable
and, if not equal to 0, the resultant perturbation value overwrites the 0 value in that cell. Cells with a value greater than 3 will use the value of 3 in the ptable.
perturbations = counts * 0
for col_idx in range(counts.shape[0]):
for row_idx in range(counts.shape[1]):
cell_key = cell_keys.iat[col_idx, row_idx]
count = counts.iat[col_idx,row_idx]
perturbation = 0
if count > 0:
perturbation = ptable.loc[min(count, 3), cell_key]
perturbations.iat[col_idx, row_idx] += perturbation
display(perturbations)
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 2 | -3 | 1 | -3 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 1 | -3 | 2 | -3 | 1 | -1 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 1 | 2 | -3 | -1 | 0 | |
Gentoo | FEMALE | -3 | -3 | -3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 2 | -3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Now we have our unperturbed counts
table and our perturbation
table, we can sum the two to create a perturbed counts
table. The addition of noise means that it is no longer possible to know which small values are real therefore making the table less disclosive.
perturbed_counts = counts + perturbations
display(perturbed_counts)
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 10 | 21 | 29 | 9 | 0 | 1 | 0 |
MALE | 0 | 0 | 0 | 0 | 4 | 18 | 29 | 11 | 8 | 0 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 4 | 16 | 14 | 6 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 7 | 14 | 11 | 1 | 0 | |
Gentoo | FEMALE | 1 | 35 | 12 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 6 | 16 | 32 | 8 | 0 | 0 | 0 | 0 | 0 |
Here we will look at one specific cell of our frequency table to walk through how cell key perturbation is applied to it.
We will take, from penguins
, any rows for MALE Adelie species with a bill depth of 17mm. 3 rows are returned giving us an unperturbed count of 3.
male_adelie_17 = penguins[penguins.sex.isin(["MALE"]) & penguins.species.isin(["Adelie"]) & penguins.bill_depth_mm.isin([17])]
display(male_adelie_17)
display(len(male_adelie_17))
row_key | species | sex | bill_depth_mm | |
---|---|---|---|---|
24 | 1 | Adelie | MALE | 17 |
141 | 1 | Adelie | MALE | 17 |
143 | 0 | Adelie | MALE | 17 |
3
If we sum the row_key's and take the modulo 4 sum, we get a result of 2. Therefore our cell_key
is 2.
print(male_adelie_17.row_key.sum() % 4)
2
Next, we map our value
and cell_key
to the ptable. This gives a perturbation value of 1.
print(int(ptable.loc[3,2]))
1
Finally, we add the perturbation value to the original counts value. We get a perturbed value of 4.
print(perturbed_counts.iat[1,4])
4
The dataframe below highlights perturbed cells with the colour dependent on the amount of perturbation applied. Unhighlighted cells have not had perturbation applied.
def apply_color(x):
colors = {-3: 'red',-2: 'tomato', -1: 'lightcoral', 0: 'white', 1: 'lightgreen', 2: 'limegreen', 3: 'green'}
return perturbations.applymap(lambda val: 'background-color: {}'.format(colors.get(val,'')))
display(perturbed_counts.style.apply(apply_color, axis=None))
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 10 | 21 | 29 | 9 | 0 | 1 | 0 |
MALE | 0 | 0 | 0 | 0 | 4 | 18 | 29 | 11 | 8 | 0 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 4 | 16 | 14 | 6 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 7 | 14 | 11 | 1 | 0 | |
Gentoo | FEMALE | 1 | 35 | 12 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 6 | 16 | 32 | 8 | 0 | 0 | 0 | 0 | 0 |
The original unperturbed counts
table is shown here for comparison
display(counts)
bill_depth_mm | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | |
---|---|---|---|---|---|---|---|---|---|---|---|
species | sex | ||||||||||
Adelie | FEMALE | 0 | 0 | 0 | 8 | 24 | 28 | 12 | 0 | 1 | 0 |
MALE | 0 | 0 | 0 | 0 | 3 | 21 | 27 | 14 | 7 | 1 | |
Chinstrap | FEMALE | 0 | 0 | 0 | 2 | 14 | 13 | 5 | 0 | 0 | 0 |
MALE | 0 | 0 | 0 | 0 | 0 | 6 | 12 | 14 | 2 | 0 | |
Gentoo | FEMALE | 4 | 38 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
MALE | 0 | 4 | 19 | 31 | 7 | 0 | 0 | 0 | 0 | 0 |
If you have any questions about the cell key method or Cantabular, please feel free to get in touch with us at hello@sensiblecode.io.
This demonstration is also available in Jupyter notebook format.