
Dr Peter Hynes

Protecting Magnitude Tables with Cantabular

When publishing statistical tables, ensuring privacy while maintaining data utility is a constant challenge. Cantabular is already used to create safe frequency tables—but what about magnitude tables? This blog describes the work we've done to extend Cantabular to support magnitude data.

Background

Cantabular is a system used by National Statistical Institutes (NSIs) to publish privacy-protected census tables, allowing users to request the specific tables they need on demand. So far, it has mainly been used to share counts: how many people fall into different categories, such as age groups or employment types.

We're exploring ways to extend Cantabular’s capabilities so it can also handle tables of measurements (magnitude data). This means publishing not just counts, but also sums, averages, and other statistical values while still protecting privacy. For example, instead of only showing how many people are in a certain income bracket, it could also show the total income earned by that group or the average income per person.

NSIs already generate privacy-protected magnitude tables, so what makes Cantabular stand out?

Cantabular offers a user-friendly graphical interface, enabling non-expert users to create tables without requiring specialist skills. This makes data access and analysis more efficient and accessible.

It also automates the creation of privacy-protected tables, eliminating the need for time-consuming manual disclosure risk assessments (output checking) by data owners. This streamlining enhances both security and efficiency.

Additionally, Cantabular is highly flexible in deployment. It can function as a publicly accessible tool on the internet, expanding open data availability, or operate within a secure research environment, allowing controlled access to more sensitive data for researchers.

By adding magnitude table support to Cantabular, we can allow NSIs to maximize the utility of their datasets by massively increasing the number of tables that are available.

Ensuring safe and flexible access to statistical data

The role of frequency and magnitude tables

Frequency and magnitude tables are essential statistical outputs used in economic, demographic, and social statistics. Frequency tables count how often specific characteristics appear in a dataset, while magnitude tables present aggregated values like totals or means. For example, a census frequency table may show the number of people by age and ethnicity in a region, while a magnitude table might show their average income.

Protecting privacy with statistical disclosure control (SDC)

To safeguard privacy, Statistical Disclosure Control (SDC) techniques are used to identify and modify potentially disclosive tables to ensure safe publication. This is crucial to prevent the unintended disclosure of information about specific individuals or groups, which could compromise confidentiality and violate data protection regulations.

Flexible Table Builder: England, Wales and Northern Ireland 2021 Census data

When NSIs conduct censuses, they collect vast amounts of data. Billions of possible frequency tables could be generated by summarizing census results across different variable combinations, but typically only a carefully selected subset is published, after applying appropriate SDC methods to ensure confidentiality. This is because each output must be created and checked individually, and the entire set of outputs must also be assessed for disclosure risk as a whole. Generating and reviewing every possible table is infeasible.

Rather than limiting outputs to a predefined set of tables, the Office for National Statistics (ONS) and the Northern Ireland Statistics & Research Agency (NISRA) took a different approach for the 2021 Census. They aimed to maximize data utility by allowing users to query the microdata directly and generate any required frequency table. Cantabular applies SDC in real time as users make queries. This process is known as flexible dissemination.

Extending Cantabular to support magnitude tables

We are currently exploring methods to extend Cantabular so that it can create magnitude tables in the same way it creates frequency tables, with appropriate SDC methods applied in real time. We have worked alongside an NSI to identify an initial methodology and have implemented the algorithms in Cantabular, so that safe magnitude tables can be published.

Protecting statistical outputs: frequency tables

Balancing risk and utility

Publishing data always carries a risk of disclosure, while modifying or suppressing data can reduce its utility. Selecting appropriate SDC methods requires balancing confidentiality protection with maintaining data usefulness.

Frequency tables can be protected by applying techniques such as:

Cell suppression – Hides small counts to prevent identification.
Perturbation – Introduces random noise to mask exact values.
Aggregation – Combines categories to reduce disclosure risk while preserving utility.

Cantabular’s approach to frequency table protection

Since Cantabular allows the creation of tables using arbitrary combinations of variables, it is crucial that the SDC methodology is robust against differencing. Differencing occurs when multiple tables generated from the same dataset are linked, allowing sensitive information to be inferred by comparing overlapping aggregates and deducing individual values.
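As a small illustration of differencing, the sketch below uses made-up, unprotected counts. The two tables, their age ranges, and all figures are hypothetical; the point is only that subtracting one table from a slightly wider one yields an implicit table that was never published.

```python
# Hypothetical, unprotected counts for one area.
# Table A covers ages 16-64; table B covers ages 16-65 (one extra year of age).
table_a = {"rare occupation": 12, "other": 4310}
table_b = {"rare occupation": 13, "other": 4392}


def difference(wide, narrow):
    """Subtract the narrower table from the wider one, cell by cell."""
    return {cell: wide[cell] - narrow[cell] for cell in wide}


# The difference is an implicit table for people aged exactly 65.
aged_65 = difference(table_b, table_a)
print(aged_65)  # {'rare occupation': 1, 'other': 82}
```

The count of 1 singles out one person even though neither published table contains a small cell.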

Cantabular's core approach applies cell-key perturbation to all frequency tables, an SDC technique originally developed by the Australian Bureau of Statistics (ABS). This method adds pseudorandom noise to cell values, protecting small counts and ensuring that real undisclosed values cannot be inferred by comparing related tables, thereby safeguarding against differencing.
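The general idea behind cell-key perturbation can be sketched as follows. This is a minimal illustration, not Cantabular's implementation: the record-key range, the function names, and the toy noise rule (which stands in for a real perturbation table) are all assumptions made for the example.

```python
import random

MAX_KEY = 256  # assumed record-key range; real deployments choose their own


def assign_record_keys(n_records, seed=0):
    """Give every record in the microdata a fixed pseudorandom key."""
    rng = random.Random(seed)
    return [rng.randrange(MAX_KEY) for _ in range(n_records)]


def cell_key(record_keys):
    """Combine the keys of the records in a cell; order does not matter."""
    return sum(record_keys) % MAX_KEY


def toy_noise(count, ckey):
    """Toy stand-in for a perturbation table lookup on (count, cell key)."""
    if count == 0:
        return 0
    noise = (ckey % 5) - 2      # a value between -2 and 2
    return max(noise, -count)   # never push a count below zero


def perturbed_count(record_keys):
    """Return the cell count plus noise determined by the cell key."""
    count = len(record_keys)
    return count + toy_noise(count, cell_key(record_keys))


keys = assign_record_keys(10)
cell = keys[:4]  # the records that fall into one cell
# The same set of records always receives the same noise.
print(perturbed_count(cell) == perturbed_count(list(reversed(cell))))
```

Because the noise depends only on which records are in the cell, the same cell receives the same perturbed value in every table where it appears, which is what makes repeated or overlapping queries harder to exploit.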

Additional measures include:

Zero perturbation algorithm – Adjusts certain zero-count cells for further disclosure protection.
Customizable disclosure rules – Prevent the publication of overly sparse or dominant tables, ensuring robust confidentiality while maintaining data utility.

Protecting statistical outputs: magnitude tables

Magnitude tables: the role of suppression

Cell suppression is the most common SDC method used to protect magnitude tables. Primary suppression hides cells with small counts (e.g. a threshold of 3 or 5) or with dominant contributors. For example, an (n, k) dominance rule with n = 2 and k = 85 identifies a cell as disclosive when its 2 largest units contribute more than 85% of the cell total. Secondary suppression then suppresses additional cells so that the hidden values cannot be recalculated from row and column totals.
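A minimal sketch of these two ideas is shown below. The function name, the default parameters, the reading of the threshold as "fewer than 3 contributors", and the assumption that contributions are non-negative are all choices made for illustration.

```python
def is_disclosive(contributions, min_count=3, n=2, k=85.0):
    """Primary-suppression check: threshold rule plus (n, k) dominance rule.

    Assumes non-negative contributions (e.g. incomes or turnover).
    """
    if len(contributions) < min_count:
        return True  # too few contributors
    total = sum(contributions)
    if total <= 0:
        return False  # dominance is not defined for an empty total in this sketch
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


print(is_disclosive([900, 50, 30, 20]))  # True: top 2 contribute 95%
print(is_disclosive([30, 25, 25, 20]))   # False: top 2 contribute 55%

# Why secondary suppression is needed: a single hidden cell in a row
# with a published total can be recovered by subtraction.
row = {"A": 120, "B": None, "C": 310}  # None marks the suppressed cell
row_total = 500
recovered = row_total - sum(v for v in row.values() if v is not None)
print(recovered)  # 70
```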

Why is suppression unsuitable for flexible dissemination?

While effective, suppression poses several challenges:

Computational expense – Suppression can be resource-intensive, requiring complex calculations.
Utility reduction – Removing numerous cells can significantly reduce data utility.
Complexity in linked tables – Finding a sufficient suppression pattern requires considering all linked tables simultaneously, which is infeasible when a huge number of potential tables is available.

These limitations make suppression unsuitable for flexible dissemination, where tables are generated on demand.

Who else offers automated protection of magnitude tables?

Both Statistics Norway and the Australian Bureau of Statistics have online systems that allow the creation of various statistical outputs, including frequency and magnitude tables.

Statistics Norway: microdata.no

Statistics Norway has developed microdata.no, an online system that allows registered users from approved Norwegian institutions to analyze sensitive microdata on individuals in Norway. Users can generate various outputs, including frequency tables, magnitude tables, and regressions, using a custom command language. SDC methods are applied in real time, eliminating the need for manual output checking.

Australian Bureau of Statistics: TableBuilder

Similarly, the Australian Bureau of Statistics (ABS) offers TableBuilder, which enables registered users to create frequency and magnitude tables (with a limited set of magnitude variables).

Common features

A key feature of both systems is user registration and activity logging, which helps identify and block malicious users attempting to reconstruct original data.

Like Cantabular, both systems employ cell-key perturbation to protect frequency tables. microdata.no ensures that small values (1-4) are perturbed beyond this range, preventing reliable differencing when comparing related tables.

Magnitude table protection: perturbation over suppression

The approaches for magnitude tables differ slightly, but both rely on perturbation-based techniques.

microdata.no employs the following statistical techniques:

Winsorization (top coding) – Caps extreme values to prevent dominance.
Multiplicative perturbation – Adjusts magnitude values in proportion to frequency changes (e.g. a 10% increase in frequency leads to a 10% increase in magnitude).
Disclosure rules – Prevent publication if more than 50% of perturbed cells contain a value of zero.
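The three techniques listed above can be sketched together as follows. This is an illustration under stated assumptions, not the microdata.no implementation: the function names are hypothetical, and the cap is passed in as a parameter because how it is chosen is not described here.

```python
def winsorize_top(values, cap):
    """Top coding: replace any value above the cap with the cap."""
    return [min(v, cap) for v in values]


def perturbed_total(values, perturbed_count, cap):
    """Scale the capped total by the ratio of perturbed to true frequency."""
    true_count = len(values)
    if true_count == 0 or perturbed_count == 0:
        return 0.0
    capped_total = sum(winsorize_top(values, cap))
    return capped_total * perturbed_count / true_count


def table_is_publishable(perturbed_counts, max_zero_share=0.5):
    """Block the table if more than half of its perturbed cells are zero."""
    zeros = sum(1 for c in perturbed_counts if c == 0)
    return zeros <= max_zero_share * len(perturbed_counts)


# Ten made-up incomes, one of them extreme; the cap is an assumed value.
incomes = [10_000] * 9 + [500_000]
# The capped total is 190,000; a frequency perturbed from 10 to 11 (+10%)
# raises the published total by 10% as well.
print(perturbed_total(incomes, perturbed_count=11, cap=100_000))  # 209000.0
print(table_is_publishable([0, 0, 0, 5]))  # False: 75% of cells are zero
```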

TableBuilder applies a variant of cell-key perturbation called the top contributors method, which perturbs the largest contributors to a cell in a multiplicative manner, adding uncertainty to dominant values.
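The flavour of perturbing the largest contributors multiplicatively can be sketched as below. This is only a rough illustration and not the ABS algorithm: the function name, the number of contributors perturbed, the size of the shift, and the use of a seeded random generator in place of a lookup on the cell key are all assumptions.

```python
import random


def top_contributors_total(values, cell_key, top_m=3, max_shift=0.1):
    """Perturb the top_m largest values by seeded multiplicative factors, then sum."""
    rng = random.Random(cell_key)  # same cell key -> same factors
    ordered = sorted(values, reverse=True)
    top, rest = ordered[:top_m], ordered[top_m:]
    perturbed_top = [v * (1 + rng.uniform(-max_shift, max_shift)) for v in top]
    return sum(perturbed_top) + sum(rest)


values = [500, 300, 100, 50, 50]  # made-up contributions, true total 1000
first = top_contributors_total(values, cell_key=42)
second = top_contributors_total(values, cell_key=42)
print(first == second)  # True: the perturbation is repeatable for a given cell key
```

Because only the largest values are shifted, the published total moves most in exactly the cells where a dominant contributor would otherwise be easiest to estimate.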

Both of these NSIs rely on perturbation rather than suppression, which supports the view that suppression is not a suitable approach for flexible dissemination of magnitude tables.

While perturbative methods introduce some inconsistencies in aggregated data, they provide a controlled and secure way to access sensitive microdata while protecting against disclosure risks.

Protecting magnitude tables with Cantabular

Based on our research and in collaboration with an NSI, we have identified and developed a suitable methodology for protecting magnitude tables.

We have adopted an approach that is very similar to the methodology implemented for microdata.no.

Our approach integrates:

Cell-key perturbation
Winsorization (top coding)
Magnitude perturbation
Disclosure rules

We are now evaluating this methodology across diverse datasets to refine it and to confirm that it works well on real-world data.

Demo: EU-SILC magnitude data in Cantabular

To showcase Cantabular’s capabilities in protecting magnitude tables, we have created a demonstration dataset based on synthetic EU-SILC data produced by Eurostat. This dataset allows users to explore how our SDC techniques—including cell-key perturbation, winsorization, and magnitude perturbation—work in practice to safeguard sensitive data while maintaining usability.

Researchers can use Cantabular to access the data they need to test their hypotheses with confidence. For example, the EU-SILC synthetic dataset highlights a clear gender pay gap in Ireland, both in Northern Ireland and the Republic. If real data were loaded, researchers could verify whether this pattern holds true in reality, ensuring their findings are based on accurate, up-to-date statistics.

Screenshot: Gender pay gap in Ireland using EU-SILC synthetic data query in Cantabular

You can access the synthetic dataset on our demonstration websites here:

Other blogs in this series:

Linking EU-SILC Datasets: Unlocking Insights with Cantabular