Photo by Joshua Sortino on Unsplash
Linking EU-SILC Datasets: Unlocking Insights with Cantabular
When working with complex datasets like the EU Statistics on Income and Living Conditions (EU-SILC), the challenge isn’t just managing the data; it’s making it meaningful. Here’s what we did.
About the data
EU-SILC is an annual survey conducted in all EU Member States, Iceland, Norway, Switzerland, and North Macedonia. It provides data on income, poverty, social exclusion, and living conditions.
Eurostat produces public microdata, known as public use files, from the EU-SILC. Although these are modelled on the original data, they are fully synthetic: statistical modelling is used to simulate the structure and distribution of the original data. The synthetic files contain no original records, so they pose no privacy risk to survey respondents. They are made available primarily for training researchers to use the EU-SILC scientific research files, which do contain the original data. That makes them well suited to our demonstration.
The following four synthetic datasets are available for each country and survey year:
Household register (D)
Person register (R)
Household data (H)
Person data (P)
Detailed metadata accompanies each survey period’s data - for example, the 2013 metadata offers a deep dive into variable definitions and data structures.
Cantabular has some powerful new features, supporting magnitude tables and time series data, so we set about creating a demonstration dataset to showcase these exciting new capabilities.
Importing the data into Cantabular
The first step in leveraging Cantabular’s powerful querying capabilities was linking the person data (P) to the household register (D) using the synthetic household ID, which is available on each of the datasets. This connection allowed us to enrich person-level records with geographic details like country and NUTS1 region.
We developed a Python script to automate the linking process, extracting only the necessary columns from both datasets. With this script we concatenated the output files from each country’s 2013 synthetic survey data into a single, unified file ready for import into Cantabular.
From the Person data we imported several useful categorical columns: NACE, Sex, Economic Status and Year of Birth.
Screenshot: Snippet of Python code written to process and link data
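For readers who want a feel for the linking step, here is a minimal pandas sketch. It is not the script shown in the screenshot. The column names PX030 and DB030 (household ID), DB020 (country), DB040 (region), PB140 (year of birth) and PB150 (sex) are our assumptions about the file layout, and the function names are purely illustrative.

```python
from typing import Iterable, Tuple

import pandas as pd

# Assumed column names, not confirmed by the article:
#   PX030 = household ID on the person file (P)
#   DB030 = household ID on the household register (D)
#   DB020 = country, DB040 = region, PB140 = year of birth, PB150 = sex
PERSON_COLS = ["PX030", "PB140", "PB150", "PL031", "PY010G", "PY035G"]
REGISTER_COLS = ["DB030", "DB020", "DB040"]


def link_person_to_register(person: pd.DataFrame, register: pd.DataFrame) -> pd.DataFrame:
    """Attach country and region from the register to each person row."""
    linked = person[PERSON_COLS].merge(
        register[REGISTER_COLS],
        left_on="PX030",
        right_on="DB030",
        how="left",
        validate="many_to_one",  # fail if a household ID repeats in the register
    )
    return linked.drop(columns="DB030")


def link_all_countries(
    files: Iterable[Tuple[pd.DataFrame, pd.DataFrame]]
) -> pd.DataFrame:
    """Link each country's (person, register) pair, then stack the results."""
    return pd.concat(
        [link_person_to_register(p, d) for p, d in files],
        ignore_index=True,
    )
```

Linking each country's files before concatenating avoids relying on household IDs being unique across countries.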
The Year of Birth variable is particularly useful for calculating the age of respondents at the time of the survey in 2013, a dynamic metric that plays a critical role in age-based analysis.
By importing variables as “measures” we unlocked aggregation capabilities like sum and mean; we chose to include Employee cash or near cash income (PY010G) and Contributions to individual private pension plans (PY035G) in our dataset.
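To illustrate what sum and mean aggregation over a measure produces, here is a small pandas analogy. It is not Cantabular's API; the function name is hypothetical and the country column DB020 is an assumed name.

```python
import pandas as pd


def magnitude_table(df: pd.DataFrame, by: str, measure: str) -> pd.DataFrame:
    """Return the sum and mean of a numeric measure for each category of `by`."""
    return df.groupby(by)[measure].agg(["sum", "mean"])
```

Calling `magnitude_table(linked, "DB020", "PY010G")` on a linked person table would give one row per country with the total and average employee income.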
Scaling up with Time-Series data
Cantabular’s support for time-series data allowed us to broaden our scope. We expanded our Python script to include the Survey Year (DB010), enabling multi-year analysis. We loaded data from 2010 through 2013 across a selection of European countries.
However, this scaling introduced a new challenge: our original age calculation, based on 2013, was no longer sufficient. To solve this, we adapted our script to calculate age by subtracting the Year of Birth from the respective Survey Year. This small tweak ensured our data remained accurate across different time periods.
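The adapted age calculation can be sketched in a few lines of pandas. We assume here that Year of Birth is held in a column named PB140 and that the derived column is called AGE; both names are illustrative.

```python
import pandas as pd


def add_age(
    df: pd.DataFrame,
    survey_year_col: str = "DB010",  # Survey Year, as named in the article
    birth_year_col: str = "PB140",   # assumed name for Year of Birth
) -> pd.DataFrame:
    """Return a copy of df with AGE = survey year minus year of birth."""
    out = df.copy()
    out["AGE"] = out[survey_year_col] - out[birth_year_col]
    return out
```

Because only years are subtracted, the result is the age a respondent reaches during the survey year rather than their exact age on the interview date.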
Linking household data for deeper insights
While person-level data provides valuable insights, combining it with household-level variables opens the door to richer, more nuanced analysis. Again linking data on household ID, this time we joined the Household Data (H) to the Person Data (P).
We extended our Python script to process household data, selecting variables that could shed light on living conditions and financial resilience such as: Dwelling type (HH010), Ability to keep home adequately warm (HH050), Capacity to afford paying for one week annual holiday away from home (HS040), Ability to make ends meet (HS120) and Crime violence or vandalism in the area (HS190).
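A household join along these lines might look like the sketch below. With several survey years loaded, we assume the join key needs the year and country as well as the household ID, since an ID may recur across files. The key column names (HB010, HB020, HB030 on the household file; DB020 and PX030 on the person table) are assumptions, and the function name is hypothetical.

```python
import pandas as pd

# Household variables listed in the article
HOUSEHOLD_VARS = ["HH010", "HH050", "HS040", "HS120", "HS190"]

# Assumed key columns: survey year, country, household ID
PERSON_KEYS = ["DB010", "DB020", "PX030"]
HOUSEHOLD_KEYS = ["HB010", "HB020", "HB030"]


def add_household_vars(persons: pd.DataFrame, households: pd.DataFrame) -> pd.DataFrame:
    """Copy the selected household variables onto every person in that household."""
    merged = persons.merge(
        households[HOUSEHOLD_KEYS + HOUSEHOLD_VARS],
        left_on=PERSON_KEYS,
        right_on=HOUSEHOLD_KEYS,
        how="left",
        validate="many_to_one",  # one household row per year, country and ID
    )
    return merged.drop(columns=HOUSEHOLD_KEYS)
```

Every member of a household receives the same household-level values, which is what makes person-by-household cross-tabulations possible later on.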
From data to insights: querying with Cantabular
With our time-series, multi-level dataset fully integrated into Cantabular, the possibilities for analysis expanded dramatically.
To delve deeper into the interplay between individual health conditions and household circumstances, we incorporated additional person-level variables into our dataset: General Health (PH010), Chronic (long-standing) illness or condition (PH020) and Limitation in activities because of health problems (PH030).
Screenshot: Pivot your table in the Cantabular Public UI
What we have achieved with the synthetic data can be replicated with the original data. Were this demonstration based on the original rather than the public synthetic data, Cantabular’s robust querying capabilities would let us explore several insightful linkages:
Impact of Health on Financial Stability:
We could examine the relationship between chronic illness (PH020) and the ability to make ends meet (HS120), disaggregating by self-defined economic status (PL031). This could reveal trends where households with chronically ill members are more likely to report financial strain, even after controlling for economic status.
Living Conditions and Health Outcomes:
By linking general health (PH010) with the ability to keep home adequately warm (HH050), we might identify a correlation between poor heating conditions and deteriorating self-reported health.
Crime and Physical Limitations:
We could explore whether perceptions of crime, violence, or vandalism in the area (HS190) are associated with higher reports of limitations in daily activities due to health problems (PH030). This could provide insights into the broader environmental factors impacting personal health.
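Outside Cantabular, the first of these queries can be approximated with a pandas cross-tabulation. This is only an analogy to show the shape of the table, not how Cantabular computes it, and the function name is hypothetical.

```python
import pandas as pd


def illness_by_ends_meet(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate chronic illness (PH020) against ability to make ends
    meet (HS120), disaggregated by self-defined economic status (PL031).

    Each row holds the share of people in that economic status and illness
    group who fall into each HS120 category, so every row sums to 1.
    """
    return pd.crosstab(
        index=[df["PL031"], df["PH020"]],
        columns=df["HS120"],
        normalize="index",
    )
```

Swapping the column names gives the other two comparisons, for example PH010 against HH050 or PH030 against HS190.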
The ability to effortlessly aggregate and disaggregate across time and dimensions highlights the power of Cantabular.
What started as disparate files split across countries and years became a cohesive dataset that could reveal critical insights into living conditions across Europe.
Enriching data with metadata integration
One of Cantabular's key features is its multilingual metadata service. The detailed metadata provided in resources like the 2013 EU-SILC metadata PDF document can be extracted and attached to our dataset in Cantabular. This means each variable imported into Cantabular is enriched with its corresponding metadata, ensuring users have immediate context for their analysis.
Screenshot: Cantabular variable metadata in Irish (Gaeilge)
In the Cantabular UI, an intuitive "i" (info) icon next to each variable lets users see a detailed description, including any relevant definitions, coding schemes, and data collection methods. The same information is available to those accessing Cantabular through its rich API.
Seamless access to metadata not only enhances the interpretability of complex variables but also fosters transparency and trust.
By bridging raw data with comprehensive documentation, Cantabular empowers analysts to dive deeper, faster, and with greater confidence in their results.
You can access the synthetic dataset on our demonstration websites here: