Unlocking the value in historic records with Cantabular
At the end of April, we at Sensible Code will run an event to demonstrate our privacy-preserving software, Cantabular, using the 1911 Ireland census (the last all-Ireland census). We want to show how modern technologies can unlock the value hidden in historic records for the benefit of academics, researchers, statistical organisations and wider society.
When a census collides with political reality
After 1911, the next census did not happen until 1926, by which time there were two separate jurisdictions on the island of Ireland.
During that intervening period an awful lot happened, culminating in the creation of two separate states to accommodate the opposing priorities of the unionist and nationalist traditions. These two traditions are very closely linked to religious adherence and geographical location.
In some respects, you could argue that the 1911 census is a jumping-off point for this revolutionary period in Irish history, the consequences of which we still see to this day.
From raw records to dynamic datasets
Having alighted on the idea of exploring the 1911 census, we carefully scraped and recompiled the microdata from the excellent National Archives of Ireland website. Under Irish law, the individual census returns are available in electronic form from the National Archives.
Armed with the raw data, we processed, cleaned and categorised it, and automated the creation of a codebook to describe its structure. We joined the microdata with the codebook and loaded both of these into our software, Cantabular.
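To give a flavour of what automating a codebook can mean, here is a minimal Python sketch. It is not our actual pipeline: the records are a handful of invented rows, the variable names (townland, religion, irish_language) are placeholders chosen for illustration, and build_codebook is a hypothetical helper. The idea is simply to walk over the cleaned microdata and record, for each variable, which categories occur and how often.

```python
from collections import Counter

# Invented toy rows standing in for cleaned microdata (not real 1911 returns).
# The variable names are illustrative assumptions, not the real schema.
records = [
    {"townland": "Townland A", "religion": "Roman Catholic", "irish_language": "Irish and English"},
    {"townland": "Townland A", "religion": "Presbyterian", "irish_language": "None"},
    {"townland": "Townland B", "religion": "Roman Catholic", "irish_language": "None"},
    {"townland": "Townland B", "religion": "Church of Ireland", "irish_language": "None"},
]


def build_codebook(rows):
    """Return {variable: Counter({category: count})} for every variable seen."""
    codebook = {}
    for row in rows:
        for variable, category in row.items():
            # Create the per-variable counter on first sight, then tally.
            codebook.setdefault(variable, Counter())[category] += 1
    return codebook


codebook = build_codebook(records)

# Print each variable with its categories, sorted for a stable listing.
for variable, categories in codebook.items():
    print(variable, sorted(categories.items()))
```

A real codebook would also carry labels, descriptions and category orderings, but the same pass over the data is a reasonable starting point for generating one automatically rather than by hand.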
At this point two things became possible: firstly, we could recreate the 1911 reports electronically. Of course we are not yet near a statistical product on a par with what a national statistical institute could produce, but we are now on that road and are hoping to improve this dataset over time.
Secondly, and more importantly, Cantabular allows for new and interesting cross-tabulations of the data, opening up avenues for analysis that were previously impossible.
For example, a researcher could query Irish language by age for each of the roughly 70,000 townlands, or look at the female population along the industrialised Lagan Valley to correlate occupation and health, by number of children. While this type of analysis is normal for a modern census, this was not the case when these reports were released in 1913.
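For readers unfamiliar with cross-tabulation, the Irish language by age example can be sketched in a few lines of Python. This is only an illustration of the counting involved, under some loud assumptions: the rows are invented, the field names and age bands are placeholders, and nothing here reflects how Cantabular works internally or applies any of its privacy protection.

```python
from collections import Counter

# Invented toy rows (not real census returns); field names are assumptions.
people = [
    {"townland": "Townland A", "age": 8, "speaks_irish": True},
    {"townland": "Townland A", "age": 34, "speaks_irish": False},
    {"townland": "Townland A", "age": 71, "speaks_irish": True},
    {"townland": "Townland B", "age": 12, "speaks_irish": False},
    {"townland": "Townland B", "age": 45, "speaks_irish": True},
    {"townland": "Townland B", "age": 52, "speaks_irish": True},
]


def age_band(age):
    """Map an age in years to a coarse, illustrative age band."""
    if age < 15:
        return "0-14"
    if age < 65:
        return "15-64"
    return "65+"


def cross_tabulate(rows):
    """Count people for each (townland, age band, speaks Irish) combination."""
    table = Counter()
    for row in rows:
        key = (row["townland"], age_band(row["age"]), row["speaks_irish"])
        table[key] += 1
    return table


table = cross_tabulate(people)

# One output line per non-empty cell of the cross-tabulation.
for (townland, band, speaks_irish), count in sorted(table.items()):
    print(townland, band, "Irish speaker" if speaks_irish else "non-speaker", count)
```

The point is that once the microdata is held at the level of the individual record, any combination of variables can be tabulated on demand, rather than being limited to the tables chosen for the printed reports.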
This is adding new value to historical data, and combined with the application programming interfaces (APIs) in Cantabular and modern analysis and visualisation tools, all sorts of interesting things become possible.
Working together with the Central Statistics Office Ireland (CSO) and OpenStreetMap Ireland (OSM)
At our event in April we will talk more about how all of this work has happened. We will also recognise the support and encouragement of the CSO and acknowledge the foundational work of the National Archives.
We realised early on in this project that while it’s great to be able to generate new tabulations, it’s so much better if we can show them on a map. So we have been working with OpenStreetMap Ireland to see if we can make use of improved geographic boundary data for 1911, initially focusing on getting Ulster complete and geospatially correct.
Unlocking new insights and analysis
Through our work on this unique census we hope to empower researchers and historians to dig deep into the data and potentially unlock secrets previously hidden.
During our event, we will look at some examples to give a flavour of what is possible.
Perhaps we could delve into why the Church of Ireland Gazette said in March 1929 that ‘there followed the forced exodus of large numbers of their parishioners at the height of the troubles’. Could we look at the concentrations of religious communities in 1911 at the townland level to understand the nature of some of these communities more closely?
Over the next few weeks we’ll release a number of blog posts which will give more detail on the journey we’ve taken to transform the Irish 1911 census into a real-time rich dataset.