Irish Census 1911: Data collection and processing methodology

Aim of this document

This document describes the scope, methods, assumptions and limitations of the work that The Sensible Code Company and collaborators have collectively completed on historical Ireland census data, collected from The National Archives of Ireland’s website, in order to publish a publicly available interactive dataset at https://ireland-census-preview.cantabular.com.

Licence

As this dataset makes use of publicly available open data, all cross-tabulations derived from it are also made available under the terms of an open data licence.

Publication of data

This work enables fast querying and publication of statistical tables from the Irish 1911 census data using The Sensible Code Company’s software Cantabular, and is available online free of charge. It builds on the excellent work that the National Archives of Ireland have completed in digitising the original census returns.

The Sensible Code Company has made best efforts in a limited time to deal with obvious inaccuracies in the data based on a comparison with the original published reports. However, there are still some potential quality control issues with the source data that have not yet been corrected. Some of these issues are documented herein. There may be other data quality issues not yet discovered.

Any tables, plots, maps or results derived or shared from this dataset are provisional and may change in future versions of this data release.

Future of the data

The dataset as presented is a preliminary dataset. The Sensible Code Company is collaborating with the Central Statistics Office (CSO), which has taken up this work, and will publish a link to the final dataset in due course, once it is ready for publication. If changes to the variables in this dataset are required in the interim, they will be noted in future versions of this document and, where relevant, in the metadata for the relevant variable.

Processing methodology

National Archives dataset

There were several steps involved in going from census data published as HTML tables on the National Archives of Ireland’s website to a clean dataset that could be loaded in Cantabular:

  1. Data collection from the National Archives’ address-level census tables via an automated script. This produced a comma-separated values (CSV) format file containing raw microdata for the census. This CSV file is value-based: each CSV row contains details about an individual in the census, along with their digitised responses as published on the National Archives’ site.

  2. Addition of geographical identifying codes (US Census Bureau-style GEOIDs) to the value-based CSV to uniquely identify each geographic area within the data, so that, for example, streets with the same name can be told apart. A few areas within the geography hierarchy that were found to be incorrectly placed were migrated to their correct locations.

  3. The raw microdata values in the collected data were then reviewed. Canonical values were derived by manual review of the data, and these categorisations were then applied to the underlying data programmatically. For instance, raw birthplace values of “County Cork”, “Co. Cork”, and “Cork County” would all be assigned to the “County Cork” category. Not all inputs have yet been categorised, especially for the variables with many distinct inputs (occupation, religion and birthplace). More information on the categorisations for individual variables is given in the Dataset variables section of this document.

  4. The value-based microdata was processed to apply the canonical categories and perform some simple imputations to try to fill missing data. For instance, missing ages for records with a relation to the head of the family of “son”, “daughter”, “grand son” and so on would all be assigned an age of zero. The output of this step was a “code-based” microdata CSV file, with the raw microdata values now substituted with appropriate codes for categories, and “codebook” files that describe:

    • which categories are associated with each code, and
    • the geographic hierarchy
  5. The code-based microdata CSV and codebook were then loaded into Cantabular’s cantabular-make-dataset software to generate a Cantabular dataset file—a binary format optimised for load time and query speed. There is no statistical disclosure control applied to this dataset as the underlying data is in the public domain. There is, however, a single rule applied in Cantabular that limits the maximum number of publishable cells in a table. This rule exists primarily to avoid excess server computation if someone selects tables far larger than would ever be likely useful. This limit is currently set to 50,000,000 cells.

  6. The dataset file, along with associated metadata describing the data and variables, is then loaded into a Cantabular server. A free-to-use public server has been set up that allows access to the data via a simple web user interface for table building, or the available application programming interface (API) for software integration and automated querying.
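The cell limit described in step 5 can be illustrated with some simple arithmetic. The sketch below assumes that the number of cells in a table is the product of the category counts of the variables selected (ignoring any totals); how Cantabular itself counts cells is not described here, and the function names are hypothetical. The category counts are those quoted in the Dataset variables section.

```python
from math import prod

# Limit quoted in step 5 above.
MAX_CELLS = 50_000_000

# Category counts quoted in the Dataset variables section of this document.
CATEGORY_COUNTS = {
    "Province": 4,
    "County": 32,
    "District Electoral Division (DED)": 3_659,
    "Townland or Street": 70_610,
}


def table_cells(variables):
    """Number of cells in a cross-tabulation (assumed: product of category counts)."""
    return prod(CATEGORY_COUNTS[name] for name in variables)


def within_limit(variables, max_cells=MAX_CELLS):
    """True if the table would not exceed the cell limit."""
    return table_cells(variables) <= max_cells


county_by_townland = table_cells(["County", "Townland or Street"])
ded_by_townland = table_cells(
    ["District Electoral Division (DED)", "Townland or Street"]
)
```

Under these assumptions a County by Townland or Street table has 2,259,520 cells and is within the limit, while a DED by Townland or Street table has 258,361,990 cells and is not.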

Many of the operations described above are codified in computer programs written in the Python programming language and stored in source version control software Git. The data analysis library Pandas was used to manipulate data and Jupyter notebooks were often used to contain and document code. All of this helps to ensure that the process is as reproducible as possible.
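As an illustration of the kind of Pandas code involved, the following is a minimal sketch of turning value-based microdata into code-based microdata plus a codebook (step 4 above). The column names, the codebook layout and the choice of integer codes are assumptions made for the example; the actual file formats are not described in this document.

```python
import pandas as pd

# Hypothetical value-based microdata after canonical categories were applied.
value_based = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "sex": ["Male", "Female", "Not specified", "Female"],
})

# A fixed category order keeps the codes stable between runs.
sex_categories = ["Not specified", "Male", "Female"]

# Codebook: one row per (code, category label) pair.
sex_codebook = pd.DataFrame({
    "code": list(range(len(sex_categories))),
    "label": sex_categories,
})

# Code-based microdata: each label is replaced by its integer code.
code_based = value_based.copy()
code_based["sex"] = pd.Categorical(
    value_based["sex"], categories=sex_categories
).codes
```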

Quality assurance

As part of this work, a quality assurance process was undertaken to validate the results against 1911 original tables published in 1913 and sourced from Histpop.

  1. Extract the tabular data from the 1911 image TIFF file to CSV:

    1. Download TIFF file from histpop.org.
    2. Convert TIFF file to PDF (scanned) using tiff2pdf.com.
    3. Convert PDF (scanned) to PDF (text-based) with OCR technology using PDF2Go or Online OCR.
    4. Convert PDF (text-based) to CSV using PDFTables (made by Sensible Code).
  2. Collate and normalise the CSV data: After the tables of data were extracted from the TIFF files to CSV, the data could be collated and normalised in Python to allow comparison with the outputs from the Cantabular dataset. The data were reformatted by:

    1. Loading the CSV files into Python using Pandas.
    2. Restructuring the original report data into a usable table, then reformatting and cleaning the data. For example, OCR conversion can produce garbled and incorrect outputs such as “Londonderry” appearing as “Londond rr5”.
    3. Normalising the data by removing common and unnecessary characters such as “,”, “(” and “)”.
    4. Making new categories. For example, in some cases data from the original reports required the creation of new categories (from multiple existing categories). “Cork E,R” and “Cork W,R” become one category “Cork”.
    5. Intervening manually for some edge cases: the TIFF files and CSV data were checked by eye and counts were corrected where necessary.
  3. Run queries to generate equivalent tables from the Cantabular dataset: A query was made through Cantabular which matched (or was close to) the original report and the data were downloaded in CSV format.

  4. Output comparison: A comparison was carried out of original and Cantabular-generated counts for a subset of variables by:

    1. Creating a Pandas dataframe containing a cross-tabulation of the variable by county for the Cantabular dataset. If the classification of variables was not an exact match to the 1911 original table, transformations were carried out in order to create a match, such as banding the “Age” variable or amalgamating multiple “Religion” categories into one. Totals for both rows and columns were added to the cross-tabulations. This data was also normalised by removing common and unnecessary characters such as punctuation.
    2. Creating a second dataframe containing a cross-tabulation of the variable by county from the relevant 1911 original table. Totals for both rows and columns were added to the cross-tabulations.
    3. Subtracting the ‘1911 original table’ dataframe from the Cantabular dataframe in order to generate a new dataframe showing the absolute difference between each category and county combination for the variable being compared.
    4. Creating a final dataframe to show the percentage difference for each category and county combination for the variable being compared between the 1911 original table and the Cantabular data.
    5. Once these differences had been identified an investigation was carried out in order to determine why the differences occurred and if the data could be corrected to reduce the differences.
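The normalisation and comparison steps above can be sketched as follows. This is an illustration only: the counts are invented, the helper names are hypothetical, and the real tables have many more rows and columns.

```python
import pandas as pd


def clean_count(text):
    """Strip the punctuation mentioned above and convert to an integer."""
    for char in (",", "(", ")"):
        text = text.replace(char, "")
    return int(text.strip())


def add_totals(table):
    """Return a copy of the table with a total row and a total column."""
    table = table.copy()
    table.loc["Total"] = table.sum()
    table["Total"] = table.sum(axis=1)
    return table


# Invented counts in the style of an extracted original report table.
original = pd.DataFrame(
    {"Male": ["1,200", "(300)", "450"], "Female": ["1,100", "350", "400"]},
    index=["Antrim", "Cork E,R", "Cork W,R"],
)
original = original.apply(lambda column: column.map(clean_count))

# Combine the two Cork categories into a single "Cork" category.
original = original.rename(index={"Cork E,R": "Cork", "Cork W,R": "Cork"})
original = original.groupby(level=0).sum()

# Invented equivalent table as queried from the Cantabular dataset.
cantabular = pd.DataFrame(
    {"Male": [1190, 760], "Female": [1100, 745]},
    index=["Antrim", "Cork"],
)

original_totals = add_totals(original)
cantabular_totals = add_totals(cantabular)

# Difference in counts, and the same difference as a percentage of the original.
difference = cantabular_totals - original_totals
percentage_difference = (difference / original_totals * 100).round(2)
```

Cells with a large percentage difference are the ones that would be investigated further.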

An HTML file has been made available to show examples of quality assurance carried out as described above.

Investigation into data differences

After some manual exploration of the differences identified in the quality assurance process, a number of different explanations for the differences were identified.

A video recording has been published of an online event “1911 Irish census and technology: bringing past, present & future together”, hosted by The Sensible Code Company. One section of the video discusses some of these issues.

The most significant of these issues were as follows:

  • The “Irish language” variable was found to differ significantly from the original tables. In some areas, more digitised returns were classified as being able to speak Irish than were counted in the original reports. This issue is addressed in more detail in the Dataset variables section below.
  • The “Occupation” variable has not been published as part of this dataset due to discrepancies with the original reports. It requires a considerable amount of work to process as it had around 180,000 unique responses. Our work to classify it has so far achieved approximately 92%–94% accuracy, but this is considerably short of our target of 98%–99%. This issue is addressed in more detail in the Dataset variables section below.
  • We identified some discrepancies in the geographic hierarchy in the digitised data. For example, on the National Archives of Ireland website three Belfast District Electoral Divisions (DEDs)—Pottinger, Ormeau and Victoria—are shown as part of County Down. In the 1911 census reports these are shown as part of County Antrim. We also found some missing DEDs: for example the DED of Inver in County Donegal is included in the 1911 census reports but is not present in the National Archives hierarchy and the townlands that should be in Inver are instead included within the Dunkineely DED.
  • A number of 1911 census returns were written in the Irish language; however, the digitisation process passed over these because the system could not interpret these responses. This has resulted in an undercount for some of the variables.
  • An undercount was also noticed for the “Religion” variable. This was due to people filling out forms using symbols to denote common responses or to indicate that their response was the same as the one above, for example “-” or “- -”. This issue is addressed in more detail in the Dataset variables section below in the “Religion” and “Religion (naive imputation)” variables.

Once all changes were made, the Cantabular dataset was updated with the modified variables.

As this dataset is still a preview rather than a finalised statistical product, other errors may remain, for example:

  • There are cases where individuals have provided the wrong data in the wrong column of the census form. These obvious errors may have been corrected by the original census collectors but they persist in the digitised records.
  • There may also be errors introduced by the OCR step in the original digitisation process.

This is a continuous process and it is expected that as feedback is gathered on the data more discrepancies will be identified and hopefully resolved.

Dataset variables

Province

The Irish province in which a person lives. Either Ulster, Leinster, Munster or Connacht.

Categories:

  • Ulster
  • Leinster
  • Munster
  • Connacht

County

The Irish county in which a person lives. The 1911 county names “King’s Co.” and “Queen’s Co.” are now referred to as “Offaly” and “Laois” respectively.

Categories:

  • Antrim
  • Armagh
  • Carlow
  • Cavan
  • Clare
  • Cork
  • Donegal
  • Down
  • Dublin
  • Fermanagh
  • Galway
  • Kerry
  • Kildare
  • Kilkenny
  • King’s Co.
  • Leitrim
  • Limerick
  • Londonderry
  • Longford
  • Louth
  • Mayo
  • Meath
  • Monaghan
  • Queen’s Co.
  • Roscommon
  • Sligo
  • Tipperary
  • Tyrone
  • Waterford
  • Westmeath
  • Wexford
  • Wicklow

District Electoral Division (DED)

The Irish District Electoral Division (DED) in which a person lives.

Categories: 3,659 categories

Townland or Street

The Irish townland or street in which a person lives.

Categories: 70,610 categories

Birthplace

Birthplace is where a person was born. If the person was born in Ireland then the county (if known) is provided. Otherwise the country of birth is given.

It is derived from the “Birthplace” field in the digitised records. Those records contain 35,475 different free text entries.

A lookup table has been used to map 95% of the values in the digitised records to two separate classifications of this variable: one of four categories and one of 41 categories.

This required creating a Python dictionary containing the 150 most frequent raw values. Each was mapped to either one of the 32 Irish counties, “Not specified”, “England”, “India”, “Russia”, “Scotland”, “United States of America”, “Wales” or “Other/not classified”.

The “Not specified” category is used for digitised records where no value for “Birthplace” is specified. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.

An example Python dictionary showing assignment of variable values to canonical values is shown below:

import pandas as pd

# Maps raw "Birthplace" values from the digitised records to canonical categories.
birthplace_category_mapping = {
  "Co Cork": "County Cork, Ireland",
  "Co Antrim": "County Antrim, Ireland",
  "Co Down": "County Down, Ireland",
  "Co Galway": "County Galway, Ireland",
  "Co Mayo": "County Mayo, Ireland",
  "Dublin City": "County Dublin, Ireland",
  "Co Donegal": "County Donegal, Ireland",
  "Belfast": "County Antrim, Ireland",
  "Co Kerry": "County Kerry, Ireland",
  "Co Tipperary": "County Tipperary, Ireland",
  "Co Tyrone": "County Tyrone, Ireland",
  "Co Armagh": "County Armagh, Ireland",
  "Co Dublin": "County Dublin, Ireland",
  "Co Limerick": "County Limerick, Ireland",
  # A lone dash is treated as a missing value.
  "-": pd.NA,
  # …
}
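A minimal sketch of how such a dictionary might be applied to the raw values is given below. It assumes that missing values (including values mapped to pd.NA) end up in “Not specified” and that any value absent from the dictionary ends up in “Other/not classified”, as described above; the function name and the sample data are hypothetical.

```python
import pandas as pd

# A few entries in the style of the dictionary above.
birthplace_category_mapping = {
    "Co Cork": "County Cork, Ireland",
    "Belfast": "County Antrim, Ireland",
    "-": pd.NA,
}

# Hypothetical raw values as they might appear in the digitised records.
raw = pd.Series(["Co Cork", "Belfast", "-", None, "Glasgow", "Co Cork"])


def categorise(value):
    """Return the canonical birthplace category for one raw value."""
    if pd.isna(value):
        return "Not specified"
    if value not in birthplace_category_mapping:
        return "Other/not classified"
    mapped = birthplace_category_mapping[value]
    # Values deliberately mapped to a missing marker are also "Not specified".
    return "Not specified" if pd.isna(mapped) else mapped


birthplace = raw.map(categorise)
```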

Categories (4 categories):

  • Not specified
  • Ireland
  • England, Scotland or Wales
  • Other/not classified

Categories (41 categories):

  • Not specified
  • County Antrim, Ireland
  • County Armagh, Ireland
  • County Carlow, Ireland
  • County Cavan, Ireland
  • County Clare, Ireland
  • County Cork, Ireland
  • County Donegal, Ireland
  • County Down, Ireland
  • County Dublin, Ireland
  • County Fermanagh, Ireland
  • County Galway, Ireland
  • County Kerry, Ireland
  • County Kildare, Ireland
  • County Kilkenny, Ireland
  • County Leitrim, Ireland
  • County Limerick, Ireland
  • County Londonderry, Ireland
  • County Longford, Ireland
  • County Louth, Ireland
  • County Mayo, Ireland
  • County Meath, Ireland
  • County Monaghan, Ireland
  • County Roscommon, Ireland
  • County Sligo, Ireland
  • County Tipperary, Ireland
  • County Tyrone, Ireland
  • County Waterford, Ireland
  • County Westmeath, Ireland
  • County Wexford, Ireland
  • County Wicklow, Ireland
  • King’s County, Ireland
  • Queen’s County, Ireland
  • County not specified, Ireland
  • England
  • India
  • Russia
  • Scotland
  • United States of America
  • Wales
  • Other/not classified
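The relationship between the two classifications can be sketched as a simple function from the 41-category labels to the 4-category labels. This is an assumption made for illustration: the document states only that the lookup table maps to both classifications, not that one is derived from the other, and the placement of India, Russia and United States of America under “Other/not classified” is a guess. The function name is hypothetical.

```python
def birthplace_4(category_41):
    """Collapse a 41-category birthplace label into a 4-category label."""
    if category_41 in ("Not specified", "Other/not classified"):
        return category_41
    # Every Irish label in the 41-category list ends with ", Ireland".
    if category_41.endswith(", Ireland"):
        return "Ireland"
    if category_41 in ("England", "Scotland", "Wales"):
        return "England, Scotland or Wales"
    # Assumption: the remaining named countries fall under this category.
    return "Other/not classified"


examples = {
    label: birthplace_4(label)
    for label in ("County Cork, Ireland", "Wales", "India", "Not specified")
}
```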

Sex

The classification of a person as either male or female.

This is based on the “Sex” field in the digitised census records. Any missing values are assigned to the “Not specified” category.

Categories:

  • Not specified
  • Male
  • Female

Age

Age is the age in years of a person at the time of the census.

It is derived from the “Age” field in the digitised records. In those records the age is given as single years in the range 0–114.

Where the record had a missing age value and had a relation to the head of the family of “son”, “daughter”, “grand son”, “grand daughter”, “grand child”, “niece”, “nephew”, “grand niece” or “grand nephew”, the age is assumed to be zero.
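The imputation rule above might look like the following in Pandas. This is a sketch under assumptions: the column names are hypothetical, and matching the relation case-insensitively after trimming whitespace is a choice made for the example rather than something stated in this document.

```python
import pandas as pd

# Relations for which a missing age is assumed to be zero (from the text above).
CHILD_RELATIONS = {
    "son", "daughter", "grand son", "grand daughter", "grand child",
    "niece", "nephew", "grand niece", "grand nephew",
}

# Hypothetical records; None marks a missing age.
people = pd.DataFrame({
    "relation": ["Head of Family", "Son", "grand daughter", "Wife", "Nephew"],
    "age": pd.array([45, None, None, None, 12], dtype="Int64"),
})

# Assumption: relations are compared in lower case with whitespace trimmed.
is_child = people["relation"].str.strip().str.lower().isin(CHILD_RELATIONS)
needs_imputation = people["age"].isna() & is_child

# Only missing ages of child relations are set to zero; others stay missing.
people["age_imputed"] = people["age"].mask(needs_imputation, 0)
```

In this example the wife's missing age stays missing, because “wife” is not one of the listed relations.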

Age was classified as two variables, one with 22 categories and a second with 78 categories:

  • 22 categories of age: In this classification ages between 0 and 99 are given in 5-year bands and all ages 100 and over are grouped into a single category.
  • 78 categories of age: In this variable ages between 0 and 69 are given as single years, ages between 70 and 99 are given in 5-year bands and all ages 100 and over are grouped into a single category.
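The two bandings can be sketched as plain functions, treating 100 and over as the single open-ended category so that the last 5-year band is 95 to 99. The function names and label formats are hypothetical. For ages 0 to 114 these functions produce 21 and 77 distinct labels respectively; the one remaining category in each classification is presumably for records with no age, which is an assumption.

```python
def age_band_22(age):
    """5-year bands up to age 99, then a single open-ended category."""
    if age >= 100:
        return "100 and over"
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


def age_band_78(age):
    """Single years up to 69, 5-year bands from 70 to 99, then open-ended."""
    if age >= 100:
        return "100 and over"
    if age < 70:
        return str(age)
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


# Distinct labels produced for the age range found in the digitised records.
labels_22 = {age_band_22(age) for age in range(115)}
labels_78 = {age_band_78(age) for age in range(115)}
```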

Relation to head of family

Relation to head of family classifies people based on their relationship to the head of the family.

It is derived from the “Relation to head” field in the digitised records. Those records contain 6,447 different values. This required a Python dictionary to be created containing the top entries by descending frequency. This could be simplified by removing gender, e.g. mother/father become parent. This would also allow “Grand Child” to be categorised, which is not possible with the current gendered relations.

A lookup table has been used to map 98% of the values in the digitised records to categories in this variable. The “Not specified” category is used for digitised records where no value for “Relation to head” is specified. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.
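The idea of building a lookup from the most frequent entries, and then measuring how much of the data it covers, can be sketched as follows. The sample values and the lookup are invented, and the sketch assumes that the coverage percentage refers to the share of records (rather than the share of distinct values) that the lookup maps.

```python
import pandas as pd

# Invented raw "Relation to head" values.
raw = pd.Series([
    "Son", "Daughter", "Son", "Wife", "Son",
    "Dau", "Wife", "Head of Family", "Son", "Daughter",
])

# Distinct raw values ordered by descending frequency.
frequencies = raw.value_counts()

# Share of records covered if the n most frequent values are mapped.
cumulative_coverage = frequencies.cumsum() / frequencies.sum()

# Hypothetical lookup built from the most frequent values only.
lookup = {"Son": "Son", "Daughter": "Daughter", "Wife": "Wife"}

# Share of records whose raw value appears in the lookup.
coverage = raw.isin(list(lookup)).mean()
```

Because a small number of values account for most records, a lookup covering only the most frequent values can still map a high share of the data.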

Categories:

  • Not specified
  • Head of family
  • Husband
  • Wife
  • Son
  • Daughter
  • Son-in-law
  • Daughter-in-law
  • Brother
  • Sister
  • Brother-in-law
  • Sister-in-law
  • Father
  • Mother
  • Father-in-law
  • Mother-in-law
  • Grandfather
  • Grandmother
  • Grandson
  • Granddaughter
  • Aunt
  • Uncle
  • Cousin
  • Nephew
  • Niece
  • Other relative
  • Servant
  • Boarder
  • Assistant
  • Apprentice
  • Nurse
  • Other worker
  • Visitor
  • Inmate
  • Other/not classified

Marital status

This is the marital status of a person.

It is derived from the “Marital status” column in the digitised records. Those records contain 182 different values. This required a Python dictionary to be created containing the top entries by descending frequency. Some records, which are mostly children, have a marital status of “Not eligible”. In this instance these records have been put into the “Other/not classified” category.

A lookup table has been used to map 99% of the values in the digitised records to categories in this variable. The “Not specified” category is used for digitised records where no value for “Marital status” is specified. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.

Categories:

  • Not specified
  • Single
  • Married
  • Widow
  • Widower
  • Other/not classified

Literacy

Literacy classifies people based on their ability to read and write.

It is derived from the “Literacy” field in the digitised records. Those records contain 3,810 different values. This required a Python dictionary to be created containing the top entries by descending frequency. The main categories are the same as in the original return form: “Read and write”, “Read”, “Cannot read”.

A lookup table has been used to map 97% of the values in the digitised records to categories in this variable. The “Not specified” category is used for digitised records where no value for “Literacy” is specified. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.

Categories:

  • Not specified
  • Read and write
  • Read
  • Cannot read
  • Other/not classified

Religion

Religion identifies a person’s religious profession.

It is derived from the “Religion” field in the digitised records. Those records contain 10,895 different values. This required a Python dictionary to be created containing the top entries by descending frequency.

A lookup table has been used to map 98% of the values in the digitised records to categories in this variable. The “Not specified” category is used for digitised records where no value for “Religion” is specified. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.

Categories:

  • Not specified
  • Church of Ireland
  • Methodist
  • Presbyterian
  • Protestant Episcopalian
  • Roman Catholic
  • Other/not classified

Religion (naive imputation)

Further processing has been performed to impute the religion of people without a specified religion by assigning them the value of the previous person listed in the census returns.

During the processing of the digital records, it was observed that a problem existed in the “Religion” entries where the recorded value was “-”. A “-” value for religion in the National Archives data usually does not mean that the respondent left the field blank; it is frequently either an Irish language entry that was not transcribed, or a ditto mark. When these responses were digitised they were miscounted. This caused responses to this question to be under-counted in the database. As a result of this, a decision was made to impute “Religion” entries and fill these missing values. There will be cases where the wrong value is picked, and there may be the odd case where there really is no value specified.

The approach chosen to naively fill a new religion_imputed column with the last seen non-missing value used the pandas ffill (forward fill) function. Any “-” value is replaced with pd.NA, which is then replaced with the last valid observation preceding it. This might mistakenly propagate the last seen value in one geographic region (e.g. county) across other geographic regions. This method was chosen as it is far faster than other approaches. The original religion variable remains in the original state.

This approach might give some inaccurate results for geographies that have been moved in the hierarchy, e.g. areas in County Down moved to County Antrim. However, values can cross geographic boundaries in the data anyway, e.g. the last value for County Antrim will be carried forward into the first records for County Armagh if the religion for those first records is “-”.
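The forward fill described above, including the way a value can cross a county boundary, can be sketched as follows. The sample records and the county column name are invented. The last line shows a per-county fill purely for comparison; it is not the approach used for this dataset.

```python
import pandas as pd

# Invented records, in the order they appear in the data.
records = pd.DataFrame({
    "county": ["Antrim", "Antrim", "Antrim", "Armagh", "Armagh"],
    "religion": ["Presbyterian", "-", "-", "-", "Roman Catholic"],
})

# Replace "-" with pd.NA, then carry the last valid observation forward.
as_missing = records["religion"].replace("-", pd.NA)
records["religion_imputed"] = as_missing.ffill()

# For comparison only: filling within each county leaves the first Armagh
# record missing instead of inheriting the last Antrim value.
records["religion_imputed_by_county"] = (
    as_missing.groupby(records["county"]).ffill()
)
```

Here the first Armagh record receives “Presbyterian” from the last Antrim record under the plain forward fill, while the original religion column is left unchanged.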

Categories:

  • Not specified
  • Church of Ireland
  • Methodist
  • Presbyterian
  • Protestant Episcopalian
  • Roman Catholic
  • Other/not classified

Irish language

Irish language classifies people based on whether they can speak only “Irish” or “Irish and English”.

It is derived from the “Irish language” field in the digitised records. Those records contain 974 different values.

A lookup table has been used to map 99% of the values in the digitised records to categories in this variable. The “Not specified” category is used for digitised records where no value for “Irish language” is specified—it almost certainly means that the person speaks English only. “Other/not classified” is used for values in the digitised records that are not contained in the lookup table and therefore not mapped to categories in this variable.

Categories:

  • Not specified
  • Irish
  • Irish and English
  • Other/not classified

Irish language (naive imputation)

Further processing has been performed to impute the “Irish language” variable. In compiling the data on the Irish language, some anomalies were observed. Overall, there was a discrepancy between the forms filled in by hand and the numbers appearing in the printed reports.

In order to align the digital dataset with the original reports, the following manipulations were carried out:

Where respondents marked themselves as speaking only “Irish” it would appear that enumerators on the whole did not accept these responses unless they were from a Gaeltacht area. From what has been learnt, these were parts of the following counties: Clare, Cork, Donegal, Galway, Kerry, Mayo, and Waterford.

If the “Irish language” entry was “Irish” and the county entry was not one of these Gaeltacht areas, the entry becomes “Irish and English”. In the case of County Tyrone this meant that over 200 “Irish” speakers were re-classified as “Irish and English”. Respondents in some predominantly non-nationalist Ulster communities who put “Irish” as their response were moved into the “Not specified” category. In County Antrim alone this number exceeded 10,000 in 1911 (and 13,000 in 1901). While the reason for this decision is unknown, the speculation is that the question was misunderstood or something political was going on at the time. It is also known that the Irish language question was dropped in the first Northern Ireland census in 1926.

The counties of Antrim, Armagh and Down were treated as special cases. If a row contained a “Religion” entry of “Roman Catholic”, an “Irish language” entry of “Irish” and a county entry in one of Antrim, Armagh or Down, the “Irish language” entry was changed to “Irish and English”.
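The rules above can be sketched as a single function. This is one possible reading rather than the exact implementation: in particular, the treatment of “Irish” responses from people in Antrim, Armagh or Down who are not recorded as Roman Catholic is an assumption (they are moved to “Not specified” here), and the function and column names are hypothetical.

```python
import pandas as pd

# Counties containing Gaeltacht areas, as listed above.
GAELTACHT_COUNTIES = {
    "Clare", "Cork", "Donegal", "Galway", "Kerry", "Mayo", "Waterford",
}
# Counties treated as special cases, as listed above.
SPECIAL_COUNTIES = {"Antrim", "Armagh", "Down"}


def impute_irish_language(county, religion, irish_language):
    """Apply the adjustments described above to one record."""
    # Only "Irish"-only responses outside Gaeltacht counties are adjusted.
    if irish_language != "Irish" or county in GAELTACHT_COUNTIES:
        return irish_language
    if county in SPECIAL_COUNTIES:
        if religion == "Roman Catholic":
            return "Irish and English"
        # Assumption: other "Irish" responses in these counties are the ones
        # moved into "Not specified".
        return "Not specified"
    return "Irish and English"


# Invented records to show the function applied to a table.
records = pd.DataFrame({
    "county": ["Galway", "Tyrone", "Antrim", "Antrim"],
    "religion": ["Roman Catholic", "Presbyterian", "Roman Catholic", "Presbyterian"],
    "irish_language": ["Irish", "Irish", "Irish", "Irish"],
})
records["irish_language_imputed"] = [
    impute_irish_language(county, religion, language)
    for county, religion, language in zip(
        records["county"], records["religion"], records["irish_language"]
    )
]
```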

A more sophisticated method of imputing religion may be investigated. Previous work on imputing religion showed that Pandas can be slow when grouping by household. This variable has been left as it is for now for simplicity.

Categories:

  • Not specified
  • Irish
  • Irish and English
  • Other/not classified

Forms filled in for house

This variable identifies the census forms that were returned for the house or institution where a person lived.

Different forms were used to record details of people in households and different institutions (such as barracks, hospitals, prisons etc) and it is possible to tell from the digitised returns which forms were filled in for a given house or institution.

While the digitised records do not contain sufficient information to tell us which form a person was returned on when multiple forms were filled in, this variable can act as a useful proxy for whether a person might be, for example, a soldier, a prisoner or a student.

The “Not specified” category covers people in the small number of households/institutions that do not have links to any relevant forms on the National Archives website.

Categories:

  • Not specified
  • Barrack return (Form H)
  • Barrack return (Form H)/Prison return (Form K)
  • College and Boarding-School return (Form G)
  • College and Boarding-School return (Form G)/Barrack return (Form H)
  • Hospital return (Form F)
  • Hospital return (Form F)/Prison return (Form K)
  • Household Return (Form A)
  • Household Return (Form A)/Barrack return (Form H)
  • Household Return (Form A)/Barrack return (Form H)/Prison return (Form K)
  • Household Return (Form A)/College and Boarding-School return (Form G)
  • Household Return (Form A)/College and Boarding-School return (Form G)/Barrack return (Form H)
  • Household Return (Form A)/Hospital return (Form F)
  • Household Return (Form A)/Workhouse Return (Form E)
  • Prison return (Form K)
  • Return of Idiots and Lunatics in institutions (Form I)
  • Return of the sick at their own homes (Form C)
  • Workhouse Return (Form E)
  • Workhouse Return (Form E)/Hospital return (Form F)
  • Workhouse Return (Form E)/Return of Idiots and Lunatics in institutions (Form I)

Variables not included

Five variables present in the digitised records are not currently included in this dataset due to data quality issues, as described below. Some work has been completed to prepare them for inclusion, but they did not pass our quality assurance checks.

Occupation

As noted above, the occupation variable includes around 180,000 unique responses and while work has been undertaken to classify these, it is currently achieving approximately 92%–94% accuracy, which is considerably short of our target of 98%–99% accuracy.

Collaborations with the CSO are in place to improve the accuracy of this classification. Once this work has been completed this variable may also be made available.

Specified illnesses

The specified illnesses variable requires a considerable amount of additional work to get the processed data to match the original reports sufficiently accurately.

From a preliminary analysis it would appear that the digitisation process did not catch the variety of illnesses recorded. Once this work has been completed this variable may also be made available.

Children born, Children living and Years married

The Children born, Children living and Years married variables all require further imputation and quality assurance before they can be included in this dataset.

The 1911 census form has questions relating to a current marriage: its duration, the number of children born alive within it and the number of children born alive and still living within it.

The instructions stated that only married women should provide an answer to these questions. However, in many cases they were also answered by men. The original census collectors may have adjusted the returns but the answers persist in the digitised records.

Find out more about Cantabular

Cantabular is a software framework for the protection and dissemination of statistical data.
