Millies of the Shankill come out tops based on new occupation data revealed in the 1911 Ireland Census

Sometimes, working on software presents opportunities to do something exciting with data. This is one of those moments. When we launched the Ireland 1911 census in 2021 as a statistical tool for everyone, we held back the occupation variable because the data wasn't ready. Today, we're pleased to announce its release and hope others can uncover fascinating stories within the data.

How many people know that the single biggest concentration of textile workers in any district electoral division on the island of Ireland was among women in the Shankill? Or that soldiers' occupations were obscured in the digitized 1911 records, and that it would take coding to rectify this? Or that people who worked in the printing industry were concentrated in Dominick Street Lower in Dublin?

Getting there took detective work, manual classification, coding and guidance.

The two biggest challenges were 1) dealing with over 170,000 unique occupation descriptions and 2) correcting the records of military personnel who had stated their previous, rather than current, occupation.

This work would not have been possible without the support of the National Archives for hosting the 1911 Census and the CSO for providing guidance on potential approaches to occupation classification. Also a shout out to academics Dr. Sean Lyons (ex-ESRI) and Dr. Niall Cunningham (Newcastle University) who supported and encouraged us to release this data. A special thanks to Steve and Peter at Sensible Code for data manipulation, and to Gerry O’Hanlon and Lily McGuire for their work in checking and manually classifying the higher frequency occupations.

The methodology we employed

Our objective was to replicate as far as possible the classification of occupation undertaken for the original reports. That original classification was an entirely manual operation, informed by local knowledge and by detailed instructions that the Census managers developed to handle specific cases. We, on the other hand, applied automated procedures, informed by limited background information derived from the published reports.

In initial processing we manually examined the list of unique occupation descriptions and classified approximately 6,000 of them into one of the 24 Orders that the original census reports used to categorize occupations. We focused on the most frequently appearing descriptions. While complete agreement could not be expected, particularly at the detailed level, initial comparisons showed a good degree of comparability except for the military, police, and the classification of the retired. To achieve acceptable results, we needed to delve deeper into the digitized records and apply software automation.

Form A, known as the Household Return, was used to document residents living in private households. However, for individuals residing in institutions, different forms were employed. This distinction is crucial for accurately interpreting census data.

Military personnel and police officers living in barracks were recorded on Form H, the Barracks Return. This form uniquely instructed these individuals to state their occupation prior to enlisting, so the digitized records reflect their previous occupations rather than their service. The original manual coders, by contrast, would have known that they were dealing with a barracks and would have classified the residents as military or police as appropriate. For example, one barracks in Kildare houses (according to the digitized record) 679 male residents, with occupations listed as "farrier," "tailor," and "clerk."

The digitized records available on the National Archives website do not explicitly indicate the form used for each individual. However, the webpage for each "house" lists the forms associated with it. This information enabled us to identify individuals recorded on Barracks Returns, helping us accurately categorize military and police personnel.

Differentiating between police and military personnel proved to be more complex. The only way to ascertain whether a Barracks Return pertained to a military or police barracks was through the manual examination of the original scanned forms. We meticulously created a list of all police barracks, allowing us to distinguish between the two groups effectively.

Adding to the complexity, army and navy pensioners were classified in the same Order as active personnel.

Moving on from the police and military, the 1911 Census general report reveals that the occupants of certain institutions, children (those under 15 years of age), and individuals retired from business were all classified in Order 24 (Unproductive).

Using the same approach as for the barracks, we identified individuals returned on Form E (the Workhouse Return) and Form I (the Return of Idiots and Lunatics in Institutions) and assigned them to the Unproductive Order. We also assigned all records of individuals under 15 years of age to that Order.

Furthermore, we searched for terms like "pensioner" and "retired" in the occupation descriptions and classified these occupations as Unproductive, except in cases where they referred to military pensioners.
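The form-based rules described above can be sketched as a small override function. This is an illustration only, not the code we ran: the `Person` fields, the `forms_by_house` mapping, the `police_barracks` set and the "Military" and "Police" labels are all hypothetical names, and the sketch assumes that everyone in a house listing Form H was returned on that form.

```python
from dataclasses import dataclass
from typing import Optional

UNPRODUCTIVE = "Order 24 (Unproductive)"
MILITARY = "Military"  # placeholder label, not an official Order name
POLICE = "Police"      # placeholder label, not an official Order name


@dataclass
class Person:
    house_id: str    # hypothetical key linking a person to a "house" page
    age: int
    occupation: str


def override_order(person: Person,
                   forms_by_house: dict[str, set[str]],
                   police_barracks: set[str]) -> Optional[str]:
    """Return an Order forced by form, age or pension status, else None."""
    forms = forms_by_house.get(person.house_id, set())
    text = person.occupation.lower()

    # Form H (Barracks Return): stated occupation is pre-enlistment, so ignore it.
    if "H" in forms:
        return POLICE if person.house_id in police_barracks else MILITARY
    # Form E (Workhouse Return) and Form I (institutions) go to Unproductive.
    if forms & {"E", "I"}:
        return UNPRODUCTIVE
    # Children under 15 go to Unproductive.
    if person.age < 15:
        return UNPRODUCTIVE
    # Pensioners and the retired go to Unproductive, except army/navy pensioners.
    if "pensioner" in text or "retired" in text:
        if "army" in text or "navy" in text:
            return MILITARY
        return UNPRODUCTIVE
    return None  # no override: fall through to the ordinary classification


forms = {"h1": {"H"}, "h2": {"H"}, "h3": {"E"}, "h4": {"A"}}
police = {"h2"}
print(override_order(Person("h1", 25, "farrier"), forms, police))  # Military
```

Returning `None` when no rule fires keeps these overrides separate from the description-based classification, so the two steps can be checked independently.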

We then focused on improving the quality of our manual occupation classification by incorporating software processing. First, we applied standard string processing techniques to identify similar descriptions and ensure that all such occupations were classified under the same Order. For example, after manually classifying "Professor of Music" to Order 3 (Professional Occupations), we used these techniques to automatically classify variations such as "Professor of music" (different case), "Professor Music" (without "of"), and "Professor in Music" (using "in" instead of "of") consistently.
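As a rough illustration of this kind of string processing (a minimal sketch, not our production code), the snippet below lower-cases each description, drops punctuation and removes a small set of connecting words before comparing. The `STOPWORDS` set and the function names are assumptions made for the example.

```python
import re

STOPWORDS = {"of", "in"}  # assumed list of connecting words to ignore


def normalise(description: str) -> str:
    """Lower-case, keep alphabetic words only, and drop connecting words."""
    words = re.findall(r"[a-z]+", description.lower())
    return " ".join(w for w in words if w not in STOPWORDS)


def propagate(manual: dict[str, int], descriptions: list[str]) -> dict[str, int]:
    """Copy a manually assigned Order to every description with the same key."""
    by_key = {normalise(desc): order for desc, order in manual.items()}
    return {d: by_key[normalise(d)] for d in descriptions if normalise(d) in by_key}


manual = {"Professor of Music": 3}  # Order 3 (Professional Occupations)
variants = ["Professor of music", "Professor Music", "Professor in Music", "Music teacher"]
print(propagate(manual, variants))
```

All three "Professor" variants reduce to the key "professor music" and inherit Order 3, while "Music teacher" is left unclassified.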

Next, we manually reviewed the original census reports to compile a comprehensive list of occupations for each Order. This list was instrumental in correcting our initial manual classifications and in identifying previously missed occupations. We standardized classifications for terms like "apprentice" or "assistant" to ensure consistency; for instance, "apprentice butcher" was classified the same as "butcher."
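One simple way to treat "apprentice butcher" the same as "butcher" is to remove the qualifier before looking up the Order. The sketch below assumes a two-word `QUALIFIERS` list taken from the examples above; the real list and the function name are hypothetical.

```python
import re

QUALIFIERS = ("apprentice", "assistant")  # assumed qualifier list
_QUALIFIER_RE = re.compile(r"\b(?:" + "|".join(QUALIFIERS) + r")\b", re.IGNORECASE)


def strip_qualifiers(description: str) -> str:
    """Remove qualifier words so the base occupation can be looked up."""
    without = _QUALIFIER_RE.sub(" ", description)
    cleaned = " ".join(without.split()).lower()
    # If nothing is left (the description was only a qualifier), keep the original.
    return cleaned or description.strip().lower()


print(strip_qualifiers("apprentice butcher"))  # butcher
print(strip_qualifiers("Butcher Apprentice"))  # butcher
```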

Finally, we focused on classifying some of the remaining occupations by searching for specific keywords such as "army," "navy," "police," "r i c," "barrister," "medical doctor," and "clergy". We selected a small, targeted set of keywords where we were confident there was a strong correlation between the presence of a keyword and a particular occupation.
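A keyword pass of this kind might look like the following sketch. The keywords are the ones listed above, but the labels they map to are placeholders rather than official Order names, and matching on whole words only is an assumption chosen here to keep false positives low.

```python
import re
from typing import Optional

# Keyword -> placeholder label (hypothetical; not the official Order names).
KEYWORD_LABELS = {
    "army": "Military",
    "navy": "Military",
    "police": "Police",
    "r i c": "Police",
    "barrister": "Legal",
    "medical doctor": "Medical",
    "clergy": "Clergy",
}


def keyword_label(description: str) -> Optional[str]:
    """Return the label of the first keyword found as whole word(s), else None."""
    # Lower-case and turn punctuation into spaces, so "R.I.C." becomes "r i c".
    text = " ".join(re.sub(r"[^a-z]+", " ", description.lower()).split())
    padded = f" {text} "
    for keyword, label in KEYWORD_LABELS.items():
        if f" {keyword} " in padded:
            return label
    return None


print(keyword_label("Sergeant R.I.C."))  # Police
print(keyword_label("Farm labourer"))    # None
```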

As a result, we successfully classified approximately 22,000 unique occupation descriptions, covering approximately 94% of all records.
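The gap between those two figures (roughly 22,000 of more than 170,000 unique descriptions, yet about 94% of records) arises because a small number of descriptions account for most people. The sketch below shows how both shares can be computed; the toy data and the `coverage` function name are invented for illustration.

```python
from collections import Counter


def coverage(occupations: list[str], classified: set[str]) -> tuple[float, float]:
    """Return (share of unique descriptions classified, share of records classified)."""
    counts = Counter(occupations)
    unique_share = sum(1 for desc in counts if desc in classified) / len(counts)
    record_share = sum(n for desc, n in counts.items() if desc in classified) / sum(counts.values())
    return unique_share, record_share


# Toy data: two common descriptions and one rare misspelling.
records = ["farmer"] * 6 + ["labourer"] * 3 + ["proffesor musick"]
unique_share, record_share = coverage(records, {"farmer", "labourer"})
print(f"{unique_share:.0%} of unique descriptions, {record_share:.0%} of records")
```

In the toy data, classifying two of three unique descriptions covers nine of ten records, which is the same effect at a much smaller scale.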

With additional time and resources, we are confident that we could classify an even larger number of occupations and further refine our existing classifications. Potential areas for future investigation include:

Spelling Corrections and Similar Descriptions: Using string similarity methods like the Jaro-Winkler algorithm and Levenshtein distance to correct spelling errors and identify descriptions similar to those already classified.
Keyword Identification: Finding keywords and terms uniquely associated with specific occupations to enhance classification.
Natural Language Processing (NLP): Applying NLP techniques, for example text features fed to a machine-learning classifier such as a random forest, to classify occupation strings with similar meanings.
Large Language Models (LLMs): Leveraging recent advances in AI, especially LLMs, to identify and classify similar occupation descriptions, even when the terminology differs. While LLMs excel at capturing the meaning of descriptions, using them would require careful prompt engineering or fine-tuning.

These approaches hold promise for improving our classification system and expanding its coverage.
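To make the first of these ideas concrete, here is a minimal pure-Python sketch of Levenshtein distance used to match a misspelled description to an already classified one. We have not applied this to the census data; the `max_distance` threshold of 2 and the function names are assumptions that would need tuning, since short occupations can sit close to unrelated words.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_classified(description: str, classified: list[str],
                       max_distance: int = 2) -> Optional[str]:
    """Return the nearest classified description within max_distance, else None."""
    best = min(classified, key=lambda known: levenshtein(description, known), default=None)
    if best is not None and levenshtein(description, best) <= max_distance:
        return best
    return None


print(closest_classified("carpentar", ["carpenter", "farmer"]))  # carpenter
```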