Flexible publication of metadata with GraphQL

Last year we added a new capability to Cantabular: the dissemination of structured metadata alongside the real-time creation of safe cross-tabulations from large datasets, which is what the rest of Cantabular focuses on.

For example, people using data published with Cantabular might need information on its geographic or temporal coverage, the contact details of its creators, descriptions of the variables used, and background on methodology and privacy techniques.

The metadata service we created provides a flexible mechanism for loading and publishing exactly this kind of metadata. It supplements our user interface and also informs subsequent processing and analysis.

This is a more technical blog post than some of our other recent posts. It goes into detail on how our metadata service works, and gives examples that you can try out on a live version of the software used in our publication of the 1911 Ireland census.

Guiding principles

After gathering requirements from our customers, we settled on a set of principles to guide us in adding the metadata capability to Cantabular:

Minimal core: we wanted to minimise changes to the core of the software that handles sensitive data, keeping it as simple and secure as possible and avoiding new dependencies that could be subject to supply chain attacks in future.
Flexibility in input: we wanted to give customers flexibility in how they attach metadata to different concepts in our software, giving them the ability to define the fields they need in the structure they need.
Flexibility in output: similarly, we didn’t want to bake in a requirement to conform to a particular output standard or schema. We wanted the flexibility to support a range of different schemas and, given the very large range of possible fields, to allow flexible querying of whichever schema is in use.

A GraphQL-powered micro-service

Our solution, designed by our lead developer Duncan Harris, was a GraphQL-based micro-service that loads and publishes metadata independently from the rest of our software. This allowed us to keep the main engine of the software, Cantabular server, unchanged.

We also adapted our public user interface to incorporate specific metadata fields, when available, to provide helpful context to a person creating their own table.

The service receives metadata requests in GraphQL format and responds with standard JSON output in a structure that mirrors the GraphQL request. While libraries are available in many languages to work with GraphQL, it is straightforward to send requests and parse responses without using one.
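As a rough illustration of working without a GraphQL library, the Python sketch below uses only the standard library. It assumes the service follows the common convention of accepting a JSON POST body with a "query" member and returning "data" and "errors" members; the function names are ours for illustration and are not part of Cantabular.

```python
import json
import urllib.request

# Endpoint quoted later in this post. That it accepts JSON POST bodies in the
# usual GraphQL-over-HTTP style is an assumption, not something stated here.
ENDPOINT = "https://metadata.ireland-census-preview.cantabular.com/graphql"


def build_request(query, url=ENDPOINT):
    """Wrap a GraphQL query string in a JSON POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def extract_data(payload):
    """Return the "data" member of a response, raising if errors are reported."""
    if payload.get("errors"):
        messages = "; ".join(e.get("message", "unknown error") for e in payload["errors"])
        raise RuntimeError(messages)
    return payload["data"]


def run_query(query, url=ENDPOINT, timeout=10):
    """Send the query over HTTP and return the decoded "data" object."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout) as response:
        return extract_data(json.load(response))
```

Calling run_query("query { meta { license } }") should then return a dictionary shaped like the query, provided the assumptions above hold for the live service.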

The screenshots below show where some of this additional information appears in the user interface:

Screenshot: Contact details and licensing metadata are shown to the right in Cantabular’s public user interface

Screenshot: The panel on the right shows metadata for a variable in Cantabular’s public user interface

The diagram below shows how the metadata service sits alongside other components of our software:

Diagram: How the metadata service relates to other Cantabular services

GraphQL for queries and metadata schema definition

GraphQL is commonly used as an API standard to allow flexible access to related objects and fields, allowing a user to get exactly the fields of data they need and no more. Our metadata service uses GraphQL for its API to give us exactly this kind of flexibility in output. But we also decided to use it for schema definition.

Instead of a fixed set of metadata fields that someone using the service can pick from, we use GraphQL schema language to allow a user to define their own schema at runtime. We then parse this schema and validate supplied metadata against it. This allows a customer to build their own schema to match the metadata that they need to publish.
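To make the idea of validating metadata against a user-supplied schema more concrete, here is a deliberately small Python sketch. It is not how the metadata service is implemented: it uses regular expressions rather than a real GraphQL parser, handles only String fields and nested object types, and all function names are hypothetical. The two types are copied from the schema shown later in this post.

```python
import re

# Two types copied from the schema shown later in this post.
SCHEMA = """
type Contact {
  name: String!
  email: String!
  phone: String
  website: String
}

type VariableMetadata {
  description: String
  url: String
}
"""

# Simplified patterns: no lists, arguments, comments or directives are handled.
TYPE_RE = re.compile(r"type\s+(\w+)\s*\{([^}]*)\}")
FIELD_RE = re.compile(r"(\w+)\s*:\s*(\w+)(!?)")


def parse_schema(sdl):
    """Map each type name to {field name: (field type, required)}."""
    types = {}
    for type_name, body in TYPE_RE.findall(sdl):
        types[type_name] = {
            field: (field_type, bang == "!")
            for field, field_type, bang in FIELD_RE.findall(body)
        }
    return types


def validate(types, type_name, value, path=""):
    """Return a list of problems found when checking value against type_name."""
    path = path or type_name
    if type_name == "String":
        return [] if isinstance(value, str) else [f"{path}: expected a string"]
    if not isinstance(value, dict):
        return [f"{path}: expected an object"]
    fields = types[type_name]
    # Report keys that the schema does not define.
    problems = [f"{path}.{key}: not defined in schema" for key in value if key not in fields]
    for field, (field_type, required) in fields.items():
        if value.get(field) is None:
            if required:
                problems.append(f"{path}.{field}: required field is missing")
            continue
        problems.extend(validate(types, field_type, value[field], f"{path}.{field}"))
    return problems
```

With this sketch, validate(parse_schema(SCHEMA), "Contact", {"name": "A"}) reports that the required email field is missing, because the schema marks it with an exclamation mark.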

The user-provided schema has three built-in concepts that other user-defined types and fields can be added to:

ServiceMetadata: for fields related to a group of datasets being published together through a single website or API, such as contact details or licensing and copyright information.
DatasetMetadata: for fields related to an individual dataset, such as a release date, statistical units, or geographic or temporal coverage.
VariableMetadata: for an individual variable, such as a description, a link to read more, or information on the range of values the variable can take.

Metadata examples with the Ireland 1911 census

Our re-publication of the 1911 Ireland census using Cantabular has been live for a few months now, and uses our metadata service to supply the user interface with exactly the kind of information described above.

The metadata service’s GraphQL API is also publicly available at https://metadata.ireland-census-preview.cantabular.com/graphql. Visiting this link in a browser will show an interactive IDE that allows you to experiment with queries yourself.

Service metadata example

For example, to see the contact details and licensing information for the service, you can use the following query:

query {
  meta {
    contact {
      name
      email
      phone
      website
    }
    license
    copyright
  }
}

Try it out!

Variable metadata example

To get the names, descriptions and links for all the variables in the dataset you can use the following query:

query {
  dataset(name: "Ireland-1911-preview") {
    vars {
      name
      meta {
        description
        url
      }
    }
  }
}

Try it out!
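Because the JSON response mirrors the query, downstream processing can be very simple. The Python sketch below turns the "data" part of a response to the variable query into a name-to-description lookup. The sample response uses invented placeholder values in the expected shape; it is not real output from the service, and the function name is ours.

```python
def variable_descriptions(data):
    """Map each variable name to its description, skipping variables without one."""
    descriptions = {}
    for variable in data["dataset"]["vars"]:
        meta = variable.get("meta") or {}
        if meta.get("description"):
            descriptions[variable["name"]] = meta["description"]
    return descriptions


# Placeholder response in the shape of the query above; names and text are invented.
sample = {
    "dataset": {
        "vars": [
            {"name": "var_a", "meta": {"description": "First placeholder", "url": None}},
            {"name": "var_b", "meta": {"description": None, "url": None}},
        ]
    }
}

print(variable_descriptions(sample))  # {'var_a': 'First placeholder'}
```

The second placeholder variable is dropped because the schema allows a variable's description to be absent.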

All metadata example

And to get all the metadata available through the service, use the following query:

query {
  meta {
    websiteTitle
    websiteDescription
    license
    contact {
      name
      email
      phone
      website
    }
    copyright
  }
  dataset(name: "Ireland-1911-preview") {
    name
    meta {
      description
      methodology {
        link
        statement
      }
      sdc {
        link
        statement
      }
      units
    }
    vars {
      name
      meta {
        description
        url
      }
    }
  }
}

Try it out!

Metadata schema example

The underlying schema, which we wrote in the GraphQL schema language and loaded into this instance of our metadata service, is shown below:

type ServiceMetadata {
  websiteTitle: String
  websiteDescription: String
  contact: Contact!
  license: String
  copyright: String
}

type Contact {
  name: String!
  email: String!
  phone: String
  website: String
}

type DatasetMetadata {
  description: String!
  methodology: Methodology
  sdc: Methodology
  units: String
}

type Methodology {
  statement: String!
  link: String
}

type VariableMetadata {
  description: String
  url: String
}

Ideas for the future

We’ve been pleased with how well this idea has worked in practice and we’re keen to see how it can be extended in the future.

Our current ideas include: supporting metadata in multiple languages, which should need only minimal changes; extending the built-in concepts to include variable categories, so that metadata can be attached directly to them; building adaptors to convert our outputs to other metadata standards; and perhaps even developing authoring and collaboration tools to help with metadata creation.

If you’ve managed to read this far, then maybe you’d like to drop us a line and let us know what you think.