Aims and context
The National Archives has a large catalogue, available online since 2000. Our catalogue is both an archival record in its own right and a crucial business asset. Currently our data, which is largely aligned to the standards ISAD(G) and ISAAR(CPF), is stored in a relational database. Over the past 20 years, the infrastructure supporting our catalogue has expanded into an ecosystem of more than 10 databases. Separate databases were built, for example, to store information about closed and retained records and (later) to store and preserve born-digital archives.
We believe that a pan-archival catalogue is now possible, as long as we are ready to move beyond the standards and technologies used to manage paper documents and a first generation of digital archives. We want to re-imagine archival practice by pioneering new approaches to description and access (for both analogue and digital records). This means that we have to re-think our data model.
Project Omega aims to determine a sustainable data model that opens the door to a future master catalogue holding a canonical set of metadata for all our records, regardless of medium. This data model should be flexible enough to support a second generation of complex born-digital accumulations. Our proposition is to move towards a single pan-archival linked data catalogue, removing the separation between physical and digital records, and taking a holistic view of an archive’s assets (including digital copies and other manifestations of our documents).
By reducing the number of systems used to manage catalogue data, we will be able to introduce better workflows for data enhancement and enrichment and for managing access or publication restrictions. We also aim to increase confidence in the integrity of the data by introducing robust version and provenance information.
This new approach should provide firm foundations for delivering the aims of our ‘Archives for Everyone’ strategy, enabling The National Archives to free our data and to unleash the power of an archival catalogue in a way that can support new forms of user engagement, participation, data re-use and research. This is where Project Omega (catalogue management) and Project Etna (catalogue and website publication) meet.
Project Omega findings
The project began in November 2019. After a research phase looking into our own catalogue structures and into the external data modelling and ontology landscapes, Omega has delivered:
- A strengths and weaknesses analysis document exploring the existing standards to underpin our conceptual data model (January 2020).
- A paper, Omega Catalogue Model Proposal (PDF, 3.6MB), outlining findings and technology recommendations. The paper evaluates 35 test cases with catalogue data expressed using different ontologies.
- A set of recommendations for the Omega Proof of Concept:
  - use of the Records in Contexts Conceptual Model (RiC-CM)
  - use of a combination of existing, mature vocabularies, inspired by the Matterhorn RDF ontology. This hybrid approach allows us to avoid re-inventing the wheel while also enabling us to talk about concepts in a wider context, reaching beyond the world of archives. Unfortunately, the RiC Ontology (RiC-O) does not meet our need to handle revisions, manifestations, and associated provenance and access rights information.
  - use of a graph database (AWS Neptune)
- An approach for temporally aware description that models metadata variation over time, clearly separating the enduring form of a record from its transient descriptions. Any change to the description or arrangement of a record generates a new description and/or arrangement. Any fact established in the past is therefore immutable and fully transparent; we can only add new revisions.
  Additional properties in the Omega Data Model will describe relationships between versions and their temporal extent, using the W3C (World Wide Web Consortium) provenance vocabulary (PROV) to describe how our records change. PROV gives us the ability to model and store information about revisions, agents (people/organisations), and activities (the process of changing something) (March 2020).
- An approach to model and encode catalogue data regarding the legal conditions governing access to public records in the UK, using the Open Digital Rights Language (ODRL) vocabulary (April 2020).
- A transformation pipeline plug-in that converts and serialises our data into RDF (Resource Description Framework) (March 2020).
- An application to encode identifiers (April 2020).
- User interface workshops, identifying the key ways in which staff managing the catalogue interact (or would like to interact) with the data (March-April 2020).
- The Omega Catalogue Data Model for our proof of concept (May 2020).
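The temporally aware description approach listed above can be illustrated with a minimal Python sketch. It is not the Omega Catalogue Data Model: the `Record` and `Description` classes, the `example.org` URIs and the `/description/N` URI pattern are all hypothetical, chosen only to show the idea of an enduring record with append-only descriptions. The only real vocabulary terms used are `prov:generatedAtTime`, `prov:wasAttributedTo` and `prov:wasRevisionOf` from the W3C PROV ontology, and the sketch assumes timezone-aware timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple

PROV = "http://www.w3.org/ns/prov#"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


@dataclass(frozen=True)
class Description:
    """One immutable, time-stamped description of a record (hypothetical shape)."""
    uri: str
    record_uri: str
    title: str
    generated_at: datetime
    attributed_to: str                 # URI of the agent who made the change
    revision_of: Optional[str] = None  # URI of the description this one revises


class Record:
    """The enduring record. Descriptions can be added but never edited or removed."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._descriptions: List[Description] = []

    def describe(self, title: str, agent_uri: str, when: datetime) -> Description:
        """Append a new description that revises the previous one, if any."""
        previous = self._descriptions[-1] if self._descriptions else None
        if previous is not None and when < previous.generated_at:
            raise ValueError("revisions must be added in chronological order")
        description = Description(
            uri=f"{self.uri}/description/{len(self._descriptions) + 1}",
            record_uri=self.uri,
            title=title,
            generated_at=when,
            attributed_to=agent_uri,
            revision_of=previous.uri if previous else None,
        )
        self._descriptions.append(description)
        return description

    def current(self) -> Optional[Description]:
        """Return the most recent description, or None if there is none yet."""
        return self._descriptions[-1] if self._descriptions else None

    def as_of(self, when: datetime) -> Optional[Description]:
        """Return the description that was in force at the given moment."""
        matches = [d for d in self._descriptions if d.generated_at <= when]
        return matches[-1] if matches else None

    def history(self) -> Tuple[Description, ...]:
        """Return every description ever made, oldest first."""
        return tuple(self._descriptions)


def to_ntriples(description: Description) -> List[str]:
    """Serialise the PROV facts about one description as N-Triples lines."""
    subject = f"<{description.uri}>"
    stamp = description.generated_at.isoformat()
    triples = [
        f'{subject} <{PROV}generatedAtTime> "{stamp}"^^<{XSD_DATETIME}> .',
        f"{subject} <{PROV}wasAttributedTo> <{description.attributed_to}> .",
    ]
    if description.revision_of is not None:
        triples.append(
            f"{subject} <{PROV}wasRevisionOf> <{description.revision_of}> ."
        )
    return triples


if __name__ == "__main__":
    # Hypothetical record and agent URIs, for illustration only.
    record = Record("http://example.org/record/1")
    agent = "http://example.org/agent/cataloguer"
    record.describe("Letter", agent, datetime(2020, 1, 1, tzinfo=timezone.utc))
    record.describe("Letter to the Treasury", agent,
                    datetime(2020, 3, 1, tzinfo=timezone.utc))
    for line in to_ntriples(record.current()):
        print(line)
```

Because `Description` is frozen and `Record` only exposes an append operation, an earlier description cannot be altered once written; a correction always appears as a new revision linked to its predecessor, and `as_of` can reconstruct what the catalogue said at any past moment.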
Project blogs on interesting technical challenges that we have encountered can be found at https://catalogueprojects.medium.com/.
We are ready to develop a pilot platform, API and user interface to perform a selected set of core tasks and workflows, acting as a demonstration of a replacement for our 20-year-old catalogue management system (PROCAT Editorial). We will be sharing our findings online and through conference submissions.
Participants: Jone Garmendia (Head of Cataloguing), Alex Green (Service Owner), Faith Lawrence (Project Manager/Data Analyst), Adam Retter (Consultant Developer/Technical Architect), The National Archives.