Safeguarding the nation’s digital memory

Five people sit round a table with laptops and papers on the table, looking down at some of the papers. There are three more people at another table in the background.

Workshop participants review the draft risk model

The nation’s digital heritage is rich, complex and fragile. This material – born-digital records (in a variety of formats), web archives, digitised archival materials – is under threat from rapidly evolving technology, outdated policies and a skills gap across the archives sector. To preserve this heritage for future generations, we must understand and navigate a vast and ever-shifting risk landscape. This is a challenge which no single archive is currently equipped to address.

We proposed a collaborative approach to managing digital preservation risk, bringing established statistical risk management methods into the digital heritage sphere. Project participants created a structured evidence base, pooling our collective experience to map and explain an interconnected network of risk events, actions and impact on heritage. This allowed us to build a model incorporating a holistic understanding of risk, enabling archivists to prioritise threats and choose the most effective actions to combat them.

The funded project ran from January 2020 to February 2021. The National Archives is now supporting ongoing development of the DiAGRAM risk modelling tool which was the main project deliverable.

Project outcomes

The tools and methods created through this project are now available to benefit the entire archives sector just as our existing services (such as PRONOM) are shared to benefit the whole digital preservation community. The grant of £93,500 from the National Lottery Heritage Fund (NLHF) was key to making this achievable, and getting a broad range of input from the archives sector in the model’s development to ensure that it is as widely applicable as possible.

Our academic collaborators in the Applied Statistics & Risk Unit (AS&RU) at the University of Warwick secured EPSRC Impact Acceleration Account funding for a research fellow to continue work to create a statistical model for digital preservation risk. This funding provided expertise to collaborate with archives to develop a risk analysis tool and supporting statistical model specifically for the archival environment.

The NLHF grant created the opportunity for broad collaboration to include a diversity of organisations from across the UK archives sector, enabling them to benefit from direct engagement with statistical specialists, and ensuring the tools and model reflect the needs of the wider archives community. This diverse group of organisations actively engaged with this project as full partners in the work, and later workshops and webinars brought the draft model to an even wider range of organisations and gave them the opportunity to comment before the end of the project.

The challenges faced by the UK archives sector in the digital arena are primarily resourcing and skills, so this project also complements our digital capacity building programme for the archives sector, Plugged In, Powered Up.

Archives are moving from an era of relative stability in archival practice to one of continual change. Each new generation of digital technology gives rise to a new set of risks with which digital archives must keep pace. There is a widening gap between the resources we have and the resources we need. The techniques that could help close this gap – good risk management, supported by sound evidence and robust statistics – were beyond the reach of the archives sector.

This project started to put these advanced techniques into the hands of archivists, using a co-creation approach that will develop skills, give ownership of the process and outputs and ensure that the models serve our real priorities, heritage and communities.

For more details of the project outcomes see the project’s article in Archives and Records, Volume 42, 2021 – Issue 1: Interdisciplinarity and Archives, pp 58-78 (an open access version is available in The National Archives’ institutional repository) and the Project Evaluation report by the Digital Preservation Coalition.

DiAGRAM – the Digital Archives Graphical Risk Assessment Model

The project produced an integrated decision support system, based around a risk model in the form of a Dynamic Bayesian Network. This tool will allow archives to investigate potential mitigations to digital preservation risks based on their own current circumstances, and communicate the relative effectiveness of different strategies (and the costs of different strategies) to relevant decision makers, funders and other stakeholders in an easy-to-understand way. This will allow archives to evidence their requests for support based on a rigorous model which will have been developed using the experience of a wide range of institutions.

The final prototype DiAGRAM tool is now available (the earlier version can also still be seen). This incorporates user feedback from our online workshop in the summer, and also formal user testing during the second development phase. We are aware of some areas where we need to make further improvements to the user interface and user experience but are happy to receive additional feedback. There is guidance within the tool. In addition, the recordings and slides for the demonstration and workshop sessions, linked from the information for our second introductory workshop, show a couple of sample scenarios: setting up the tool to reflect a particular archive’s circumstances, and then investigating potential changes to the archive’s policies to improve mitigation against digital preservation risk.

The final update on 16 February 2021 added two integrated example models to help users understand the significance of their risk scores. The first model shows the risks of using only backup to protect your digital materials (a low risk score indicating that material is at high risk), while the second represents a well-established national archive with good practice and highly skilled staff available.

Using only the basic model building tool (answering the questions relating to the input nodes), most people using the tool will fall somewhere between these two built-in models.

Users can also use the advanced customisation options to more closely model their archive where they have specific evidence that differs from the elicited probabilities we have used for the conditional nodes. We are still in the process of developing documentation and guidance for using these features, but hope to release this shortly. To see the example models within the tool, click on the ‘View results’ tab. You can then use the ‘Show answers’ button to view the assumptions for each model. You can also use the ‘Edit’ button to use one of these models as the basis for your own model, or delete them so that the examples are not included in the downloadable images and report.


A screenshot showing the two built-in example models

2021 revisions

In October 2021 a further phase of work began (funded internally by The National Archives) to resolve remaining web accessibility issues with the prototype. It is expected that further user research will be undertaken in support of this work from November 2021.

Project partners

The core project team at The National Archives was:

  • Alex Green (Service Owner: Digital Preservation) – project lead
  • Dr Alec Mulinder (Head of Digital Risk, Governance and Standards)
  • Dr Sonia Ranade (Head of Digital Archiving)
  • David Underdown (Senior Digital Archivist)
  • Hannah Merwood (Research Assistant in Applied Statistics)

From the Applied Statistics and Risk Unit (AS&RU) at the University of Warwick:

Archives partners are:

Project evaluation is being provided by the Digital Preservation Coalition.

This project is being funded by the National Lottery Heritage Fund and the Engineering and Physical Sciences Research Council (EPSRC) Impact Acceleration Account.

Introductory research posters and presentations

We have already produced several research posters and presentations during the development phase of the project in order to generate interest and share our early work as widely as possible, and have added further presentations during the funded phase of the project.

These include:

Project timetable

Development of DiAGRAM was initially expected to be complete by August 2020, and we had a prototype available at that point. As the COVID-19 pandemic reduced the travel associated with the project, we have been able to repurpose some of our funding to undertake further usability and accessibility work on the online tool, and the final prototype was released in November 2020.

The project kicked off with a workshop involving all partners on 29 January 2020. This built on work already undertaken during project development, which has given us a fairly good sense of the overall shape of the Dynamic Bayesian Network which will underlie the risk model. We reviewed the network and the detailed definitions for each node in it, and saw an early prototype of the online graphical interface so that we could gather initial user feedback.

Feedback included ways of showing a simplified ‘high-level’ view of the network to start with (perhaps aligning more directly with digital preservation models already more familiar to archivists), and ways that the online interface might capture initial details that will vary from archive to archive, such as service type and location (which will have a bearing on physical threats such as flood risk). William Kilbride from the Digital Preservation Coalition also introduced their proposed evaluation approach for the project.

Following this workshop the core project team continued to develop the underlying Dynamic Bayesian Network and gathered statistical information, and also continued work on prototyping a user interface.

A second workshop was held (online) on Tuesday 28 April and Wednesday 29 April. This allowed the wider set of project partners to review the work undertaken since the first workshop. The main purpose of this workshop was to undertake structured elicitation of expert judgement using the IDEA protocol to develop mathematically sound statistical data for aspects of the model where it does not otherwise exist. We were pleased to be able to welcome additional experts from the BFI (British Film Institute) and Cambridge University Library as well as those from the formal project partners.
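As a rough illustration of the aggregation step in a structured elicitation, the sketch below pools several experts' three-point estimates (lowest plausible, best guess, highest plausible) for one probability using an equal-weighted average. The `Estimate` type, the `aggregate` function and all of the numbers are hypothetical, and equal weighting is only one possible choice; this page does not state how the project combined its experts' judgements.

```python
from statistics import mean
from typing import NamedTuple


class Estimate(NamedTuple):
    """One expert's three-point estimate of a probability (hypothetical type)."""
    lower: float
    best: float
    upper: float


def aggregate(estimates: list[Estimate]) -> Estimate:
    """Pool estimates with an equal-weighted average (an assumed choice)."""
    return Estimate(
        lower=mean(e.lower for e in estimates),
        best=mean(e.best for e in estimates),
        upper=mean(e.upper for e in estimates),
    )


# Invented estimates from three experts, as they might stand after discussion.
revised_estimates = [
    Estimate(lower=0.05, best=0.10, upper=0.20),
    Estimate(lower=0.10, best=0.20, upper=0.40),
    Estimate(lower=0.00, best=0.15, upper=0.30),
]

pooled = aggregate(revised_estimates)
print(pooled)
```

The pooled best estimate could then populate one entry of a conditional probability table in the network, with the pooled lower and upper values giving a sense of the experts' uncertainty.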

Hannah Merwood published Risk Alert: Insufficient Technical Metadata on the DPC’s blog on 20 May 2020. This examines the development and refinement of our thinking on this aspect of the model and how it fits into the Bayesian Network.

The prototype release of DiAGRAM was completed and tested by project partners prior to a series of dissemination and training events in June and July 2020. This will make DiAGRAM available to the wider archives sector and digital preservation community. We will then incorporate feedback on the model from these events into a second release at the close of the project. We hope to have a final in-person event as a project close, and a webinar in October looking in more detail at the theoretical underpinnings of the project.

Due to the likelihood of ongoing restrictions related to the COVID-19 pandemic we set the workshop times with the aim of giving a choice to those with additional caring responsibilities. The second workshop was also at a time more accessible to those in the Americas, and the third for more easterly timezones. Event information, including slides and recordings, is available via the DPC website:

  1. Tuesday 16 June 2020, online, 13:00-17:00 BST (UTC+1) [slides and other materials on event page]
  2. Friday 17 July 2020, online, 15:00-19:00 BST (UTC+1) [recordings, slides and other materials on event page]
  3. Monday 27 July 2020, online, 09:00-13:00 BST (UTC+1) [slides and other materials on event page]

The postponed 2020 ARA Conference finally took place as an online event in September 2021 and included a workshop on the DiAGRAM tool.

Throughout the life of the project, members of the project team also presented at relevant conferences and events, and provided updates through this project page, The National Archives’ blog and other appropriate channels.

Project background

The National Archives has been in the vanguard of digital preservation worldwide for the last 20 years. Our tools and services have been adopted across the international archival community. The most prominent of these is PRONOM, a technical registry of file formats, software and components that are required to support long-term access to digital records. Through its use and development we now collaborate with more than 120 archives around the world. We are world-leading pioneers in digital transformation in heritage, and are one of only a handful of fully functional digital archives in the world.

Following the adoption of our current digital strategy in 2017, we are undergoing a strategic transformation to become a second generation archive that is ‘digital by instinct and design’. As part of this transformation, we are actively learning from outside the archives sector and adopting relevant techniques from other domains to build a natively digital practice that transcends the legacy of our paper collections and delivers the best outcomes for all types of digital records. We have placed users at the very centre of what we do, enabling multiple perspectives on a digital record’s creation and continuing evolution through use.

As part of this transformation of our digital archiving practice, we are adopting an approach to managing digital preservation risk that is firmly built on evidence. Our initial work internally, and with statisticians at AS&RU, has demonstrated that advanced statistical methods can be applied in the archival environment, enabling our digital heritage to benefit from the same mature approach to risk management as other complex, rapidly evolving fields with high degrees of uncertainty. For example, these techniques have previously been used to help safeguard the UK’s food security and improve the safety record of oil rigs. The National Archives is committed to applying these quantitative, evidence-based methods to help deliver better, more cost effective digital preservation outcomes for the heritage with which we are entrusted.

Project development

Research into this question began within The National Archives in June 2017 with staff familiarising themselves with the theory of Bayesian Networks and reviewing existing (qualitative) approaches to risk management within digital preservation (for example: Rosenthal et al’s 2005 D-Lib Magazine paper, DRAMBORA [2008], SPOT model [2012]). This was followed by a risk mapping exercise to identify risks within our Digital Repository Infrastructure, and establish relationships between those risks. We then attempted to model a small area of the risk map as a Bayesian Network.

12 linked boxes containing probability graphs which show how different factors such as the type and age of digital storage media, along with the impacts of environmental storage conditions, such as the exposure of the storage media to high temperatures, high humidity etc might affect the ability of an archive to render a digital file in the long-term.

Diagram 1: This shows the prototype Bayesian Network modelling various factors which affect the long-term renderability of digital files, such as the type and age of storage media, and various environmental factors that could impact on the reliability of the digital storage. Probabilities shown here are illustrative only.

12 linked boxes containing probability graphs which show how different factors such as the type and age of digital storage media, along with the impacts of environmental storage conditions, such as the exposure of the storage media to high temperatures, high humidity etc might affect the ability of an archive to render a digital file in the long-term. A scenario has been set up to investigate the effect on risk if the archive decided to move all storage to hard disk, and those disks were then exposed to high levels of magnetic flux.

Diagram 2: This shows the prototype model set up to investigate the effect on risk of moving storage to hard disks, and the impact if those disks were then exposed to high levels of magnetic flux. The model indicates that there would still be a small reduction in the risk that digital files could not be rendered in this scenario.
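The kind of scenario comparison shown in the diagrams can be sketched with a toy Bayesian Network evaluated by simple enumeration. This is a minimal illustration rather than the prototype model itself: the three-node structure, the node and state names, and every probability below are hypothetical, chosen only to show how fixing evidence on input nodes changes the computed risk.

```python
from itertools import product

# Hypothetical prior over storage media type (illustrative numbers only).
P_MEDIA = {"hard_disk": 0.5, "other_media": 0.5}

# Hypothetical prior over exposure to magnetic flux.
P_FLUX = {"high": 0.1, "low": 0.9}

# Hypothetical conditional table: P(storage failure | media, flux).
P_FAILURE = {
    ("hard_disk", "high"): 0.20,
    ("hard_disk", "low"): 0.05,
    ("other_media", "high"): 0.40,
    ("other_media", "low"): 0.40,
}

# Hypothetical conditional table: P(file cannot be rendered | storage failure).
P_UNRENDERABLE = {True: 0.90, False: 0.02}


def p_unrenderable(evidence=None):
    """Return P(file cannot be rendered | evidence) by full enumeration.

    `evidence` may fix the "media" and/or "flux" input nodes, for example
    {"media": "hard_disk", "flux": "high"}.
    """
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for media, flux, failed in product(P_MEDIA, P_FLUX, (True, False)):
        # Skip combinations that contradict the scenario being investigated.
        if evidence.get("media", media) != media:
            continue
        if evidence.get("flux", flux) != flux:
            continue
        p_fail = P_FAILURE[(media, flux)]
        weight = P_MEDIA[media] * P_FLUX[flux] * (p_fail if failed else 1 - p_fail)
        denominator += weight
        numerator += weight * P_UNRENDERABLE[failed]
    return numerator / denominator


baseline = p_unrenderable()
scenario = p_unrenderable({"media": "hard_disk", "flux": "high"})
print(f"baseline risk: {baseline:.4f}")
print(f"all hard disk, high flux: {scenario:.4f}")
```

Enumeration is only practical for a handful of nodes; a network of realistic size would normally rely on a dedicated inference library.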

It quickly became apparent that we would need to develop stronger statistical skills in order to be able to develop a rigorous, evidenced, model. Fortunately we were then introduced to Professor Jim Smith through the Alan Turing Institute in August 2018. Professor Smith is a Turing Fellow and member of AS&RU.

Following initial discussions with AS&RU we held an internal workshop at The National Archives in November 2018 to build on our initial work and to get a formal introduction to techniques such as expert elicitation, which will help us obtain statistically valid information for the Bayesian Network, even when it is hard to gather data directly to inform the probabilities that need to be associated with particular risks.

Following this workshop we were satisfied that we had a viable approach for developing the risk model. However, we wanted to make the model as broadly applicable for archives as possible, so decided to seek partners across the UK archives sector. In order to support their participation we began preparing a bid to the National Lottery Heritage Fund (NLHF) and gathering a group of interested organisations from across the UK archives sector.

An initial (self-funded) workshop for all potential partners was held in November 2019 to introduce the modelling techniques and confirm interest, and the NLHF bid was also submitted in November. NLHF funding of £93,500 was awarded in January 2020.

Cartoon left hand with middle and index fingers crossed, third and little finger folded down over palm, nails of these fingers appear as eyes above a crease on the palm which resembles a smiling mouth.

Supported by The National Lottery Heritage Fund