Episode two: Digitisation, digital archives and preservation

In this episode of the Annual Digital Lecture audio series, we hear from specialists in preservation, digitisation, digital archiving, as well as web archiving. We think about digitisation as a way of preserving our records and a way to enable access to fragile records.

We also explore what considerations and challenges are involved in the acquisition and cataloguing of digital records, and consider how care is embedded in practices of sustainability, labour and trust.

[Start of recorded material at 00:00:00]

Lily: Hello and welcome to the Annual Digital Lecture podcast. My name is Lily and I’m the academic engagement officer here at The National Archives in Kew. For the past six years, The National Archives has hosted the Annual Digital Lecture, an event where we welcome leading speakers to talk about digital research and practice, in addition to highlighting some of the innovative digital work happening here at The National Archives. This past year’s lecture, hosted on the 28th of November 2023, was delivered by the creative studio Identity 2.0, who explored self-care and memory making in a digital age.

Building on these discussions from November, this podcast explores how care is part of the innovative digital work happening at The National Archives. In each episode, we’ll have a conversation with colleagues across a range of different departments and teams who will lend their expertise to talk about how the digital and care intersect in their everyday work. From the care for our records through preservation and processes, to the care for people who work with and use our records. Join us to reflect and to look to the future.

In this second episode, we’ll be hearing from specialists in preservation, digitisation, digital archiving, as well as web archiving. We’ll think about digitisation as a way of preserving our records, as well as a way to enable access to records that are fragile. We’ll look at born-digital records, exploring what considerations and challenges are involved in their acquisition, cataloguing and sharing. And also consider how care is embedded in practices of sustainability, labour and trust.

So today, I’m really excited to be joined by Sarah Noble, who is our Head of Conservation for Imaging, Tom Storer, Head of Web Archiving and Sonia Ranade, our Head of Digital Archiving. Let’s start from digitisation. Sarah, you’re Head of Conservation for Imaging. Can you first of all just give us a definition of digitisation?

Sarah: Yes, hi. Here at The National Archives, when we talk about digitisation, what we are referencing is a process where we create digital images of our original records using cameras and scanners. These digital images can then be archived for long term storage as well as shared online.

Lily: So what gets digitised at The National Archives and how is that chosen?

Sarah: Yeah, of course. What gets digitised here at The National Archives is often dependent on the appetite of our external partners. We work with many academic and family search publishing partners. We also work with universities, archives and libraries around the world who approach us wanting to digitise our records and then publish that content online. The National Archives also puts together tenders that focus on key content that we would like to have digitised and published. And such examples are the 1921 census, First World War diaries and MOD, Ministry of Defence service records.

Digitising our collections also feeds very nicely into our key strategy of archives for everyone, as it opens up access to our records to a lot more people on a global scale. Conservation fits into the digitisation process as we are mainly responsible for the care of our collections. So we make sure that everything that is being requested for the digitisation process is safe and stable throughout.

Lily: Can you give us some examples?

Sarah: One example of where conservation has helped enable access was when we found some very badly burnt pages within some bound volumes. These pages were so badly burnt that every time you tried to touch them, the pages would crack and split and start fragmenting. However, there was content that was on the pages that we still wanted to capture. So we did some testing and found that if we could line the burnt pages with a very thin Japanese tissue (Japanese tissue is very strong due to its long fibres), using a solution of glycerol, which is an adhesive, in an isopropanol solution, that’s a solvent, to adhere the tissue to the papers, then they become much more stable and can be handled with care, so that they can be imaged.

Another example of where we’ve helped enable access to fragile collections is when we were imaging several hundred First World War and Army of Occupation war diaries. Within the war diaries, we found several large transparent paper maps. However, they’ve been folded and over time they’ve become very brittle and have broken into many pieces. So we wanted to image these but to do that we needed to put them back together and then repair them so they could be handled and imaged. This took many hours. It’s quite slow and monotonous work but we managed to piece them together.

And then we would adhere them together, again using Japanese tissue, strips of repair tissue and a very dry adhesive. Transparent papers are innately very sensitive to any kind of moisture. They’re also transparent, so any repairs we do have to be done quite sensitively. A third example would be when sometimes we come across parchment membranes that have been folded and if they’ve been folded for hundreds of years and not opened, they become very stiff and inflexible.

So to be able to unfold them so you can image them without causing more damage, we need to find a way to soften them and relax them. And we do this through a process of humidification. We use a humidification chamber. And once that relaxes the parchment enough, we can then use a system of flattening through either pressing or sort of tension to keep the membrane flat so then it can be imaged.

Lily: What are the benefits to the collection and to the archive from digitisation from your perspective?

Sarah: There are multiple benefits to the collection. One of the main benefits through digitisation is preservation. By publishing our content online, we remove the need to physically handle these collections. Over time, regular handling of collections has a major impact on the condition of physical records. So if we can reduce this, then we help in their long term preservation. We also enable access to fragile collections. This has two-fold benefits in that the conservation enables access and then also by publishing that content online, you’re also enabling access so that people can access these collections from anywhere.

In the past, they’ve had to come on site to see our collection whereas now they can access them. Other key benefits are having conservation involved in these large scale digitisation projects, it can open up avenues for research. These projects, because we look at so much content, can often provide really good case studies for heritage science. They analyse fragile or unknown material for us and this then helps inform the research into new conservation treatments. There is also the benefit that once a collection has been imaged and is accessible online, these collections can be transferred to our off site storage, creating more space for other collections at our site at Kew.

Lily: So Sarah, you’ve spoken to us about the digitisation of physical records. Let’s now move on to born-digital material. And Tom, I’m going to first turn to you as Head of Web Archiving. Could you start by giving us an introduction to web archiving and maybe speak to the differences between web archiving and digitisation?

Tom: Yes, of course. So fundamentally, web archiving is a form of digital archiving, as it involves capturing web content, websites and some social media in a way that it can be preserved and then reused in an archived snapshot. So to do this, we archive all of the necessary code and data that is presented to the human, to the user, via browsers through the use of various tools that capture the content and store it in a static portable file which we then preserve, which is a bit like a zip file really. And then this can be reassembled in a browsable form in just a normal web browser for access.

So it’s not digitisation as such as the content itself is obviously digital at the point that we archive it, but it may have been originally analogue at the point of creation and then subsequently digitised before being archived into the web archive. It’s quite a different process to other forms of archiving capture and transfer we do at The National Archives. As by nature, it’s highly contemporary. So for example, content captured today, such as a resource from the Gov.uk website, may well be available in as little as three days’ time to our users online.

It’s constantly changing and drifting and the technical challenge is really significant due to the ever-changing, supercharged rate of evolution of the Internet, of the web. And think about the emergence of new social media platforms and their kind of widespread adoption by government departments. We are also proactively taking the content rather than accepting it as a transfer. Our effort is really a recognition of the central role that the web plays in how government, the state interacts with citizens and publicly with itself.

So as such, the web archive is an extremely rich and quite diverse collection, full of context in its own right as a collection. But also for other collections that in time will become available. So think about, for example, what the government says today about a particular initiative or policy and what will be released in the future, say, via emails. In terms of back to digitisation, while we start with a web based digital object, there are some parallels in that we are creating copies or separate manifestations of a thing when we do web archiving for preservation and use.

The one fundamental difference is that we don’t and we can’t keep the absolute original thing, however you decide to define that. Digital is infinitely copyable, so there’s quite a philosophical debate about what the original thing is anyway. So as a record, our web archive captures the process of digitalisation to publication in some cases. So for example, we have some content from the Air Accidents Investigation Branch going back as far as 1958, which is published onto the Gov.uk platform as a digitised record. We then subsequently capture that into the web archive. We get this multi-layered approach to archiving the digitised version of the original which is quite an interesting thing.

In terms of experience, a website is different every time it’s viewed by someone on the web. So depending on the browser you use, the device you use and so on, there’s a lot of complexity inherent there. Our websites are really complex, they’re mixed media compound objects made up of lots of different things. I suppose uniquely they are part of a network of things. So even if you are opening up a single web page, you’re actually accessing about 20 different digital items when you do that, that are put together into one thing that you’re looking at. So we need to be able to deal with those in the web archive and not only at one point in time, but multiple points in time.

Lily: Can you tell us a little bit more about the process of web archiving and how you view care within that process?

Tom: So it’s been really interesting to reflect on the effort that we put into the archiving of that original resource, which in turn, when it was originally created, there was a lot of effort put into it. So for example, we have the COVID collection. So we were doing some very frequent archiving of COVID material throughout the pandemic. And we were taking a great deal of care to build a comprehensive collection of everything that the government said online about this, as something with intrinsic historic value and an evidence base of how the state dealt with the situation.

So as I mentioned, web archiving is a very active process and we do not always need to work with creators of the content in order to archive it. But it’s really difficult to get right. The content that we are targeting in the web archive is highly dynamic, often complex. And the task is so huge that we spend a lot of time deciding the most efficient way to capture content and the scale of the challenge makes prioritisation very important. So this in itself is really an act of care that we undertake in the team. We automate as much as we possibly can, relying, as we must, on the technologies available to us today.

But sometimes very detailed painstaking work needs to be done to preserve a website or a resource and stitch things together at quite short notice. We put a lot of care into the quality of what we capture. So, again, to talk about the pandemic, we were capturing content on a daily basis that needed to be captured, checked and then published in a very short period of time because it was frequently changing and we needed to make sure that we were archiving it before the next change. So there’s a sense of familiarity and almost a kind of intimacy that the web archivists get when they’re dealing with sites.

And a big motivation for us is this idea about future researchers asking, “What did the government say now about subject X or Y?” I suppose the last thing to say really about this is I’ve talked a lot about collection. Access is obviously a really important point. So it’s important for accountability, it’s important for providing access to that historic evidence base. And government websites and resources are the same as any others. They suffer from link rot where content is available and then it’s no longer available six months or a year later. And one of the roles of the web archive, another act of care, is to provide a service that can reduce that rot or drift from the original information and help the public to access it.

And last but not least, from a preservation perspective, we’re quite different to lots of other archives and collections because there is inherently quite a lot of duplication within the archive. There’s a sort of a certain school of thought which is called LOCKSS, which is Lots of Copies Keep Stuff Safe. But that creates various issues for us when it comes to access but it’s also, arguably, not the most sustainable way of going about what we do.

Lily: The process that you just outlined touches on sustainability. Can you tell us a little bit more about that?

Tom: Yeah, sure. So this is really another aspect of our care or custodianship in this area. And it’s rightly really important, but also raises a number of complex questions on multiple fronts. So our web archivists are constantly making decisions about what we need to archive to make an excellent representation of the thing we’re trying to preserve. And that involves excluding superfluous content. So there’s a bit of a value judgment there where we have some guidelines that we try to follow consistently, basically removing content that is superfluous, that adds very little value to the overall archive itself.

And this is necessary because often the tools will get stuck in websites. Think of search facets, for example, where they will cycle through all the different options and you can quite quickly get to thousands of effectively links that are basically pointing to the same thing. So we want to avoid that as much as possible. And it also involves some basic maintenance work that we do that at the scale we’re operating at is still quite an undertaking. So for example, when a website is no longer being updated, making sure that we don’t keep taking snapshots of it for years and years going forward.

So in general, it’s really about making sure that we get a balance between our use of resources, environmental and otherwise and achieving our mission.

Lily: Sonia, I was wondering whether you could speak some more on the idea of digital archiving and digital preservation as a sustainable act from your perspective as Head of Digital Archiving. What are some of the benefits and challenges of digital archives?

Sonia: Yeah, of course, we can’t keep all the digital records or digital information that we create. That was true for paper too but digital records are created and kept in greater volumes and they’re invisible. You don’t have to keep adding filing cabinets that take up your physical space. We know from working with government departments that lots of departments have large volumes of digital information and we term this the digital heap. Because often it’s not well categorised or the people who understood it have moved on. And that information is expensive to keep: keeping computer drives spinning consumes resource.

So archiving as an activity allows departments to make a selection and that’s a positive decision about what has value and should be kept, but also what should be disposed of. In our approach to digital archiving, we’re still following a paper model in some respects where departments look at what they hold and make that positive selection and then send it into the archive. But we’re also learning from Tom’s approach in web archiving to understand how to take records that the department creates but maybe doesn’t physically hold. So for example, records that are hosted in cloud systems or which may not exist as discrete objects but are created almost on the fly as people work with them.

So we’re seeing a shift in our thinking and in that paradigm about how the archive functions and how we work with our creating department. But overall the aim is the same, it’s to make decisions about what should be kept and to get those records that have long term value into the archive. And then conversely also to destroy records that shouldn’t be kept and material that shouldn’t be kept as a sustainable activity.

Lily: There’s definitely a piece on environmental care and consideration with digital archives and material, as well as access. I was wondering what else does care mean to you in the context of digital archiving?

Sonia: Yeah, digital preservation isn’t really a predefined process or a set of steps. It is about actively caring for the records. So we have to work to understand and manage the risks to our records over time. And these threats can arise from the things that we can see that we’re all aware of. For example, can we maintain secure physical custody of our records? Or are our storage media starting to degrade over time? And we tend to be focused on those threats to the integrity of the record. But we also need to think about threats to our ability to work with digital records as technology changes. So for example, will the software or the operating environments we need still be available over time to enable our users to work with those records? And probably the greatest challenge is managing threats to our understanding of those records.

So over time, we need to keep them within that context: where did they come from? Why were they created? And what can they tell us? And without all three, the records aren’t going to be usable into the future. We did some work to model risks to the digital archive which I’ll talk about a little bit later. We found that the risk to digital records is actually highest before they’re transferred to the archive. So at that point where they’re still held by the creating department but they’re no longer in active use. And a key focus for the digital archiving team has been to create a service that helps government departments transfer records to the archive quickly and securely. And we’re encouraging them to transfer early and little and often, rather than continuing to save up those large volumes of records.

Lily: Tom talked about the concept of the original in web archiving. What’s your take on that in regards to other digital material?

Sonia: Yeah, it’s exactly the same. With digital, there’s no longer a real concept of the original. So Tom touched on that for web archives but it’s true across all digital forms too. Digital is copyable and once you make that copy, it’s identical. There’s no original. And we archive and keep the copy that we receive. Like Tom, we’re also seeing more duplication because digital records are easier to share across government departments. And so we may find that things come in multiple times from the same department or from different departments.

But where we add value is in the way we think about provenance. So it’s no longer about whether an object is the original but it’s about knowing where it came from and what happened to it along the way and maintaining that is what contributes to trust in the archive and trust in those records. We make copies too for access. Sometimes we might receive something in a form that’s not really going to be easy to present through a browser or for somebody at home to be able to access. And so we don’t always serve up the original in its original form, we really do try and add value for the user.

But that copy that we originally took is always available and we’re always transparent about that and that could be requested if that was something that a user wanted to see.

Lily: And so now moving away from the records themselves, I was wondering if we could look at the human side of the archive, at the people working behind the scenes to create and maintain these archives. There’s often a lot of invisible labour in digital spaces.

Sonia: Yeah, absolutely. We think about digital archiving as a technology focused activity and we do use a lot of technology but everything we do relies on our people. We have a really diverse team. Archival skills and archival thinking are always going to be important and they’re still at the heart of our work but we need a range of other expertise too. So we have colleagues with expertise in systems development, in data engineering, infrastructure design, security operations and we work beyond our own organisational boundaries.

So we’ve built some really strong external partnerships which bring in additional skills and different perspectives in areas like statistics, natural language processing, digital humanities. And we have some active partnerships with academic organisations. And I personally love supporting research students as well, that’s really rewarding. So it takes a lot of people with a lot of different skills to build and run a functioning digital archive.

Lily: Tom and Sarah, is there anything that you’d like to add from your perspectives?

Sarah: Similar to what Sonia said, it’s all down to having the right people with the right skills. And actually resourcing and staffing is a key challenge for us. Finding skilled conservators that want to work in an archive within digitisation is quite a challenge, especially as conservation for digitisation is quite new in the field of conservation. And there are only a few of us that are actually starting to create a career out of it because it’s quite different from the traditional idea of being a conservator. But hopefully, that will change over time and more people will move into the area of digitisation for conservation and see that there is a good career path.

Tom: Yeah. And in terms of web archiving, absolutely. I think as quite a niche activity, it’s been encouraging, similarly encouraging in the last few years. I’ve been web archiving at National Archives for 15 years. And in that time, the kind of web archiving community outside these walls has grown greatly. So there’s a bigger community of practice and support and so on, the overall mission of archiving the web which is really interesting. The specialisms, the skills and specialisms that it takes to archive the web do take a little while to learn. But we see a lot of training going on and lots of people understanding what it takes to archive these resources.

Lily: So before we go, I have one final question I would love to hear all of your thoughts on. What current research or conversations are happening in your fields that you’re excited about?

Sonia: Yeah, so we do lots of research. So we do research obviously into digital preservation, research into how we can present really complex digital records in a way that’s really usable. But I wanted to pull out two examples that I’m really proud of. So I mentioned earlier that we built a risk model. This was a statistical model. It used a Bayesian statistical approach and it’s helping us to understand and manage the threats to our records. I love this one because it was delivered through a partnership, so we had technical specialists on-board and we had archivists from right across the sector from different types of archives contributing to our expertise.

The statistical techniques, the Bayesian statistics, that’s a really well established approach. But I don’t think it had been applied in an archival context before. So that was really good to be able to bring in expertise from other fields and apply it to what we’re doing at The National Archives. And then the second example that I would like to highlight is our user research. So we’ve made a real shift in the way that we work towards putting our users at the heart of our services. And our user researchers made that possible through working with stakeholders to really understand their needs, so that we can make sure our services are intuitive and that we deliver value. And that’s just been something that’s been so rewarding to see.

Lily: Tom?

Tom: So in web archiving, what we’re really excited about at the moment is ways of searching our collection. So we’ve got over six billion objects that make up the collection. So these are all the really significant objects, reports and perhaps videos and images and stuff. But it’s also all of that code and it’s all interesting to someone, so everyone’s going to find it interesting in different ways. But obviously, searching across a massive collection like that is really challenging. And we have a search service at the moment which does a fairly good job as a traditional search service. The thing that we are most interested in at the moment and really excited by is being able to search across this collection in a way that gives our users a better sense of what’s in it, what’s not in it, themes and concepts rather than merely just keywords and so on.

So there’s quite a lot of work going on in my team about that at the moment and we hope to have something that we can talk about a bit further in the near future.

Lily: And Sarah, what about you?

Sarah: I think in the world of conservation, the growth of heritage science and the support that they give us through providing analysis for research, so we can look at new conservation treatments, is really key for us to grow and change. And also find new ways to treat and prepare collections en masse which is often what we do here at The National Archives. I would also say the other key area that is something really interesting to look into is working with other institutions that are also digitising their original collections on a large scale in Europe, over in America. And seeing how they do it, how they manage the process, how they manage their treatments and what kinds of equipment they use to image their collections.

I think that for me is a really key area for us to look at to see if we can change and grow and improve what we already do.

Lily: Thank you for listening to our Annual Digital Lecture podcast and thank you to our experts for taking the time to talk to us today. To learn more about the Annual Digital Lecture and watch recordings of our previous lectures, click the link in the text on the episode page or visit nationalarchives.gov.uk and search for Annual Digital Lecture. If you’re interested in learning more about our research, as well as our work as an independent research organisation, visit our website, nationalarchives.gov.uk and search for research and academic collaborations.

Follow us on X at UK National Archives Research to stay up-to-date with our research projects, upcoming events and other opportunities. And don’t forget to read our blogs at blog.nationalarchives.gov.uk. This audio recording from The National Archives is Crown copyright. It is available for reuse under the terms of the Open Government Licence.

[End of recorded material at 00:29:21]