Digitisation

Information on statutory obligations

All organisations responsible for public records, as defined by the Public Records Act 1958 (c. 51), must ensure the safekeeping of those records prior to their selection for transfer to The National Archives or other approved place of deposit, under the guidance and supervision of the Keeper of Public Records. This covers all administrative and departmental records belonging to His Majesty, held in any format, including paper, digital, audio, film or model format.

The National Archives provides guidance and supervision to public record bodies on the safekeeping and selection of public records.

These obligations originate in section 3 of the Public Records Act 1958 (c. 51). See The National Archives’ Legislation and regulations pages for further information.

Where departments choose to digitise records, for whatever reason, both the original record and the digitised version are public records and so must be disposed of under the terms of the Public Records Act (PRA). Before embarking on any digitisation project, The National Archives (TNA) would recommend that plans are shared with The Advisory Council on National Records and Archives (ACNRA), to help the Departmental Records Officer (DRO) ensure independent oversight and that all relevant guidance is followed. This is particularly important if:

  • digitisation is likely to cause damage to the original physical record;
  • the intention is to destroy original records that were selected for transfer to TNA;
  • the benefits to the public purse of digitising records with either very high or very low selection rates are unclear; or
  • the outcome of the digitisation project does not help the department develop ways of using AI in selection techniques.

Some departments have chosen to digitise paper records to business standards to allow for appraisal to be undertaken off site before the original paper versions are transferred to TNA. Further guidance is available below on digitisation for business use.

Should a department wish to keep digitised records after transferring the originals, these would need to be retained under the PRA via ACNRA.

In exceptional cases (e.g., irreversible deterioration or contamination of original paper records), if a department intends to digitise selected records and destroy the originals, prior agreement should be sought from TNA and ACNRA. The DRO must ensure that the full content and context of the original record is captured, that all text is included and legible, and that the entirety of the document can be seen with no information missing. There must be proper control and oversight of the process and a clear and demonstrable chain of custody, to ensure that the digitised version would be adequate should it need to be presented as evidence in a court of law or to an official Inquiry.

The DRO must also ensure that metadata has not been corrupted and that all related guidance has been followed. There must be an end_date for each digital image, because the date last modified indicates only when the image was made, not when the original record was made. Without an end_date, the viability of the digitised public record for selection purposes would be affected.

Alongside any consideration for digitising paper records, TNA would recommend that departments have a comprehensive appraisal policy and a TNA-agreed transfer plan covering records of all formats. Risks of non-compliance with the PRA should be added to the corporate risk register and senior leadership made aware of escalation routes via TNA or ACNRA.

Should you require further technical guidance on digitisation, for example to supply to a professional digitisation company, please contact the Government Help Point at governmenthelppoint@nationalarchives.gov.uk.

Digitisation for business use

Before undertaking any scanning for business use, departments should consult with their DROs. This will ensure that the provenance and evidential value of the records can be determined before the original format of the records is destroyed, or altered so as to render the provenance questionable or to prevent the records being used as evidence for legal purposes.

Whichever scanning supplier you use, you should always ask to see a sample set of scans and Optical Character Recognition (OCR) output if this is part of the project, before the scanning begins in earnest. This sample stage allows you to ensure you are happy with the quality of their output and is the best time to request changes to their process if needed. You should repeat the sample process as many times as required to be sure that they have integrated your required changes before the project is in full production mode.

Once in full production, checks should be routinely made for quality of image, expected file format standards, adherence to file-naming conventions and folder hierarchy arrangements. Pragmatically, checks would be made on a subset of randomly selected image files from across batches newly received from scanning suppliers. This is in order to ensure that the supplier continues to meet agreed standards. Individual files or batches not conforming to agreed standards should be returned to the supplier for re-working.
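The routine spot check described above could be scripted along the following lines. This is a minimal sketch rather than a TNA tool: the function name sample_for_review, the 5% sampling rate, the minimum of one file per batch and the batch layout are all assumptions chosen for illustration.

```python
import random


def sample_for_review(batches, fraction=0.05, minimum=1, seed=None):
    """Pick a random subset of files from every batch for manual checking.

    `batches` maps a batch identifier to the list of file names in it.
    `fraction` and `minimum` are illustrative values, not TNA requirements.
    """
    rng = random.Random(seed)  # a seed makes the selection repeatable for audit
    selection = {}
    for batch_id, files in batches.items():
        if not files:
            selection[batch_id] = []
            continue
        # Sample at least `minimum` files, but never more than the batch holds.
        count = min(len(files), max(minimum, round(len(files) * fraction)))
        selection[batch_id] = sorted(rng.sample(files, count))
    return selection


# Hypothetical batches as they might arrive from a scanning supplier.
batches = {
    "batch_001": [f"AB_1_{n:04d}.tif" for n in range(1, 201)],
    "batch_002": [f"AB_2_{n:04d}.tif" for n in range(1, 11)],
}
to_check = sample_for_review(batches, fraction=0.05, seed=42)
for batch_id, files in to_check.items():
    print(batch_id, len(files), "files selected")
```

Recording the seed alongside the results would allow the same selection to be reproduced later if the check needs to be evidenced.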

File naming

Before beginning the scanning process, consider a logical file naming system that the scanning supplier can apply to the images as they are created. You may want to browse and retrieve images by file name alone, and a series of purely numbered images such as 0001 could make that very difficult. File names should be:

  • Unique
  • Consistently arranged
  • Made up of letters and/or numbers
  • Free of non-alphanumeric characters other than the underscore (_), if one is required, to minimise issues with use across operating systems
  • Not overly long or complex, to avoid character limitation issues in certain operating systems and also to avoid human error at the point of file naming
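The naming rules above lend themselves to an automated check before files are accepted from a supplier. The following is a sketch only: the 64-character limit, the case-insensitive uniqueness test and the function name check_file_names are assumptions made for illustration and are not figures or names taken from this guidance.

```python
import re

# Letters, digits and underscores only, followed by a single extension.
# MAX_LENGTH is an illustrative limit, not a figure from this guidance.
MAX_LENGTH = 64
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def check_file_names(names):
    """Return a dict mapping each problem file name to a list of reasons."""
    problems = {}
    seen = set()
    for name in names:
        reasons = []
        if not NAME_PATTERN.fullmatch(name):
            reasons.append("contains characters other than letters, numbers and underscore")
        if len(name) > MAX_LENGTH:
            reasons.append("longer than %d characters" % MAX_LENGTH)
        # Compare case-insensitively: some file systems treat A.tif and a.tif as the same file.
        key = name.lower()
        if key in seen:
            reasons.append("not unique")
        seen.add(key)
        if reasons:
            problems[name] = reasons
    return problems


names = ["AB_1_0001.tif", "AB_1_0002.tif", "AB 1 0003.tif", "ab_1_0001.tif"]
for name, reasons in check_file_names(names).items():
    print(name, "->", "; ".join(reasons))
```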

Image format standards

When choosing the correct format for digitised images, you should consider whether it allows lossless compression. This means the compression of the image without any loss of image data. This allows for a reduction in storage space required for your images whilst ensuring retention of all image data.

By comparison, lossy compression reduces the size of an image by discarding data which means data will be irretrievably lost i.e. even if you reversed the compression you would not retrieve all of the original image data.
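The difference between the two kinds of compression can be demonstrated in a few lines. This sketch uses Python's standard zlib module, which implements Deflate (the algorithm generally used for TIFF ZIP compression), for the lossless case. The lossy case is a deliberately simplified stand-in that discards the low-order bits of each sample; real lossy codecs work differently, and the sample data is invented for illustration.

```python
import zlib

# Stand-in for raw 8-bit image samples; a real scan would be far larger.
original = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))

# Lossless: Deflate is fully reversible, so the round trip restores every byte.
compressed = zlib.compress(original, level=9)
restored = zlib.decompress(compressed)
print("lossless round trip identical:", restored == original)

# Lossy (simplified illustration only): discard the four least significant
# bits of every sample. Real lossy codecs are more sophisticated, but the
# principle is the same: the discarded data cannot be recovered.
quantised = bytes(value & 0xF0 for value in original)
print("lossy copy identical:", quantised == original)
print("samples altered:", sum(a != b for a, b in zip(original, quantised)))
```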

JPEG2000 format is used by some institutions, including The National Archives, for a number of reasons, including a greater level of lossless compression (than, for example, TIFF format) and because it is an open standard. However, to view JPEG2000 images you would need to download a free viewer such as IrfanView (https://www.irfanview.com/), as Windows does not come with a standard application that would enable you to open JPEG2000.

You may also want to consider baseline RGB TIFF 6.0 or higher format. TIFF is a popular image format that your scanning supplier should be comfortable generating. It has lossless capabilities and can be understood by a multitude of software.

If applying compression to your images to reduce file size, lossless compression is recommended.

  • For JPEG2000 use lossless compression.
  • For TIFF files we recommend ZIP compression, but LZW compression is also an option if preferred.

Lossy compression results in discarded information that can never be regained, i.e. even if you reversed the compression you would not retrieve all of the original image data. It also negatively impacts on the ability to carry out Optical Character Recognition (OCR), which enables the content of your scanned image(s) to be searchable.

DPI and PPI standards

DPI is dots per inch and refers to the output resolution, i.e. the dot density of the image when it is printed.

PPI is pixels per inch and refers to the input resolution of the image.

There can sometimes be confusion between PPI and DPI, particularly when manufacturers use these terms interchangeably.

The National Archives specifies resolution in PPI where possible. Use of a lower PPI than required below can cause a loss of information, a poor user experience and an inability to carry out OCR successfully on scanned images. PPI requirements vary according to the format of the material to be scanned and are detailed below.

  • For ordinary documents, use a minimum of 300 PPI.
  • For photographs, use a minimum of 600 PPI.
  • For photographic transparencies, use a minimum of 4000 PPI.
  • For microform, aim for a resolution equivalent to a minimum of 300 PPI at the size of the original document (take into account the reduction ratio, usually recorded on the first frame of the microform).
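The arithmetic behind these figures can be sketched as follows. The A4 page size and the 24x reduction ratio are assumed examples, and the function names are hypothetical; the calculation simply scales the 300 PPI target by the reduction ratio, because a frame filmed at 24x reduction is 1/24 of the original's linear size.

```python
MM_PER_INCH = 25.4


def pixels_for(size_mm, ppi):
    """Pixel count along one edge for a given physical size and resolution."""
    return round(size_mm / MM_PER_INCH * ppi)


def microform_scan_ppi(reduction_ratio, target_ppi=300):
    """Resolution needed at film size to give `target_ppi` at original size.

    A frame filmed at a 24x reduction is 1/24 of the original's linear
    size, so the film must be sampled 24 times more densely.
    """
    return target_ppi * reduction_ratio


# An A4 page (210 mm x 297 mm) scanned at the 300 PPI minimum.
width_px = pixels_for(210, 300)
height_px = pixels_for(297, 300)
print(f"A4 at 300 PPI: {width_px} x {height_px} pixels")

# Microfilm with an assumed 24x reduction ratio.
print("Scan the film at:", microform_scan_ppi(24), "PPI")
```

Checking the pixel dimensions of a delivered image against the expected figure for its paper size is one quick way to confirm that the agreed PPI was actually used.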

Physical dimensions

Images should be single page, unless information crosses both pages. If a single scan cannot capture the page in its entirety, there should be sufficient overlap to allow users to determine clearly which of the separate digital images form the whole of the original paper page.

All scans should be size-for-size (for microfilm this refers to the size of the original), with a sufficient clear border/margin to demonstrate to users that the entire page has been captured.

Colour, legibility and other image scanning standards

For microform material, scan images in 8-bit greyscale using the Enumerated greyscale colourspace profile.

For all other material, scan images in 24-bit colour using the Enumerated sRGB colourspace profile.

All digital images should be legible and at least as readable as the original image from which they are derived.

All images should be viewed immediately after scanning as a check on satisfactory capture (for example, that images are complete and not inverted) and rescanned if required.

Metadata

While carrying out the scanning process the scanning supplier should be asked to capture a certain level of metadata regarding the scanning process. This ensures compliance with BS 10008 Evidential Weight and Legal Admissibility of Electronic Information – Specification which requires that, during the scanning process, ‘Associated scanning information (e.g. time, date, scanner number, number of pages) shall be created and retained’.

Additionally, capture metadata related to the images created that will allow you to manage your records and ensure that there has been no loss or duplication during scanning.
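One way to use such metadata to detect loss or duplication is to reconcile the page counts recorded at scanning time against the images actually delivered. The sketch below assumes a simple structure in which each record reference has a logged page count and a list of delivered file names; the structure, the references and the function name reconcile are hypothetical.

```python
from collections import Counter


def reconcile(expected_pages, received_files):
    """Compare the supplier's recorded page counts with the images delivered.

    `expected_pages` maps a record reference to the number of pages logged
    at scanning time; `received_files` maps the same reference to the list
    of image file names delivered. Both structures are hypothetical.
    """
    report = {}
    for reference, page_count in expected_pages.items():
        files = received_files.get(reference, [])
        duplicates = sorted(name for name, n in Counter(files).items() if n > 1)
        report[reference] = {
            "expected": page_count,
            "received": len(set(files)),
            "duplicates": duplicates,
        }
    return report


expected_pages = {"AB_1": 3, "AB_2": 2}
received_files = {
    "AB_1": ["AB_1_0001.tif", "AB_1_0002.tif", "AB_1_0002.tif"],
    "AB_2": ["AB_2_0001.tif", "AB_2_0002.tif"],
}
for reference, row in reconcile(expected_pages, received_files).items():
    status = "OK" if row["expected"] == row["received"] and not row["duplicates"] else "CHECK"
    print(reference, row, status)
```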

As part of the metadata generation process it is best practice to document how you have generated and captured your metadata and what control procedures have been used to ensure its consistency and to guarantee authenticity.

The National Archives has created a tool which captures this type of information: DROID. It can be run over a batch of images to provide information on the files such as file name, file path, file format and extension. It will also generate a SHA-256 checksum for each file.

A checksum is a ‘fingerprint’ for each file, made up of letters and numbers. No two checksums will be the same unless the content and format of the files are identical. The checksum does not relate to the file name, so two differently named files with identical checksums are duplicates in terms of content, and a change to the file name will not change the checksum. You can generate a new checksum for a file and compare it with the original checksum to ensure that the content of the file has not changed. A change in checksum means either that someone has changed the content of the file and then saved it, or that the file has been corrupted in some way.

Checksums are not a good way to determine the intellectual content of images. Images of the same object will often have different checksums due to automated image capture information such as the timestamp of image generation.

It is important to generate a SHA-256 checksum after the image is created in order to be able to check that the file remains the same over time and to determine whether it has changed. This information assists both in managing the files and with digital continuity.
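For readers who want to see how such a check behaves, the sketch below uses Python's standard hashlib module rather than DROID itself. It illustrates the principle only and is not how DROID is implemented; the file names and contents are invented stand-ins for real image files.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large image files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "AB_1_0001.tif"
    copy = Path(folder) / "renamed_copy.tif"
    first.write_bytes(b"stand-in for image data")
    copy.write_bytes(b"stand-in for image data")

    baseline = sha256_of(first)
    # Same content under a different name gives the same checksum.
    same_content = sha256_of(copy) == baseline

    # Any change to the content, however small, gives a different checksum.
    first.write_bytes(b"stand-in for image data.")
    changed = sha256_of(first) != baseline

print("duplicate detected:", same_content)
print("change detected:", changed)
```

Storing the baseline checksum with the rest of the image metadata allows the same comparison to be repeated at any later date.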

DROID cannot generate dates related to the original record from which the image has been created, so ensure you capture relevant dates relating to the record within the metadata as part of your metadata capture process. The National Archives provides further guidance on its website on digital continuity.

DROID is available to download for free from The National Archives website.

Optical Character Recognition (OCR)

In order to achieve optimum results in OCR it is important that you follow the guidance on capturing at the appropriate PPI as listed above (or higher PPI than listed if preferred) and that if any compression takes place, it is lossless compression.

Some other points to focus on are:

  • Use your high quality TIFF/JPEG2000 as the input for your OCR – rather than providing a compressed PDF.
  • Ensure that your scans do not have noise being introduced by the scanning process – such as spots and dark patches which are not on the originals. These types of marks considerably reduce the quality of the OCR output.
  • You must verify and assess the accuracy of the OCR output – do not assume it will be perfect because it is machine generated.
  • Try to use OCR software which outputs to an industry standard XML format such as PAGE. This will provide a lot more metadata about the results, such as character coordinates, which could be useful in the future. It would also enable the use of software which can view the OCR output overlaid onto the original image which makes the verification of the OCR output process much easier.
  • Ensure that the XML output is retained and can be connected to the image from which it was generated. This could be achieved by following the same naming convention for the OCR output as for the file to which it relates; but ensure you also add an extra identifier to the file name to show it is OCR so that you do not encounter issues with filename duplication with the image itself.
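The naming link between an image and its OCR output, described in the final point above, might be implemented as follows. The "_ocr" identifier, the ".xml" extension and both function names are illustrative assumptions; any identifier consistent with your own naming convention would serve.

```python
from pathlib import PurePath


def ocr_output_name(image_name, identifier="_ocr", extension=".xml"):
    """Derive the OCR output file name from the image file name.

    The stem is kept so the two files can be matched; the extra identifier
    (an illustrative choice) marks the file as OCR output.
    """
    return PurePath(image_name).stem + identifier + extension


def image_stem_for(ocr_name, identifier="_ocr"):
    """Recover the image's base name from an OCR output file name."""
    stem = PurePath(ocr_name).stem
    if not stem.endswith(identifier):
        raise ValueError(f"{ocr_name} does not carry the OCR identifier")
    return stem[: -len(identifier)]


print(ocr_output_name("AB_1_0001.tif"))
print(image_stem_for("AB_1_0001_ocr.xml"))
```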

Digitisation for a Public Inquiry

Should the DRO decide to digitise paper records required for a Public Inquiry, the department should consider taking legal advice to ensure that any digitised record is acceptable as evidence and in terms of legal admissibility. Records shared with a Public Inquiry should be selected for permanent preservation and transfer to TNA, making it clear how the department handled the records.