Developing an enhanced dataset for the history of photography

Researchers Giorgia Tolfo and Katherine Howells reflect on their recent project focusing on the metadata in our photographic copyright collections. They look at the opportunities and challenges of collaborative digital experimentation.

Published 2 October 2025 by Dr Katherine Howells and Giorgia Tolfo

Gathering waterlilies by P H Emerson, 1885. Catalogue reference: COPY 1/373.

Metadata from The National Archives’ online catalogue, Discovery, offers vast potential for analysis and enrichment using emerging digital technologies. In particular, metadata from our collection of photographic copyright records can, with careful preparation and analysis, open up new avenues for research into the history of photography and intellectual property.

Creation of the dataset

While raw catalogue data can be downloaded from Discovery, transforming it into a useful structured dataset is not a simple task.

Firstly, the exact pieces in the record series COPY 1 which contain photographs and are catalogued to item level need to be identified.

Secondly, downloads from Discovery are limited to 10,000 results. If you wish to export more metadata, the Discovery API needs to be used instead.

Finally, although records can be downloaded from Discovery as XML or CSV, the most interesting information is held as a single line in the “Description” field, as the screenshot below shows. To get the most out of these descriptions, they need to be transformed from semi-structured into structured data: in other words, into a dataset where all the fields are neatly separated.

This we achieved using regular expressions and other methods; the process is described in more detail on GitHub.
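To illustrate the general idea, here is a minimal sketch of separating a single description line into named fields with a regular expression. It is not the project's actual code, which is documented on GitHub. The field labels, the `parse_description` function and the sample description are all assumptions made up for this example, and real catalogue descriptions may be worded differently.

```python
import re

# Hypothetical field labels, assumed for illustration only.
FIELD_LABELS = [
    "Copyright owner of work",
    "Copyright author of work",
    "Form completed",
    "Registration stamp",
]

# Match "<label>: <value>", where the value runs up to the next label
# or to the end of the description.
_label_alt = "|".join(re.escape(label) for label in FIELD_LABELS)
FIELD_PATTERN = re.compile(
    rf"(?P<label>{_label_alt}):\s*(?P<value>.*?)(?=\s*(?:{_label_alt}):|$)",
    re.DOTALL,
)


def parse_description(description: str) -> dict:
    """Split one semi-structured description line into named fields."""
    first = FIELD_PATTERN.search(description)
    # Text before the first label is treated as the free-text subject.
    subject = description[: first.start()] if first else description
    record = {"Subject": subject.strip().rstrip(".")}
    for match in FIELD_PATTERN.finditer(description):
        # Trailing full stops are dropped; this is a simplification that
        # would also trim a value ending in an abbreviation.
        record[match.group("label")] = match.group("value").strip().rstrip(".")
    return record


# Invented sample, not a real catalogue entry.
sample = (
    "Photograph of a lake with waterlilies. "
    "Copyright owner of work: A Person, 1 Example Street, London. "
    "Copyright author of work: A Person, 1 Example Street, London. "
    "Form completed: 1 January 1886. "
    "Registration stamp: 1886 January 2."
)
record = parse_description(sample)
```

Each label becomes a key in `record`, so the owner, author and dates can be written out as separate columns. A description with none of the expected labels is returned with only a "Subject" field, which makes unparsed records easy to find and check by hand.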

Yet once the fields were neatly separated, new problems became apparent. For example, in many cases names and addresses appear in several variations that are not easily reconcilable. This variation has a range of causes: changes in names and addresses over time, different people filling in forms at different times and, most importantly, the fact that these historical inconsistencies were preserved by a transcription process that prioritises precision to guarantee historical accuracy, in line with best cataloguing practice.

A spreadsheet with a number of formatting variations for the name J E Mayall and the address 224-226 Regent Street.

Variations of the same name and address from the catalogue.

Unfortunately, addressing these inconsistencies is not an easy task, and it requires careful consideration. Eventually, after some internal experimentation and debate, we opted to preserve these variations rather than standardise them. We preferred to keep fuzzy but historically authentic data rather than risk introducing inaccuracies by editing it.
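Researchers who want to compare variants without changing the transcriptions can build a separate comparison key and leave the original strings untouched. The sketch below is one simple way of doing this, not a method used in the project. The `match_key` and `group_variants` functions and the sample spellings are assumptions for illustration.

```python
import re
from collections import defaultdict


def match_key(value: str) -> str:
    """Build a comparison key: lower-case ASCII letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def group_variants(values):
    """Group original strings by key without altering the strings themselves."""
    groups = defaultdict(list)
    for value in values:
        variants = groups[match_key(value)]
        if value not in variants:  # keep each distinct spelling once
            variants.append(value)
    return dict(groups)


# Illustrative spellings, not copied from the catalogue.
names = ["J E Mayall", "J. E. Mayall", "J.E. Mayall", "John Mayall"]
groups = group_variants(names)
```

Here the three punctuation variants fall into one group, while "John Mayall" stays separate. A key this crude only catches differences in spacing, punctuation and capitalisation, and it discards accented characters, so deciding whether two groups refer to the same person still needs human judgement.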

The combined, structured dataset is now available on GitHub and can be downloaded as a collection of TSV or JSON files, or as a single combined CSV file.

The dataset provides a more accessible and useful way into the COPY 1 collection, and will allow researchers to approach the collection in ways that have until now been extremely difficult. We hope that this new dataset will open up exciting new avenues of research, as our recent hackathon has already shown.

Collaborative experimentation through a hackathon

On 27-28 January 2025, a group of enthusiastic researchers with a range of digital humanities expertise joined us at The National Archives for the 'Collaborative Digital Exploration: exploring the copyright world of early photography' hybrid hackathon. Using an early version of the dataset referred to above, the participants split into groups and worked together to explore interesting ways to process, enrich and visualise the metadata.

Named Entity Recognition and Semantic Classification

This project extracted names of entities, specifically people, from the metadata and experimented with ways to link them to WikiData, opening up opportunities for enriching the data and improving ways of searching and exploring through linked data. Participants also explored methods of semantic tagging to classify terms found in the photograph descriptions.

Identifying, visualising, and linking place names

The team in this project explored ways to identify place names using natural language processing and WikiData gazetteers, to open up opportunities for using the metadata for more geographical analysis and discovery.

This project explored how to investigate the thorny issue of mixed-media registrations (where items registered could be photographs of artworks, engravings of photographs, or various other connected forms). Using Google NotebookLM, the team experimented with classifying the different forms and adding cultural and historical context to enrich the metadata.

From description to images and from images to description: exploring generative AI

This experiment tested tools like ChatGPT and Google Gemini to generate descriptive metadata and keyword tags. While promising, the team noted issues with hallucinations and factual inaccuracies in the generated metadata.

Accessing the collection through open standards: a IIIF viewer

A IIIF (International Image Interoperability Framework) viewer was developed to navigate and annotate digitised COPY 1 images and the team discussed the potential benefits of this mode of access and exploration.

Stereoscopic photographs viewer

The team created animated GIFs from pairs of stereoscopic photographs to simulate 3D viewing.

Why this is important

While not always leading to a tangible final output, hackathons are an exciting way to develop and test ideas. They are spaces for free exploration and collaboration where obstacles and desirables are discussed.

The experiments outlined above show that the dataset offers a unique opportunity to explore the world of early photography and the history of intellectual property. It can be used to enrich other existing metadata collections, while also being enriched by them. Spatial distribution of photographic studios and photographers, relationships between copyright owners and authors, recurring patterns in photographic practices and visual culture, relationships between textual captions and visual images - all of these ideas and many more can be investigated by engaging with this dataset, with a bit of creativity, knowledge and the necessary technical skills.

The hackathon gave us the opportunity to set aside time to create the dataset and make it public, while also testing its usefulness with participants with whom we have created new relationships.

It also reinforced the importance of catalogue metadata for research purposes. While the hackathon’s main focus was the photographic subset within COPY 1, The National Archives holds many other collections that could be enhanced through a similar process and used to open up new research avenues. Proving the importance of metadata for research purposes is yet another way to imagine new research scenarios and to highlight the invaluable and essential work of archivists and volunteers, who have generously contributed to making these records accessible over time.

Accessing the dataset and future research

The dataset is now openly available and downloadable from GitHub and anyone is welcome to engage with it independently.

Some of the research ideas we’d like to explore in more detail in future are:

  • Entity extraction, ontologies
  • Wikimedia enrichment, wikithons
  • Digital storytelling
  • Digital annotation
  • Use of metadata as research data
  • Opportunities and limits of working with fuzzy data
  • Interactive interfaces
  • History of intellectual property and early photography

We are also interested in collecting examples of digital explorations and welcome submissions of ideas to enrich the portfolio of research examples on GitHub.

If you have ideas to share, or are interested in exploring collaborative research, please do get in touch by contacting Katherine Howells, Giorgia Tolfo or research@nationalarchives.gov.uk.
