Metadata from The National Archives’ online catalogue, Discovery, offers vast potential for analysis and enrichment using emerging digital technologies. In particular, metadata from our collection of photographic copyright records can, with careful preparation and analysis, open up new avenues for research into the history of photography and intellectual property.
The research potential of copyright records
The information contained in records of the registration of copyright with the Stationers’ Company between 1862 and 1912 is a potentially rich source for studying early British photographic history, as well as legal, business, and cultural history. This is because the information the records contain gives us an insight into the people and organisations involved in early photographic and other creative industries, their locations, their relationships with one another and their creative and commercial interests.
After the passing of the Fine Arts Copyright Act in 1862, people wishing to register copyright in a photograph (or another visual artwork) had to submit a form to the Stationers’ Company containing a description of the photograph, their name and address and the name and address of the photographer. In many cases they would also attach a copy of the photograph to the form before submitting. These forms and photographs were eventually transferred to the Public Record Office, and are now held at The National Archives in record series COPY 1.
From our collection: COPY 1. Title: Copyright Office: Entry Forms, etc. Date: 1837-1912.
Examples of a deposited image and its associated registration form
Gathering waterlilies by P H Emerson, 1885. Catalogue reference: COPY 1/373.
The registration form for Gathering Waterlilies.
As a result of this copyright registration system, in COPY 1 there are over 133,000 individual entry forms containing this valuable and fascinating information. Crucially, through the hard work of volunteers over several years, the entry forms for the registration of photographs have been fully transcribed into Discovery, making this rich metadata available to researchers.
Since this metadata has been made available on Discovery, researchers have been able to search the information using keywords. However, for a long time we have recognised that the research potential of this information goes much deeper than enabling simple keyword searching. Through the wide range of digital methods of data analysis available to us now, there could be many more opportunities to use this metadata to illuminate historical trends, surface new and little-known stories and much more. To invite new perspectives on how we could maximise the potential of this metadata, we have now released the full dataset on GitHub. Researchers can download it freely and experiment with it.
COPY1 hackathon dataset (github.com): download and experiment with the full dataset.
The release includes information on how the dataset was pre-processed and the choices made when structuring it, as well as some examples of work that can be done with it, developed as part of a hackathon held at The National Archives in January 2025.
Creation of the dataset
While raw catalogue data can be downloaded from Discovery, transforming it into a useful structured dataset is not a simple task.
Firstly, the exact pieces in the record series COPY 1 which contain photographs and are catalogued to item level need to be identified.
Secondly, downloads from Discovery are limited to 10,000 results. If you wish to export more metadata, the Discovery API needs to be used instead.
Finally, although records can be downloaded as XML (or CSV) from Discovery, as can be seen in the screenshot below, the most interesting information is represented as a single line in the “Description” field. To get the most out of these descriptions, it is best to transform them from semi-structured to structured data. In other words, to create a dataset where all the fields are neatly separated.
The description of Gathering Waterlilies as it appears in Discovery.
The same catalogue description broken down into a structure with fields neatly separated.
We achieved this using regular expressions and other methods; the process is described in more detail on GitHub.
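As an illustration of the kind of transformation involved, here is a minimal Python sketch. It is not the code used to build the dataset: the field labels (such as "Copyright owner of work"), the invented sample description and the `parse_description` helper are all assumptions made for this example, and the real catalogue wording may differ.

```python
import re

# Assumed field labels; the wording in the real catalogue may differ.
LABELS = {
    "owner": "Copyright owner of work",
    "author": "Copyright author of work",
    "form_completed": "Form completed",
    "registration_stamp": "Registration stamp",
}

# A single pattern that finds any of the labels followed by a colon.
LABEL_RE = re.compile(
    "(" + "|".join(re.escape(label) for label in LABELS.values()) + r"):\s*"
)


def parse_description(description: str) -> dict:
    """Split a single-line description into named fields."""
    fields = {key: None for key in LABELS}
    matches = list(LABEL_RE.finditer(description))
    # Text before the first label is treated as the description of the work.
    first = matches[0].start() if matches else len(description)
    fields["description"] = description[:first].strip().rstrip(".")
    key_for_label = {label: key for key, label in LABELS.items()}
    for i, match in enumerate(matches):
        # A field's value runs up to the next label, or to the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(description)
        value = description[match.end():end].strip().rstrip(".")
        fields[key_for_label[match.group(1)]] = value
    return fields


# Invented sample, not a real catalogue entry.
SAMPLE = (
    "Photograph of a river scene with boats. "
    "Copyright owner of work: Jane Example, 1 Example Street, London. "
    "Copyright author of work: John Sample, 2 Sample Road, Norwich. "
    "Form completed: 1 May 1886. Registration stamp: 1886 May 5."
)

if __name__ == "__main__":
    for name, value in parse_description(SAMPLE).items():
        print(f"{name}: {value}")
```

Splitting on labels rather than on punctuation keeps commas and full stops inside names and addresses intact. The trade-off is that any entry whose labels are worded differently would need its own handling.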
Yet, once the fields were neatly separated, new problems became apparent. For example, in many cases addresses and names appear in different variations which are not easily reconcilable. This variation is caused by a range of factors: changes in names and addresses over time, different people filling in forms at different times and, most importantly, the fact that these historical inconsistencies were preserved by a transcription process which prioritises precision to guarantee historical accuracy, in line with best cataloguing practice.
Variations of the same name and address from the catalogue.
Unfortunately, addressing these inconsistencies is not an easy task, and it requires careful consideration. After some internal experimentation and debate, we opted to preserve these variations rather than standardise them. It was preferable to keep fuzzy but historically authentic data than to risk introducing inaccuracies by editing them.
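Keeping the variations in the data still leaves room to group likely matches at analysis time. As a minimal sketch (not part of the released dataset or its processing), the Python below flags pairs of strings that probably refer to the same person or address, while leaving the original values untouched. The sample strings, the `likely_variants` helper and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(text: str) -> str:
    """Lowercase and drop punctuation, for comparison only."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def likely_variants(values, threshold=0.85):
    """Return pairs of distinct original strings whose normalised forms are similar."""
    unique = sorted(set(values))
    pairs = []
    for a, b in combinations(unique, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            # The original strings are reported unchanged.
            pairs.append((a, b, round(score, 2)))
    return pairs


# Invented examples, not taken from the catalogue.
names = [
    "Jane Example, 1 Example Street, London",
    "Jane Example, 1 Example St., London",
    "John Sample, 2 Sample Road, Norwich",
]

if __name__ == "__main__":
    for a, b, score in likely_variants(names):
        print(f"{score}: {a!r} ~ {b!r}")
```

Because every distinct value is compared with every other, this approach suits a shortlist of candidates rather than the whole dataset at once. Any suggested match would still need checking by someone who knows the records.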
The combined, structured dataset is now available on GitHub and can be downloaded as a collection of TSV or JSON files, or as a single combined CSV.
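For readers who want a starting point, the sketch below reads delimited data with Python's standard library and counts how often each value appears in a column. The column names (`reference`, `owner`, `author`) and the sample rows are invented for this example; check the documentation on GitHub for the actual file names and fields.

```python
import csv
import io
from collections import Counter


def load_rows(handle, delimiter=","):
    """Read delimited text from an open file handle into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def count_values(rows, column):
    """Count how often each non-empty value appears in the given column."""
    return Counter(row[column] for row in rows if row.get(column))


# Invented sample with hypothetical column names, standing in for a real file.
SAMPLE_CSV = (
    "reference,owner,author\n"
    "EXAMPLE/1,Jane Example,John Sample\n"
    "EXAMPLE/2,Jane Example,Jane Example\n"
    "EXAMPLE/3,John Sample,John Sample\n"
)

if __name__ == "__main__":
    rows = load_rows(io.StringIO(SAMPLE_CSV))
    for owner, total in count_values(rows, "owner").most_common():
        print(owner, total)
```

To read a downloaded file instead, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_rows`, using `delimiter="\t"` for the TSV files.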
The dataset provides a more accessible and useful way into the COPY 1 collection, and will allow researchers to approach the collection in ways that have until now been extremely difficult. We hope that this new dataset will open up exciting new avenues of research, as our recent hackathon has already shown.
Collaborative experimentation through a hackathon
On 27-28 January 2025, a group of enthusiastic researchers with a range of digital humanities expertise joined us at The National Archives for the 'Collaborative Digital Exploration: exploring the copyright world of early photography' hybrid hackathon. Using an early version of the dataset referred to above, the participants split into groups and worked together to explore interesting ways to process, enrich and visualise the metadata.
Named Entity Recognition and Semantic Classification
This project extracted names of entities, specifically people, from the metadata and experimented with ways to link them to Wikidata, opening up opportunities for enriching the data and improving ways of searching and exploring through linked data. Participants also explored methods of semantic tagging to classify terms found in the photograph descriptions.
Identifying, visualizing, and linking place names
The team in this project explored ways to identify place names using natural language processing and Wikidata gazetteers, to open up opportunities for using the metadata for more geographical analysis and discovery.
Artform classification in copyright descriptions
This project explored how to investigate the thorny issue of mixed-media registrations (where items registered could be photographs of artworks, engravings of photographs, or various other connected forms). Using Google NotebookLM, the team experimented with classifying the different forms and adding cultural and historical context to enrich the metadata.
From description to images and from images to description: exploring generative AI
This experiment tested tools like ChatGPT and Google Gemini to generate descriptive metadata and keyword tags. While promising, the team noted issues with hallucinations and factual inaccuracies in the generated metadata.
Accessing the collection through open standards: a IIIF viewer
A IIIF (International Image Interoperability Framework) viewer was developed to navigate and annotate digitised COPY 1 images and the team discussed the potential benefits of this mode of access and exploration.
Stereoscopic photographs viewer
The team created animated GIFs from pairs of stereoscopic photographs to simulate 3D viewing.
Why this is important
While not always leading to a tangible final output, hackathons are an exciting way to develop and test ideas. They are spaces for free exploration and collaboration where obstacles and desired outcomes are discussed.
The experiments outlined above show that the dataset offers a unique opportunity to explore the world of early photography and the history of intellectual property. It can be used to enrich other existing metadata collections, while also being enriched by them. Spatial distribution of photographic studios and photographers, relationships between copyright owners and authors, recurring patterns in photographic practices and visual culture, relationships between textual captions and visual images - all of these ideas and many more can be investigated by engaging with this dataset, with a bit of creativity, knowledge and the necessary technical skills.
The hackathon gave us the opportunity to set aside time to create the dataset and make it public, while also testing its usefulness with participants with whom we have created new relationships.
Beyond that, it also reinforced the importance of catalogue metadata for research purposes. While the hackathon’s main focus was the photographic subset within COPY 1, The National Archives holds many other collections which could be enhanced through a similar process and used to open up new research avenues. Proving the importance of metadata for research purposes is yet another way to imagine new research scenarios and to highlight the invaluable and essential work of archivists and volunteers, who have generously contributed to making these records accessible over time.
Accessing the dataset and future research
The dataset is now openly available and downloadable from GitHub and anyone is welcome to engage with it independently.
Some of the research ideas we’d like to explore in more detail in future are:
- Entity extraction, ontologies
- Wikimedia enrichment, wikithons
- Digital storytelling
- Digital annotation
- Use of metadata as research data
- Opportunities and limits of working with fuzzy data
- Interactive interfaces
- History of intellectual property and early photography
We are also interested in collecting examples of digital explorations and welcome submissions of ideas to enrich the portfolio of research examples on GitHub.
If you have ideas to share, or are interested in exploring collaborative research, please get in touch by contacting Katherine Howells, Giorgia Tolfo or research@nationalarchives.gov.uk.