In October of last year, I began my PhD journey with The National Archives and King’s College London, on a project exploring how search tools can better support access to digital case law collections. Drawing on my background in linguistics, I have been particularly interested in how language itself can be a barrier to access.
It is no surprise to anyone who has worked with the law or read legal documents that the language used in court documents can differ significantly from the language we use day to day. Whether it is the complex grammar and sentence structures found in legislation, or the particular vocabulary and jargon used by judges, there are many ways in which legal documents can be difficult for people without legal training to understand.
This is where we propose a more accessible search function, one that can understand the nuances of language and the meaning and intent behind a search query: semantic search.
Semantic Search: From Keywords to Concepts
Currently, the Find Case Law website uses a keyword-based search function, meaning that the system finds exact matches to the search query in the source documents. This design assumes not only that users know what to search for, but also that their search terms will align with the language used by judges.
This limitation is not unique to legal repositories like Find Case Law. Many other digital collections, particularly within the cultural heritage sector, contain documents written in nuanced, jargon-heavy, or outdated language, and can be difficult for people without subject-specific knowledge to search.
One answer to this is a semantic search engine. This technology differs from keyword search in that, before matching the search term directly to documents in its collection, the system attempts to understand the meaning and intent behind the search query. Semantic search engines use a variety of tools to achieve this, such as lemmatisation (reducing words to their root form), entity recognition (e.g. recognising 'Supreme Court' as a legal body, not just two separate words), and disambiguation (distinguishing between 'Apple' the company and the fruit). Crucially, they also detect synonyms and related terms, like 'vehicle' and 'car' or 'damages' and 'compensation'.
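To make this concrete, here is a minimal sketch of two of those preprocessing steps, lemmatisation and entity recognition, using the spaCy library. The library choice and the example sentence are ours for illustration; semantic search engines implement these steps in many different ways.

```python
import spacy

# Assumes the small English pipeline has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The Supreme Court awarded damages after the vehicles were damaged.")

# Lemmatisation: reduce each word to its root form,
# e.g. 'awarded' -> 'award', 'vehicles' -> 'vehicle'
print([(token.text, token.lemma_) for token in doc])

# Entity recognition: 'Supreme Court' is treated as one named body,
# not two unrelated words
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('The Supreme Court', 'ORG')]
```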

A search for 'apple fruit' on Find Case Law reveals the limitations of keyword-based searches.
What we built and why we started small
With the first year of this project coming to an end, we are excited to share our first prototype: a lightweight synonym-matching tool designed to surface legal terminology related to everyday search terms. For instance, a user searching for 'divorce' may be shown legal alternatives like '[decree] nisi', while a search for 'compensation' might suggest terms like 'damages' or 'reparation'. In short, the tool surfaces terms that are semantically similar to the query but more likely to appear in a legal context.
The tool works by matching custom-trained word2vec Word Embeddings from the Find Case Law corpus to those trained on the British National Corpus, which we use as our reference corpus of standard British English (a sketch of one way this matching step can work appears after the discussion below).
What are Word Embeddings?
Word Embeddings are a type of language model that turns words into numerical vectors, allowing systems to calculate how similar two words are, even if they aren’t the same. This means a search tool can understand that 'termination' and 'dismissal' are conceptually related, even if they don’t share exact wording. There are a number of embedding models available, such as word2vec and GloVe.
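As a minimal sketch of what this looks like in practice, assuming the gensim library and a toy tokenised corpus (the prototype's actual training setup may differ):

```python
from gensim.models import Word2Vec

# Toy corpus: in practice, each 'sentence' would be a tokenised
# passage from a judgment, and there would be millions of them
sentences = [
    ["the", "tribunal", "awarded", "damages", "for", "unfair", "dismissal"],
    ["compensation", "was", "paid", "following", "the", "termination"],
    ["the", "court", "ordered", "reparation", "for", "the", "damage"],
]

# min_count=1 only because the toy corpus is tiny
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Each word is now a 100-dimensional vector, so similarity becomes a number
print(model.wv.similarity("termination", "dismissal"))
print(model.wv.most_similar("damages", topn=3))
```

With a toy corpus like this the numbers are meaningless; trained on millions of tokens, these similarity scores begin to track genuine semantic relationships.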
Why work with the seemingly simple and outdated technology of Word Embeddings? The aim of this first prototype was to test whether we could build something lightweight and simple that could surface semantic similarities across relatively small corpora. Currently, the Find Case Law corpus contains around 250 million tokens. This is modest, but an ideal size for a word2vec model, which was initially tested on a corpus of 320 million tokens.
Word Embeddings are thus much better suited to this type of lightweight task than more recent, heavyweight transformer-based models like BERT (Devlin et al. 2019), which typically require training corpora of multiple billions of tokens for effective training and fine-tuning.
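For illustration, here is a minimal sketch of the cross-corpus matching step described above. It assumes two trained gensim models (the file names are hypothetical) and aligns the two vector spaces with an orthogonal Procrustes rotation over their shared vocabulary; this particular alignment method is our illustrative assumption, not necessarily the one used in the prototype.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical file names; the trained models themselves are not shown here
fcl = Word2Vec.load("fcl_word2vec.model").wv   # Find Case Law embeddings
bnc = Word2Vec.load("bnc_word2vec.model").wv   # British National Corpus embeddings

# Words appearing in both corpora anchor the two spaces together
shared = [w for w in bnc.index_to_key if w in fcl.key_to_index]

A = np.stack([bnc[w] for w in shared])  # everyday-English space
B = np.stack([fcl[w] for w in shared])  # legal-English space

# Orthogonal Procrustes: the rotation that best maps BNC vectors onto FCL vectors
u, _, vt = np.linalg.svd(A.T @ B)
rotation = u @ vt

def legal_neighbours(term, topn=5):
    """Map an everyday term into the legal space and return its nearest legal terms."""
    mapped = bnc[term] @ rotation
    return fcl.similar_by_vector(mapped, topn=topn)

# e.g. legal_neighbours("compensation") might surface 'damages' or 'reparation'
```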

A visualisation of the embeddings of the top 50 words in Find Case Law. This allows us to visualise semantic similarity between words in vector space: the closer two vectors are to one another, the more semantically similar those words are.
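A plot like this can be produced by projecting the high-dimensional word vectors down to two dimensions. Here is a minimal sketch, assuming a trained gensim model (hypothetical file name) and PCA as the projection method; the actual visualisation may use a different technique, such as t-SNE.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import Word2Vec

model = Word2Vec.load("fcl_word2vec.model")   # hypothetical file name

# The 50 most frequent words in the corpus and their vectors
words = model.wv.index_to_key[:50]
vectors = model.wv[words]

# Project from the model's dimensionality down to 2 for plotting
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.title("Top 50 Find Case Law words in two-dimensional vector space")
plt.show()
```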
What’s next? Prototype to practical tool
As I enter the second year of the project, and before we tackle the move to transformer-based technologies, the next major milestone will be user testing. This will help us understand whether the prototype meets the needs of users, particularly those without a legal background.
Insights from this testing will inform the development of a second prototype, in which we hope to move to more robust transformer-based models that can capture even more of the nuances in the semantic similarities and differences between legal and everyday language.
As it stands, the prototype is password-protected to avoid confusion with the existing Find Case Law website. This version is purely experimental: the goal is to test ideas rather than replace existing infrastructure. More information, along with all relevant metadata for this version, can be found on the Oxford Text Archive.

The semantic search prototype, showing alternative terms for the query 'stalking', such as 'harassment', 'harass' and 'restraining'.
This project has been an exciting exercise in interdisciplinary collaboration. Drawing on expertise in law, archiving, linguistics, and computer science, we aim to demonstrate that semantic search is not only feasible for small digital collections, but essential to their accessibility.
I look forward to sharing more as we develop future iterations of the tool and continue bridging the gap between specialist legal language and everyday English.
Further reading
- Wilson, Caitlin, Barbara McGillivray, Nicola Welch, and Marton Ribary. 2025. ‘Enhancing Legal Search with Word Embeddings in The National Archives’ Find Case Law Service.’ Paper presented at the UK-Ireland Digital Humanities Association Annual Conference, June 17–18, 2025.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. ‘Efficient Estimation of Word Representations in Vector Space’. arXiv preprint arXiv:1301.3781.
- Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. ‘GloVe: Global Vectors for Word Representation’. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
About the author
Caitlin Wilson is a first-year PhD student at King’s College London (supervisor: Dr Barbara McGillivray). Her project ‘Lost for Words: semantic search in The National Archives’ Find Case Law service’ is a collaborative doctoral award with The National Archives and funded by the London Arts and Humanities Partnership.