Glossary of web archiving terms


Access – In an archived website, being able to navigate to or view a captured document or page.

Ajax – A technique that uses JavaScript to exchange data with a server in the background, without reloading the page.

ARC / WARC file – The storage file formats used in the web archive.
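For illustration, a WARC record begins with a block of named header fields followed by the captured payload. This sketch uses placeholder values (the URI, date and record ID are hypothetical):

```
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.example.gov.uk/
WARC-Date: 2020-01-01T00:00:00Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: application/http; msgtype=response
Content-Length: 1234

[captured HTTP response headers and page content follow here]
```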


Backup (web) – An exact copy of some or all of a website’s pages and resources (and in some cases, functionality) in the same formats as live which can be used to fully restore a website in the event of problems or deletion.


Capture/Crawl (or harvest, snapshot, archive) – The process of copying digital information from the live web to the web archive using a crawler.

Content freeze (or publishing freeze) – Stopping any changes and updates being made to a website for a limited period of time.

Crawler (or spider, robot, scraper) – An internet bot (small piece of software) that the UK Government Web Archive (UKGWA) instructs to crawl target domains. It follows links and captures the information it finds on them.

Crawler trap – A feature of a website that reduces a crawler’s ability to follow links and gather information. Traps can be intentional (to fight spambots) or unintentional (the result of poor web design).

Cybersquatting – When ownership of a web domain lapses and it is purchased and used by a different organisation. Cybersquatting can be malicious (financial gain or fraud) or simply a source of potential confusion for visitors.


Datestamp (or timestamp) – The configuration of a URL to contain a datestamp in the format YYYYMMDDHHMMSS, identifying the date and time when the resource was captured from the live web. For example, a resource captured at 22:40:31 on 5 November 1996 carries the datestamp 19961105224031.
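As an illustration, the 14-digit datestamp can be converted to and from an ordinary date using Python's standard library:

```python
from datetime import datetime

# Parse a 14-digit datestamp (YYYYMMDDHHMMSS) into a datetime object.
stamp = "19961105224031"  # 22:40:31 on 5 November 1996
captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
print(captured.isoformat())  # 1996-11-05T22:40:31

# Format a datetime back into a datestamp.
print(captured.strftime("%Y%m%d%H%M%S"))  # 19961105224031
```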

Domain (or target domain, target website) – The website that is targeted for crawling.

Dynamic content (also dynamically generated content) – Website content that constantly or regularly changes based on user interactions, timing and other parameters that determine what content is delivered to the user.


Exceptional crawl – A crawl outside of the regular scheduled crawl.


Hosting – Allocated space on a web server that allows websites, documents and applications to be stored and accessed.


Index page (or Timeline page) – A webpage which provides links to all captures of a resource held in the archive.


JavaScript – A computer programming language commonly used to create interactive effects within web browsers.


Microsite (or subsite) – A smaller site hosted within or as part of a larger site. Usually located as a directory under the larger site’s domain name, but it can also be located on a sub-domain.

MirrorWeb – The service provider which carries out the technical web archiving process and hosts web archive data on behalf of The National Archives.


Partial crawl – A crawl of limited parts of a website.

Patching – Re-sending a crawler to capture resources missed in the initial crawl. Part of QA.

POST – An HTTP request method used to send data to a server, for example when a form is submitted.
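As a sketch, a raw HTTP POST request carries its data in the body of the request, below the headers. The URL and form fields here are hypothetical:

```
POST /contact HTTP/1.1
Host: www.example.gov.uk
Content-Type: application/x-www-form-urlencoded
Content-Length: 24

name=Alice&message=Hello
```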

Public domain – Materials that are not protected by intellectual property laws such as copyright, trademark or patent law. They belong to the public and cannot be owned by any individual or organisation.

Published – Available to the public through the UK Government Web Archive at a permanent URL.


Quality Assurance (QA) – A check for accuracy and completeness. Uses a combination of automatic and manual tools and techniques.


Remote Harvesting – The act of harvesting content over the web from servers hosted in any geographical location, without needing direct access to them.

Resource (web) – Any file which is part of a website.

robots.txt (or robots exclusion standard, robots exclusion protocol) – A small text file on a website which tells web crawlers and other web robots how the site should be crawled. It is usually located at the root folder level of the domain.
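For illustration, a minimal robots.txt might read as follows (the disallowed path and sitemap URL are hypothetical):

```
# Apply to all crawlers; keep them out of /search/
User-agent: *
Disallow: /search/

Sitemap: https://www.example.gov.uk/sitemap.xml
```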

Root (URL) – The start or source of all other pages on a website; most often the homepage.


Scheduled Crawl – A crawl of a website performed at regular intervals.

Seed (URL) – A URL the crawler is instructed to capture.

Sitemap – A list of pages within a website.

Static IP address – A static Internet Protocol (IP) address is a permanent number assigned to a computer which does not change over time.

Sub-domain – Often used for microsites or subsites. A label added in front of the main domain name, rather than a directory within it.

Supplementary URL list – A text file listing URLs on a website, especially those that are difficult for crawlers to reach or that are not hosted on the target domain. An alternative to an XML sitemap.
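For illustration, a supplementary URL list is simply a plain-text file with one URL per line (the addresses below are hypothetical):

```
https://www.example.gov.uk/reports/annual-report-2020.pdf
https://cdn.example-assets.com/images/logo.png
```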


Takedown – The process of removing public access to resources hosted in the web archive.


URL/URLs – Stands for “Uniform Resource Locator”. The address of a web resource.

User-agent string – A line of text identifying which browser is being used, its version, and the operating system it runs on.
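A typical browser user-agent string takes a form like the following (version numbers are illustrative):

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
```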


XML Sitemap – A file that lists the pages within a website in XML (eXtensible Markup Language) format to help crawlers index them.
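For illustration, a minimal XML sitemap listing two pages might look like this (the domain is hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.gov.uk/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.gov.uk/about/</loc>
  </url>
</urlset>
```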