Glossary of web archiving terms

A

Access – In an archived website, being able to navigate to or view a captured document or page.

Ajax – A technique (Asynchronous JavaScript and XML) that uses JavaScript to exchange data with a server in the background, allowing a page to update without reloading.
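
As a minimal illustrative sketch (the endpoint URL is a placeholder, not a real service), an Ajax-style exchange using the modern fetch API looks like this:

```javascript
// Request data from a server in the background and use it
// without a full page reload; the URL below is a placeholder.
fetch("https://example.gov.uk/api/status")
  .then(response => response.json())   // parse the JSON body
  .then(data => console.log(data))     // use the data in the page
  .catch(err => console.error("Request failed:", err));
```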

ARC / WARC file – The storage file formats used in the web archive.
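
A WARC file stores a sequence of records, each beginning with a short header describing the capture. An abridged, illustrative response-record header (the URI, date and length are placeholders) might look like:

```
WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.sitename.gov.uk/
WARC-Date: 2022-04-05T12:00:00Z
Content-Type: application/http; msgtype=response
Content-Length: 1234
```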

B

Backup (web) – An exact copy of some or all of a website’s pages and resources (and, in some cases, functionality), kept in the same formats as the live site, which can be used to fully restore the website in the event of problems or deletion.

C

Capture/Crawl (or harvest, snapshot, archive) – The process of copying digital information from the live web to the web archive using a crawler.

Content freeze (or publishing freeze) – Stopping any changes and updates being made to a website for a limited period of time.

Crawler (or spider, robot, scraper) – An internet bot (a small piece of software) that the UK Government Web Archive (UKGWA) instructs to crawl target domains. It follows links and captures the information it finds on them.
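
As a highly simplified sketch only (a real crawler is far more robust), the core follow-links-and-capture loop might look like this in JavaScript, assuming Node 18+ for the built-in fetch and using a placeholder seed URL:

```javascript
// A toy crawler: start from a seed URL, fetch pages, extract links,
// and follow only those on the target domain. Real crawlers add
// politeness delays, robots.txt checks and proper HTML parsing.
const seed = "https://www.sitename.gov.uk/";   // placeholder seed URL
const targetHost = new URL(seed).host;
const visited = new Set();
const queue = [seed];

async function crawl(limit = 10) {
  while (queue.length > 0 && visited.size < limit) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    const html = await (await fetch(url)).text(); // "capture" the page
    // Naive link extraction; a production crawler uses a real HTML parser.
    for (const [, href] of html.matchAll(/href="([^"]+)"/g)) {
      try {
        const link = new URL(href, url);          // resolve relative links
        if (link.host === targetHost) queue.push(link.href);
      } catch { /* ignore malformed links */ }
    }
  }
}

crawl().catch(console.error);
```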

Crawler trap – A feature of a website that reduces a crawler’s ability to follow links and gather information. Traps can be intentional (for example, to fight spambots) or an unintentional result of poor web design.

Cybersquatting – When ownership of a web domain lapses and it is purchased and used by a different organisation. Cybersquatting can be malicious (for financial gain or fraud) or simply a source of confusion for visitors.

D

Datestamp (or timestamp) – The part of an archived URL containing a datestamp in the format YYYYMMDDHHMMSS, identifying the date and time when the resource was captured from the live web. For example, this resource was captured at 22:40:31 on 5 November 1996: https://webarchive.nationalarchives.gov.uk/ukgwa/19961105224031/https://www.hm-treasury.gov.uk:80/.
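
As a small illustrative sketch, the 14-digit datestamp can be pulled out of an archived URL and converted to a date like this (the parsing logic is ours, not part of the archive’s tooling):

```javascript
// Extract the YYYYMMDDHHMMSS datestamp from an archived URL and
// convert it to a Date; the URL is the example from this entry.
const archived = "https://webarchive.nationalarchives.gov.uk/ukgwa/19961105224031/https://www.hm-treasury.gov.uk:80/";
const ts = archived.match(/\/(\d{14})\//)[1];
const captured = new Date(Date.UTC(
  Number(ts.slice(0, 4)),       // year
  Number(ts.slice(4, 6)) - 1,   // month (JavaScript months are 0-based)
  Number(ts.slice(6, 8)),       // day
  Number(ts.slice(8, 10)),      // hours
  Number(ts.slice(10, 12)),     // minutes
  Number(ts.slice(12, 14))      // seconds
));
console.log(captured.toISOString()); // 1996-11-05T22:40:31.000Z
```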

Domain (or target domain, target website) – The website that is targeted for crawling.

Dynamic content (also dynamically generated content) – Website content that constantly or regularly changes based on user interactions, timing and other parameters that determine what content is delivered to the user.

E

Exceptional crawl – A crawl outside of the regular scheduled crawl.

H

Hosting – Allocated space on a web server that allows websites, documents and applications to be stored and accessed.

I

Index page (or Timeline page) – A webpage which provides links to all captures of a resource held in the archive.

J

JavaScript – A computer programming language commonly used to create interactive effects within web browsers.

M

Microsite (or subsite) – A smaller site hosted within or as part of a larger site. Usually located in a subdirectory of the larger site’s domain, for example: https://domain.gov.uk/microsite/, but can also be located on a sub-domain.

MirrorWeb – The service provider which carries out the technical web archiving process and hosts web archive data on behalf of The National Archives.

P

Partial crawl – A crawl of limited parts of a website.

Patching – Re-sending a crawler to capture resources missed in the initial crawl. Part of QA.

POST – An HTTP method used to send data to a server, for example when submitting a form (in contrast to GET, which requests data).
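
As a minimal illustrative sketch (the endpoint and payload are placeholders), sending data with POST from JavaScript looks like this:

```javascript
// Send data to a server in the body of a POST request;
// the endpoint and payload below are placeholders.
fetch("https://example.gov.uk/api/search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ query: "web archiving" })
})
  .then(response => response.json())
  .then(results => console.log(results))
  .catch(err => console.error("Request failed:", err));
```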

Public domain – Materials that are not protected by intellectual property laws such as copyright, trademark, or patent laws. They belong to the public and cannot be owned by any individual or organisation.

Published – Available to the public through the UK Government Web Archive at a permanent URL.

Q

Quality Assurance (QA) – A check for accuracy and completeness. Uses a combination of automatic and manual tools and techniques.

R

Remote harvesting – The act of harvesting content over the live web, which can be carried out from any geographical location in the world.

Resource (web) – Any file which is part of a website.

robots.txt (or robots exclusion standard, robots exclusion protocol) – A small file on a website which tells web crawlers and other web robots how the site should be crawled. Usually located at root folder level, for example: https://www.sitename.gov.uk/robots.txt.
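An illustrative robots.txt (the paths and sitemap location are placeholders) might read:

```
# Ask all crawlers to avoid one directory
User-agent: *
Disallow: /private/

Sitemap: https://www.sitename.gov.uk/sitemap.xml
```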

Root (URL) – The start or source of all other pages on a website. Most often the homepage, for example: https://www.gov.uk/.

S

Scheduled crawl – A crawl of a website performed at regular intervals.

Seed (URL) – A URL the crawler is instructed to capture.

Sitemap – A list of pages within a website.

Static IP address – A static Internet Protocol (IP) address is a fixed, unchanging number assigned to a computer or server.

Sub-domain – A prefix to the main domain name, for example: https://subdomain.domain.gov.uk/. Often used for microsites or subsites.

Supplementary URL list – A text file listing URLs on a website, especially those that are difficult for crawlers to reach or that are not hosted on the target domain. An alternative to an XML sitemap.
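
Such a list is simply plain text with one URL per line; for example (all URLs are placeholders):

```
https://www.sitename.gov.uk/hard-to-find-page
https://www.sitename.gov.uk/search?view=archive-list
https://assets.sitename.gov.uk/documents/annual-report.pdf
```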

T

Takedown – The process of removing public access to resources hosted in the web archive.

U

URL/URLs – Stands for “Uniform Resource Locator”. The address of a web resource, for example: https://www.cps.gov.uk/ or https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1066250/covid-private-testing-providers-general-testing-5-4-22.csv.

User-agent string – A line of text giving information about which browser is being used, which version, and on which operating system.
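
For example, a typical desktop browser identifies itself with a string like this (the version numbers are illustrative):

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36
```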

X

XML Sitemap – A file that lists the pages within a website in XML (eXtensible Markup Language) format to help crawlers index them.
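
A minimal sitemap following the sitemaps.org protocol (the URL and date are placeholders) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.sitename.gov.uk/</loc>
    <lastmod>2022-04-05</lastmod>
  </url>
</urlset>
```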