A
Access – In an archived website, being able to navigate to or view a captured document or page.
Ajax – A JavaScript technique for exchanging data with a server asynchronously, without reloading the whole page.
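For illustration, a minimal sketch of an Ajax-style request using the browser's built-in fetch API; the endpoint URL is a hypothetical placeholder:

    // Request JSON data from a server without reloading the page.
    // '/api/news' is a hypothetical endpoint, for illustration only.
    fetch('/api/news')
      .then(response => response.json())
      .then(data => console.log(data))
      .catch(error => console.error('Request failed:', error));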
ARC / WARC file – The storage file formats used in the web archive.
B
Backup (web) – An exact copy of some or all of a website’s pages and resources (and, in some cases, functionality) in the same formats as the live site. A backup can be used to fully restore a website in the event of problems or deletion.
C
Capture/Crawl (or harvest, snapshot, archive) – The process of copying digital information from the live web to the web archive using a crawler.
Content freeze (or publishing freeze) – Halting all changes and updates to a website for a limited period of time.
Crawler (or spider, robot, scraper) – An internet bot (small piece of software) that the UKGWA instructs to crawl target domains. It follows links and captures information it finds on them.
Crawler trap – A feature of a website that reduces a crawler’s ability to follow links and gather information. Traps can be intentional (for example, to fight spambots) or the unintentional result of poor web design.
Cybersquatting – When ownership of a web domain lapses and it is purchased and used by a different organisation. Cybersquatting can be malicious (financial gain or fraud) or simply a source of confusion for visitors.
D
Datestamp (or timestamp) – The configuration of a URL to contain a datestamp in the format YYYYMMDDHHMMSS, identifying the date and time when the resource was captured from the live web. For example, this resource was captured at 22:40:31 on 5 November 1996: https://webarchive.nationalarchives.gov.uk/ukgwa/19961105224031/https://www.hm-treasury.gov.uk:80/.
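As a sketch of how the datestamp breaks down, the JavaScript below unpacks the timestamp from the Treasury capture above:

    // Split a UKGWA datestamp (YYYYMMDDHHMMSS) into its parts.
    const stamp = '19961105224031';
    const year   = stamp.slice(0, 4);   // 1996
    const month  = stamp.slice(4, 6);   // 11 (November)
    const day    = stamp.slice(6, 8);   // 05
    const hour   = stamp.slice(8, 10);  // 22
    const minute = stamp.slice(10, 12); // 40
    const second = stamp.slice(12, 14); // 31
    console.log(`${day}/${month}/${year} ${hour}:${minute}:${second}`);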
Domain (or target domain, target website) – The website that is targeted for crawling.
Dynamic content (also dynamically generated content) – Website content that constantly or regularly changes based on user interactions, timing and other parameters that determine what content is delivered to the user.
E
Exceptional crawl – A crawl outside of the regular scheduled crawl.
H
Hosting – Allocated space on a web server that allows websites, documents and applications to be stored and accessed.
I
Index page (or Timeline page) – A webpage which provides links to all captures of a resource held in the archive.
J
JavaScript – A computer programming language commonly used to create interactive effects within web browsers.
M
Microsite (or subsite) – A smaller site hosted within or as part of a larger site. Usually located in a directory under the larger site’s domain name, for example: https://domain.gov.uk/microsite/, but it can also be located on a sub-domain.
MirrorWeb – The service provider which carries out the technical web archiving process and hosts web archive data on behalf of The National Archives.
P
Partial crawl – A crawl of limited parts of a website.
Patching – Re-sending a crawler to capture resources missed in the initial crawl. Part of QA.
POST – An HTTP request method used to send data to a server, for example when submitting a form; commonly used by JavaScript to exchange data with a server.
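For illustration, a minimal sketch of a POST request sent from JavaScript with the fetch API; the endpoint and payload are hypothetical:

    // Send JSON data to a server using the HTTP POST method.
    // '/api/search' and the body are hypothetical, for illustration only.
    fetch('/api/search', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: 'web archive' })
    })
      .then(response => response.json())
      .then(results => console.log(results));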
Public domain – Materials that are not protected by intellectual property laws such as copyright, trademark, or patent laws. Such materials belong to the public and cannot be owned by any individual or organisation.
Published – Available to the public through the UK Government Web Archive at a permanent URL.
Q
Quality Assurance (QA) – A check for accuracy and completeness, using a combination of automatic and manual tools and techniques.
R
Remote harvesting – The act of harvesting content over the web, from any geographical location in the world.
Resource (web) – Any file which is part of a website.
robots.txt (or robots exclusion standard, robots exclusion protocol) – A small file on a website which tells web crawlers and other web robots how the site should be crawled. Usually located at root folder level, for example: https://www.sitename.gov.uk/robots.txt.
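As an illustration, a typical robots.txt might look like the sketch below; the disallowed paths are hypothetical:

    # Apply to all crawlers; keep them out of two hypothetical folders.
    User-agent: *
    Disallow: /admin/
    Disallow: /search/
    Sitemap: https://www.sitename.gov.uk/sitemap.xml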
Root (URL) – The start or source of all other pages on a website. Most often the homepage, for example: https://www.gov.uk/.
S
Scheduled crawl – A crawl of a website performed at regular intervals.
Seed (URL) – A URL the crawler is instructed to capture.
Sitemap – A list of pages within a website.
Static IP address – A static Internet Protocol (IP) address is a fixed numerical address permanently assigned to a device, rather than one that changes over time.
Sub-domain – A prefix to the main domain name, for example: https://subdomain.domain.gov.uk/. Often used for microsites or subsites.
Supplementary URL list – A text file listing URLs on a website, especially those that are difficult for crawlers to reach or that are not hosted on the target domain. An alternative to an XML sitemap.
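For illustration, a supplementary URL list is simply a plain text file with one URL per line; these entries are hypothetical:

    https://domain.gov.uk/reports/annual-report-2022.pdf
    https://assets.otherdomain.gov.uk/downloads/dataset.csv
    https://domain.gov.uk/microsite/consultation-response/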
T
Takedown – The process of removing public access to resources hosted in the web archive.
U
URL/URLs – Stands for “Uniform Resource Locator”. The address of a web resource, for example: https://www.cps.gov.uk/ or https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1066250/covid-private-testing-providers-general-testing-5-4-22.csv.
User-agent (string) – A line of text that identifies which browser (or crawler) is being used, its version, and the operating system it is running on.
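For example, a desktop Chrome browser on Windows might identify itself with a user-agent string such as:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36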
X
XML Sitemap – A file that lists the pages within a website in XML (eXtensible Markup Language) format to help crawlers index them.
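For illustration, a minimal sketch of an XML sitemap following the sitemaps.org protocol; the page URLs are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.sitename.gov.uk/</loc>
        <lastmod>2022-04-05</lastmod>
      </url>
      <url>
        <loc>https://www.sitename.gov.uk/about/</loc>
      </url>
    </urlset>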