If your website is compliant with all of the following requirements, it will be much easier to successfully archive it intact.
This guidance is designed for website developers. If you do not manage your website’s technical set-up, you will need to pass this information on to your web developer. You can provide them with a link to this page and a copy of our archive compliance developer checklist.
Preparing for an archive crawl
Read through the full guidance here to identify any areas that are likely to cause issues, and try to fix them in advance where you can. If you want to run a very basic check on how compatible your website is with archiving, you can use the Archive Ready tool.
Your site must allow access to MirrorWeb’s crawlers, which will ignore robots.txt. Blocking or slowing down our crawlers with anti-robot technology will mean that the website cannot be successfully archived. Our crawlers can be identified through a specific user-agent string or by a static IP address, which our team can provide for you.
If there is any content on your site that cannot be captured through web crawling (see limitations), please make sure you tell your own information management team so they can assess whether it should be preserved through other methods as part of the public record.
HTML versions and HTTP protocols
Video, infographic, audio and multimedia content
Documents and file sharing
Site structure and sitemaps
Links and URLs
Dynamically-generated content and scripts
Interactive graphs, maps and charts
Menus, search and forms
Database and lookup functions
POST requests and Ajax
W3C compliance
Copyright statements and licensing
Website backups (as files)
Domain management for closing websites
Intranet and secured content areas
HTML versions and HTTP protocols
We can archive and replay all versions of HTML to date. Present everything on your website using either the HTTP or HTTPS protocol.
Video, infographic, audio and multimedia content
- Streaming video or audio cannot be captured and should also be made accessible via progressive download, over HTTP or HTTPS, using absolute URLs, where the source URL is not obscured.
- Link to audio-visual (AV) material with absolute URLs instead of relative URLs (a markup sketch follows this list).
For example: https://www.mydomain.gov.uk/video/video1.mp4 rather than ../video/video1.mp4 or video/video1.mp4
- Provide transcripts for all audio and video content.
- Provide alternative methods of accessing information held in infographics, videos and animations.
- We cannot usually capture content which is protected by a cross domain file – this is most often multimedia content which is embedded in web pages but hosted on another domain. If this is the case for any of your content, please ensure that it is made available to the crawler in an alternative way.
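For example, a minimal markup sketch (the domain, file names and transcript page are illustrative, not prescribed) of a video offered as a progressive download over HTTPS, using an absolute URL and linking to a transcript:

<video controls>
  <source src="https://www.mydomain.gov.uk/video/video1.mp4" type="video/mp4">
  <a href="https://www.mydomain.gov.uk/video/video1.mp4">Download the video (MP4)</a>
</video>
<p><a href="/transcripts/video1-transcript">Read the transcript of this video</a></p>

Because the source URL is absolute and is not obscured by a player or streaming protocol, the crawler can request the .mp4 file directly.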
Documents and file sharing
- We cannot capture file content hosted on web-based collaborative platforms and file sharing services such as SharePoint, Google Docs and Box. You should make these files available in ways which are accessible to the crawler – for example as downloadable files hosted on the root domain. If you have files hosted on collaborative platforms, please make sure you tell us about them as soon as possible so that we can advise you.
Site structure and sitemaps
- Include a human-readable HTML sitemap in your website. It makes content more accessible, especially when users are accessing the archived version, as it provides an alternative to interactive functionality.
- Have an XML sitemap. This really speeds up our ability to capture and quality assure our website archives (a minimal sitemap sketch follows this list).
- Where possible, keep all content under one root URL. Any content hosted under root URLs other than the target domain, sub-domain or microsite is unlikely to be captured. Typical examples include documents hosted in the cloud (such as amazonaws.com), newsletters hosted by services such as Mailchimp, and any service that links out through an external domain. If this is not possible, you will need to provide a list of links as an XML sitemap or supplementary URL list before the crawl is launched.
- If using pagination (../page1, ../page2 and so on), you will also need to include all URLs from that pagination structure in your HTML or XML sitemap, as the crawler can sometimes misinterpret recurrences of a similar pattern as a crawler trap and therefore may only crawl to a limited depth.
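For example, a minimal XML sitemap sketch (the domain and paths are illustrative) listing ordinary pages, each page of a pagination structure and a document that is not otherwise linked to:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mydomain.gov.uk/</loc></url>
  <url><loc>https://www.mydomain.gov.uk/news/page1</loc></url>
  <url><loc>https://www.mydomain.gov.uk/news/page2</loc></url>
  <url><loc>https://www.mydomain.gov.uk/documents/annual-report.pdf</loc></url>
</urlset>

Publish the sitemap at a predictable location (for example /sitemap.xml) so that it can be used to seed and quality assure the crawl.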
Links and URLs
- Internal links, other than those for AV content, should be relative and should not use absolute paths. This means that the links are based on the structure of the individual website, and do not depend on accessing an external location where files are held. For example: <img src="/images/1.jpg" alt="logo"/> or <img src="images/image1.jpg" alt="logo"/> rather than <img src="https://www.anysite.gov.uk/images/image1.jpg" alt="welcome logo"/>.
- ‘Orphaned’ content (content that is not linked to from within your website) will not be captured. You will need to provide a list of orphan links as an XML sitemap or supplementary URL list before the crawl is launched.
- Links in binary files attached to websites (links included in .pdf, .doc, .docx, .xls, .xlsx, .csv documents) cannot be captured. All resources linked to in these files must also be linked to on simple web pages or you will need to provide a list of these links as an XML sitemap or supplementary URL list before the crawl is launched.
- Where possible, use meaningful URLs such as https://mywebsite.com/news/new-report-launch rather than https://mywebsite.com/5lt35hwl. As well as being good practice, this can help when you need to redirect users to the web archive.
- Avoid using dynamically-generated URLs.
- There are other ways to redirect users to content in the web archive, as outlined by the Government Digital Service.
Dynamically-generated content and scripts
- Client-side scripts should only be used if it is determined that they are most appropriate for their intended purpose.
- Make sure any client-side scripting is publicly viewable over the internet – do not use encryption to hide the script.
- As much as you can, make sure your code is maintained in readily-accessible separate script files (for example, files with a .js extension) rather than coded directly into content pages, as this will help diagnose and fix problems.
- Avoid using dynamically-generated date functions. Use the server-generated date, rather than the client-side date. Any dynamically-generated date shown in an archived website will display the date on which the archive is viewed, rather than the date the page was captured.
- Avoid using dynamically-generated URLs.
- Dynamically-generated page content using client-side scripting cannot be captured. This may affect the archiving of websites constructed in this way. Wherever possible, the page design should make sure that content is still readable and links can still be followed, for example by using the <noscript> element (see the sketch after this list).
- When using JavaScript to design and build, follow a ‘progressive enhancement’ approach. This works by building your website in layers:
- Code semantic, standards-compliant (X)HTML or HTML5
- Add a presentation layer using CSS
- Add rich user interactions with JavaScript
- This is an example of a complex combination of JavaScript, which will cause problems for archive crawlers, search engines, and some users: javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')
- This is a preferred example of a well-designed URL scheme with simple links: <a href="content/page1.htm" onclick="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSectionItems','Page$1')">1</a>
- Always design for browsers that don’t support JavaScript or have disabled JavaScript.
- Provide alternative methods of access to content, such as plain HTML.
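For example, a minimal sketch (the file and page names are illustrative) of the progressive enhancement approach: content and navigation work as plain HTML, the presentation and behaviour layers sit in separate .css and .js files, and a <noscript> element explains what is unavailable without JavaScript:

<!DOCTYPE html>
<html lang="en">
<head>
  <title>Publications</title>
  <link rel="stylesheet" href="/css/site.css">
  <script src="/js/enhancements.js" defer></script>
</head>
<body>
  <!-- Content and links are readable and followable without any script -->
  <ul>
    <li><a href="content/page1.htm">Page 1</a></li>
    <li><a href="content/page2.htm">Page 2</a></li>
  </ul>
  <noscript>
    <p>Interactive filtering is unavailable without JavaScript; all pages are listed above.</p>
  </noscript>
</body>
</html>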
Interactive graphs, maps and charts
- Avoid interactive content where possible as we often have difficulty archiving these resources and retaining their functionality.
- If you have vital interactive graphs, maps or charts please let us know as we may be able to attempt to capture them using experimental technology.
- In all cases, if content of these types cannot be avoided, please provide alternative ‘crawler friendly’ methods for accessing and displaying it. Where visualisations are used, the underlying data itself should always be accessible in as simple a way as possible – for example in a .txt or .csv file (see the sketch after this list).
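For example, a minimal sketch (the chart container and file names are illustrative) of pairing an interactive visualisation with a crawler-friendly link to the underlying data:

<div id="spending-chart">
  <!-- Interactive chart drawn here by client-side script; it may not archive -->
</div>
<p><a href="/data/spending-2023.csv">Download the data behind this chart (CSV)</a></p>

The .csv file is a simple, linked resource, so it can be captured and will remain usable in the archive even if the chart itself does not replay.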
Menus, search and forms
- Use static links, link lists and basic page anchors for menus and navigation elements, rather than JavaScript and dynamically-generated URLs (see the sketch after this list).
- Any function that requires a ‘submit’ operation, such as dropdown menus, forms, search and checkboxes, will not archive well. Always provide an alternative method to access this content wherever possible, and make sure you provide a list of links that are difficult to reach as an XML sitemap or supplementary URL list before the crawl is launched.
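For example, a minimal navigation sketch (the paths are illustrative) using static relative links rather than script-generated menu items:

<nav>
  <ul>
    <li><a href="/about/">About us</a></li>
    <li><a href="/news/">News</a></li>
    <li><a href="/publications/">Publications</a></li>
    <li><a href="/contact/">Contact us</a></li>
  </ul>
</nav>

Each destination is an ordinary URL the crawler can follow, whereas a menu built at runtime by JavaScript may expose no followable links at all.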
Database and lookup functions
- If your site uses databases to support its functions, these can only be captured in a limited fashion. We can capture snapshots of database-driven pages if these can be retrieved via a query string, but cannot capture the database used to power the pages.
For example, we should be able to capture the content generated at https://www.mydepartment.gov.uk/mypage.aspx?id=12345&d=true since the page will be dynamically generated when the web crawler requests it, just as it would be for a standard user request. This works where the data is retrieved using an HTTP GET request, as in the above example.
POST requests and Ajax
- We can’t archive content that relies on HTTP POST requests, since no query string is generated. Using POST parameters is fine for certain situations such as search queries, but you must make sure that the content is also accessible via a query string URL that is visible to the crawler, otherwise it will not be captured (see the form sketch after this list).
- We are unlikely to be able to capture and replay any content which uses HTTP POST requests, Ajax or similar.
- Always provide an alternative method to access this content wherever possible, and make sure you provide a list of links that are difficult to reach as an XML sitemap or supplementary URL list before the crawl is launched.
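For example, a minimal search form sketch (the path and parameter name are illustrative): because it uses the GET method, submitting it produces a query string URL such as /search?q=budget that the crawler can request directly, whereas the same form with method="post" would generate no capturable URL:

<form action="/search" method="get">
  <label for="q">Search this site</label>
  <input type="text" id="q" name="q">
  <button type="submit">Search</button>
</form>

The crawler will not fill in the form itself, so any results pages you want preserved should still be listed in your XML sitemap or supplementary URL list.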
W3C compliance
In most cases, a website that has been designed to be accessible in line with W3C guidelines should also be easy to archive.
Always use simple, standard web techniques when building a website. There are few limits to a website builder’s creativity when using the standard World Wide Web Consortium (W3C) recommendations. Using overly complex and non-standard website design increases the likelihood of problems for users, for web archiving, and for search engine indexing.
Copyright statements and licensing
All content stored in the web archive must either be Crown copyright or have appropriate licences in place with third-party copyright holders to allow us to copy it and make it available.
Your website must have a clear copyright statement, as this will make it clear to future users who owns the copyright and under what terms it may be re-used, for example under the Open Government Licence. This applies to all content on your website. Any media or copy whose copyright belongs to a third party must be clearly marked.
The Government Digital Service has produced image copyright guidance, and you should also read our statement regarding re-use of the information contained within our archive. You must inform us of any content on your website that does not comply with this guidance.
Website backups (as files)
We cannot accept ‘dumps’ or ‘backups’ of websites from content management systems, databases, on hard drives, CDs or DVDs or any other external media into the archive. Only snapshots directly crawled by our system are accepted.
Domain management for closing websites
You need to retain ownership of a closing website’s domain after the final crawl of the website has been made, in order to:
- Prevent cybersquatting.
- Allow redirects to the web archive for user reference and web continuity.
Intranet and secured content areas
We cannot archive content that is protected behind a login, even if you provide us with the login details.
If content is hosted behind a login because it is not appropriate for it to be publicly accessible, it should be managed there until its sensitivity falls away and it can then be published to the open website. Alternatively, liaise with your information management team about whether it should be preserved through other methods as part of the public record.