The National Archives
The cyber cafe

Researching the semantic web

A large proportion of the World Wide Web that we see today is classed as the 'syntactic' web, and has developed around the notion of sharing documents. 

The underlying functionality is that of displaying characters, shapes and colours in locations on a Web page, with links to other documents. It helps you connect to the information you are seeking, but the human, in interpreting the information on the page, does the majority of the work.

What is the semantic web?

Put at its simplest, the semantic web is the web of data.

More formally, according to the World Wide Web Consortium (1), the semantic web is about two things:

  • Common formats for integration and combination of data drawn from diverse sources, where the original Web mainly concentrated on the interchange of documents.
  • Language for recording how the data relates to real world objects that allows a person, or a machine, to start off in one database, and then move through an unending set of databases which are connected not by wires but by being about the same thing.

The semantic web is about creating a web of linked data based on meaning, or semantics (hence the name), where data is increasingly independent of any one application or document. Identifiers, called URIs (2), are attached to individual items of data, enabling links between relevant pieces of information to be defined. This helps computers to do more 'intelligent' processing, which in turn can help users to find other relevant information more effectively.

As Tim Berners-Lee, inventor of the World Wide Web, puts it:

"The semantic web isn't just about putting data on the web.  It is about making links, so that a person or machine can explore the web of data.  With linked data, when you have some of it, you can find other, related, data."
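
The idea of exploring a web of data by following links can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a real RDF store: the `example.org` URIs, the vocabulary terms and the `describe` and `related` helpers are all hypothetical. Each fact is held as a (subject, predicate, object) triple, and any object that is itself a described URI can be followed to find more data.

```python
# A tiny in-memory "web of data". Every URI below is invented for illustration.
EX = "http://example.org/"

triples = [
    (EX + "notice/1", EX + "terms/publishedIn", EX + "publication/gazette"),
    (EX + "notice/1", EX + "terms/about", EX + "company/acme"),
    (EX + "company/acme", EX + "terms/name", "Acme Ltd"),
    (EX + "publication/gazette", EX + "terms/name", "The London Gazette"),
]


def describe(subject, data):
    """Return every (predicate, object) pair recorded for a subject."""
    return [(p, o) for s, p, o in data if s == subject]


def related(subject, data):
    """Follow links one step: return the objects of this subject's
    statements that are themselves described somewhere in the data."""
    subjects = {s for s, _, _ in data}
    return [o for _, o in describe(subject, data) if o in subjects]


# Starting from one notice, discover the things it links to.
for uri in related(EX + "notice/1", triples):
    print(uri, describe(uri, triples))
```

Because the notice, the company and the publication each have their own identifier, having one piece of data is enough to reach the others, which is the point of the quotation above.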

The semantic web will evolve over time. It won't replace the web of documents we have now, but will grow beside it. We have the technical standards we need in order to realise the semantic web. The next stage is to encourage their widespread adoption.

Enabling re-use by embedding semantics in web pages

With new semantic web standards it is now possible to mark up textual information inside web pages, blurring the distinction between documents and data on the web. The National Archives has been exploring the use of semantic markup inside XHTML (3) documents, using a new language called RDFa (4), in order to facilitate access to, and use and re-use of, public sector information.
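
To give a feel for what markup embedded in a page looks like, here is a minimal sketch in Python. The XHTML fragment, the `ex:` vocabulary and the `extract_statements` function are assumptions made for illustration; they are not taken from The London Gazette. The reader below is deliberately simplified and is not a conforming RDFa processor: it does not expand prefixes such as `ex:` into full URIs, and it ignores nesting rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical notice fragment: `about` names the thing being described,
# `property` names the relationship, `content` gives a machine-readable value.
page = """
<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:ex="http://example.org/terms/"
     about="http://example.org/notice/1">
  <p>Notice issued on
    <span property="ex:issued" content="2008-05-06">6 May 2008</span>
    concerning
    <span property="ex:subject">Acme Ltd</span>.
  </p>
</div>
"""


def extract_statements(xhtml):
    """Pair every element carrying a `property` attribute with the
    `about` attribute found on the root element of the fragment."""
    root = ET.fromstring(xhtml)
    subject = root.get("about")
    statements = []
    for element in root.iter():
        predicate = element.get("property")
        if predicate is None:
            continue
        # Prefer the `content` attribute; fall back to the visible text.
        value = element.get("content") or "".join(element.itertext()).strip()
        statements.append((subject, predicate, value))
    return statements


for statement in extract_statements(page):
    print(statement)
```

The same fragment serves both audiences: a person reads "6 May 2008", while a program reads the date from the `content` attribute.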

Apart from the data held in databases, much of the interesting information that people want to re-use is semi-structured. The creation and dissemination of such semi-structured information is widespread throughout the public sector. A notable example is the Government's official journal, The London Gazette.

An important part of The National Archives' online strategy is to 'enable'. With The London Gazette this means releasing the information in a re-usable form so that others can be creative in developing or extending new services for their audiences. It is here that so-called 'semantic' technologies can assist, as they allow data to have its own identity, independent of the application or publication in which it was originally created or featured. As a result, re-used data can link back to its original source, making that source easier to discover, and links can be planted in future documents for the same reason.

This work lays the foundation for the development of a new official publishing strategy for the government. Any time legislation says that information must be published in The London Gazette, it will in effect be ensuring that the information is made publicly available, in a consistent way and in a re-usable form.

Our experience as one of the first adopters of RDFa has been that adding semantics to an existing website is not as straightforward as might be hoped. The particular challenges we are exploring are:

  • Overcoming deficiencies of an original web address (URI) design, by creating a new URI scheme to provide identifiers that could be used in RDF triples (5). This scheme then has to be integrated into the existing website.
  • Ensuring that the content, which is sourced from multiple places, results in valid XHTML with embedded RDFa
  • Finding reliable methods of retrospective automated markup of the notices
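
The three challenges above can be illustrated together with a rough Python sketch. Everything in it is an assumption made for the example: the `example.org` URI pattern, the `ex:date` property and the `notice_uri` and `mark_up` functions are hypothetical and do not reproduce the scheme or the methods actually used for The London Gazette. The sketch mints an identifier for a notice, wraps any date it recognises in an RDFa-annotated `span`, and then checks that the result still parses as XML. Note that parsing only shows the fragment is well-formed; checking that it is valid XHTML would need a separate validator.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical URI pattern: one stable identifier per notice.
BASE = "http://example.org/id/notice/"

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
# Matches dates written like "6 May 2008".
DATE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4})\b")


def notice_uri(issue, notice_number):
    """Mint an identifier from an issue number and a notice number."""
    return f"{BASE}{issue}/{notice_number}"


def mark_up(text, uri):
    """Escape the plain text, then wrap each date in an annotated span."""
    def tag(match):
        day, month, year = match.groups()
        iso = f"{year}-{MONTH_NAMES.index(month) + 1:02d}-{int(day):02d}"
        return f'<span property="ex:date" content="{iso}">{match.group(0)}</span>'

    body = DATE.sub(tag, escape(text))
    return ('<p xmlns="http://www.w3.org/1999/xhtml" '
            f'xmlns:ex="http://example.org/terms/" about="{uri}">{body}</p>')


fragment = mark_up("Meeting of creditors & members on 6 May 2008.",
                   notice_uri(58000, 1234))
ET.fromstring(fragment)  # raises ParseError if the fragment is not well-formed
print(fragment)
```

Escaping the text before adding the markup matters when content comes from several sources: a stray `&` or `<` in one notice would otherwise make the whole page fail to parse.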

These issues, and the methods used to overcome them, have been outlined in a paper presented at the XTech Conference this year (6). This work is proving crucial to informing the Government's more general web strategy, as the Power of Information agenda pushes re-use ever higher in terms of priority.


References:
1 The World Wide Web Consortium (W3C) is an international standards organisation that makes decisions on Web standards.

2 URI (Uniform Resource Identifier): in simple terms, a full Web address, e.g. 'http://en.wikipedia.org/wiki/URL'. This enables a Web page to be located.

3 XHTML is a combination of HTML (HyperText Markup Language), the standard language that most Web pages are written in, and XML (eXtensible Markup Language), which facilitates the sharing of structured data across different information systems.

4 RDFa (Resource Description Framework in Attributes): a set of extensions to XHTML that allow you to annotate markup with semantics.

5 'Triples' in RDF consist of three sequential elements that define a relationship between two objects, e.g. man (subject) - loves (predicate) - dog (object).

6 Tennison J & Sheridan J (2008), SemWebbing The London Gazette, XTech 2008 (http://2008.xtech.org/public/schedule/detail/528).