
Semantic Web

Research Report by Angus Forbes
(created 10/6/06; version 1.0)
[Status: Draft]

Related Categories: Software / Coding Innovations, Search and Data Mining Innovations

Original Object for Study description

The World Wide Web Consortium (W3C) uses the term “the semantic web” as an umbrella identifier for a number of initiatives that enable developers and archivists to add rich, meaningful metadata to digital resources. According to the W3C, the major reason for these initiatives to tag information with explicit meaning is to make “it easier for machines to automatically process and integrate information” [1]. The semantic web adds depth to the existing web protocols running over the application layer of the internet without involving any changes to its more basic architecture. Currently, the main feature that organizes the web is the “link”: any document (or resource) can link to any other. Additionally, each link is coupled with a method (or protocol) to present the resource to the user or application that followed that link (e.g., by clicking on it). That is, the web in one sense is completely non-hierarchical and unstructured. The only structural meaning of a link between two web pages (or other resources) is simply that one of them refers to the other (and possibly vice versa); all other meanings are entirely contextual and must be interpreted by humans. The goal of the semantic web is to provide a richer structure of relationships that formally defines some of the meanings that link resources and, in particular, to provide an extensible, uniform structure that can be easily interpreted by search engines and other software tools. The W3C describes a number of potential practical applications that could make use of semantic web technologies, including enhanced search engines for multimedia collections, automated categorization, intelligent agents, web service discovery, and content mapping between disparate electronic resources [2].

Much of the work on the semantic web is in fact syntactic in nature. The W3C presents a series of syntaxes for the semantic web, including the Resource Description Framework (RDF) and the Web Ontology Language (OWL). The basic structure of a semantic relationship is commonly represented by a conceptual graph of RDF descriptors embedded within XML tags (XML is described in a separate Transliteracies report). The RDF language defines a simple data structure (called a “triple”) made up of a subject, an object, and a relationship between the subject and the object. The subject element (a document, web page, file, etc.) is generally considered a “resource,” denoted with a URI (Uniform Resource Identifier). The relationship element is also denoted by a URI, which points to a definition of the relationship; this avoids ambiguity for software tools and encourages standardization of common relationships within particular contexts. The object could be yet another URI, or perhaps plain text. Note that the relationship between a subject and an object is not necessarily reversible. The W3C’s RDF Primer describes a simple (if verbose) example of RDF nodes providing more detailed structure to a web page:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:exterms="http://www.example.org/terms/">
          <rdf:Description rdf:about="http://www.example.org/index.html">
                    <exterms:creation-date>August 16, 1999</exterms:creation-date>
                    <dc:language>en</dc:language>
                    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
          </rdf:Description>
</rdf:RDF>

In the above example, the resource for the index page of the www.example.org site carries metadata about the date of the web page’s creation, the language it was written in, and the person who created it. The first line declares XML namespaces: one points to the definition of RDF itself, and the others point to vocabularies of well-defined terms, not included in the RDF specification, that describe relationships. The tag beginning with “exterms” uses the term “creation-date” to provide some temporal context for the page. The tags beginning with “dc” use well-defined terms from the Dublin Core metadata vocabulary. The tag “dc:creator” points to another resource, which is in fact a unique identifier for the staff member who created the page. Thus the “triple” for that information is subject: the web page, relationship: creator, and object: the staff member [3].
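
The three statements in the example can also be sketched as plain (subject, predicate, object) tuples. The Python fragment below is only an illustration and is not taken from the RDF Primer: the helper name `objects_of` is hypothetical, and the predicate URIs are assumed to be formed by appending each tag name to the namespace declared for its prefix.

```python
# Namespace prefixes declared on the rdf:RDF element of the example.
DC = "http://purl.org/dc/elements/1.1/"
EXTERMS = "http://www.example.org/terms/"

PAGE = "http://www.example.org/index.html"
STAFF = "http://www.example.org/staffid/85740"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (PAGE, EXTERMS + "creation-date", "August 16, 1999"),
    (PAGE, DC + "language", "en"),
    (PAGE, DC + "creator", STAFF),
]


def objects_of(subject, predicate, statements):
    """Return every object linked to `subject` by `predicate` (hypothetical helper)."""
    return [o for s, p, o in statements if s == subject and p == predicate]


creators = objects_of(PAGE, DC + "creator", triples)
print(creators)

# The relationship is directional: the staff member is not a subject here.
print(objects_of(STAFF, DC + "creator", triples))
```

Asking for the creator of the page returns the staff identifier, while asking the same question with the staff identifier as the subject returns nothing, which reflects the point that a relationship is not necessarily reversible.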

In computer science, an organizational structure of relationships is called an “ontology.” The Web Ontology Language (OWL) attempts to define a formal semantics in which to classify resources as objects of a particular type (or multiple types) with various properties. OWL is similar to RDF but provides a more explicit and flexible “vocabulary” for describing relationships between resources, including various terms from set theory that indicate a level of membership in a particular category. OWL makes a conceptual distinction between a “class” and an “individual.” According to the W3C, “classes should correspond to naturally occurring sets of things in a domain of discourse, and individuals should correspond to actual entities that can be grouped into these classes” [4]. That is, OWL is intended to model information in a way that resembles how humans group things intuitively. This information can be of any kind: a concept, an item, a relationship (or lack of one). Furthermore, the information can be nested in such a way that a resource is both a member of a class and a class to other individuals. This can potentially cause problems, such as the inadvertent creation of infinite loops. For instance, A can refer to B, which refers back to A, ad infinitum. Because of this issue, the W3C has proposed three sublanguages of the OWL specification with increasing expressiveness: OWL Lite, OWL DL (Description Logics), and OWL Full. For the full range of relationship mapping, precautions need to be taken to avoid looping. The OWL language provides a very generic syntax from which any number of classes and concepts can be defined. The fundamental goal of the OWL language is to enable different communities using particular digital resources to create their own ontologies and to provide a mechanism to interact with other communities. Creating an ontology, however, is rather involved, and much of that effort is “devoted to hooking together classes and properties in ways that maximize implications” [5].
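
The distinction between classes and individuals, and the looping problem, can be illustrated with a toy model. The Python sketch below is not OWL syntax and not an OWL reasoner; the class names, the individual, and the helper functions `ancestors` and `classes_of` are all hypothetical. It only shows how an individual inherits membership through a class hierarchy and how a visited set keeps a cyclic definition from looping forever.

```python
# Toy model only: this is NOT OWL syntax or an OWL reasoner.
# subclass_of maps a class name to its direct parent classes (hypothetical data).
subclass_of = {
    "Novel": ["Book"],
    "Book": ["Document"],
    "Document": [],
}

# instance_of maps an individual to the classes it is asserted to belong to.
instance_of = {"MobyDick": ["Novel"]}


def ancestors(cls, hierarchy):
    """Collect all superclasses of `cls`, guarding against cyclic definitions."""
    seen = set()
    stack = list(hierarchy.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue  # already visited: skipping it prevents an infinite loop
        seen.add(parent)
        stack.extend(hierarchy.get(parent, []))
    return seen


def classes_of(individual, membership, hierarchy):
    """Return asserted classes plus every class implied by the hierarchy."""
    result = set()
    for cls in membership.get(individual, []):
        result.add(cls)
        result |= ancestors(cls, hierarchy)
    return result


print(sorted(classes_of("MobyDick", instance_of, subclass_of)))

# A cyclic definition (A refers to B, which refers back to A) still terminates.
cyclic = {"A": ["B"], "B": ["A"]}
print(sorted(ancestors("A", cyclic)))
```

In this sketch the individual is asserted to belong to only one class, and the other memberships follow from the hierarchy, which is a small-scale version of the “implications” mentioned above.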

A more complete discussion of the technologies associated with the semantic web can be found at the W3C website [6], in a series of IETF (Internet Engineering Task Force) RFCs (Requests for Comments) [7], or on Wikipedia [8]. Many of the details of the specifications are still being worked out.

A criticism of the W3C’s proposals is that it will be difficult to convince developers to add further layers of complexity to current internet documents. Moreover, it will be almost impossible to standardize the types of metadata that are used to catalog relationships [9], [10]. There is an interesting tension between the “chaos” of the internet and the desire to create new technologies that provide some type of order or cataloging. Most likely, the very reason for the ubiquity of the web is the flimsiness of the relationships.

Defenders of the proposed specifications counter that, after a simple shift in perspective (thinking of web pages in terms of resources that are defined by relationships), the approach is straightforward and the benefits outweigh the perceived complexity [11]. Secondly, while OWL does standardize a set of tools to encode relationships, it does not actually attempt to encode the relationships themselves. Rather, additional ontologies are meant to be created and shared among different groups of users. This brings up another question: since anyone can create ontologies, and since these ontologies in some sense define meaning, who will have authority and responsibility for the creation of meaning? The W3C avoids the issue, simply expecting ontologies to adapt and evolve. Just as different entities compete to be placed atop a keyword search on Google, the adoption of the Semantic Web would lead to similar competition over which ontologies are appropriate. Perhaps there will be a need for meta-ontologies defining which ontologies are appropriate for particular situations.

In the last century, many people have conceived of ideas involving networks of information. Vannevar Bush’s “As We May Think” (1945) is regarded as a very influential paper. Among other things, he describes a machine (“the Memex”) which would be able to follow “associative trails” of information stored on microfilm. However, he does not outline any specific practical ideas which resemble the structure of the internet.

Douglas Engelbart, the inventor of the mouse, realized that links in electronic documents could contain more information than simply pointing to other documents. For instance, in “Augmenting Human Intellect” (1962), he describes “capability hierarchies” where a program or human user can “specify different display or manipulative treatment for the different types [of links]”. A prototype of some of the ideas outlined in this paper has been implemented at HyperScope.org. This prototype, however, consists mostly of constructing a detailed index to an electronic document such that a user is able to specify the level of detail. It also includes a pop-up box where users can jump to different indexes of the document more quickly (provided that they know the indexing scheme).

Ted Nelson, a pioneer in information organization, coined the word “hypertext” in 1963 and published an influential book called Computer Lib/Dream Machines in 1974, which explores interconnected information. Nelson’s work is often cited as a precursor to much of the work done by Tim Berners-Lee and the W3C. In particular, many aspects of the semantic web echo ideas in his prototype project Xanadu, which defines a field he names “TransLiterature.” In his online paper, “Xanalogical Structure, Needed Now More than Ever: Parallel Documents, Deep Links to Content, Deep Versioning and Deep Re-Use,” he positions his projects explicitly against the internet as it now exists:

The World Wide Web was not what we were working toward, it was what we were trying to prevent. The Web displaced our principled model with something far more raw, chaotic and short-sighted [12].

He instead envisions an electronic medium that can “represent digitally the literary forms of connection which could not be represented before.” Important concepts to the Xanadu project are “transpointing windows” and “transclusion.” Transpointing windows are windows that highlight differences in different versions (“origin beams”) of a document. The related concept of “transclusion” involves nesting copies of one document inside another, maintaining a pointer to the original document so that “literary” analysis can be performed throughout the different levels of editing and recomposition [13]. That is, rather than “linking” to another document, the new document actually contains “the same thing knowably and visibly in more than one place.”

However, Nelson’s work now seems to be somewhat outdated and impractical. He also takes credit for relatively simple ideas which have actually been implemented by a host of other people in various forms. For example, “transpointing windows” software could be said to be a simplification of functional versioning software, such as the popular CVS (Concurrent Versions System), or the editing tools in MediaWiki (the software behind Wikipedia), or even the Unix shell command “diff.” The concept of “transclusion” is somewhat similar to the W3C proposal to have digital resources be made up of a series of URIs to other resources, minus the explicit metadata meant to encode relationships.
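
As a rough illustration of the kind of version comparison that “diff” performs, the following Python sketch uses the standard library’s `difflib` module. It is not Nelson’s design, and the two document versions are invented for the example.

```python
import difflib

# Two hypothetical versions of the same short document.
version_1 = ["The web is a set of linked documents.", "Links carry no meaning."]
version_2 = ["The web is a set of linked resources.", "Links carry no meaning."]

# unified_diff yields header lines followed by lines prefixed with
# "-" (only in the first version), "+" (only in the second), or " " (shared).
diff_lines = list(
    difflib.unified_diff(version_1, version_2, fromfile="v1", tofile="v2", lineterm="")
)
for line in diff_lines:
    print(line)
```

The output marks the changed sentence once as removed and once as added, and leaves the unchanged sentence as shared context.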

Although ideas about interconnected documents had existed for some time, Tim Berners-Lee synthesized the fundamental ideas and developed practical tools to make use of them. He created much of the World Wide Web as it exists today, including the HTTP protocol, HTML, and the first browser. He currently directs the W3C and its project to standardize and promote the various semantic web initiatives.

Research Context:
The semantic web has obvious connections with data mining and computational linguistics (or natural language processing). However, the semantic web is a set of specifications that allow humans to make it simpler (by adding metadata) for machines to reason about relationships. In other words, there is no attempt to model actual human thinking or categorization. A more apt field for comparison is perhaps library science or information management. Indeed, despite a large number of schemes to organize digital information, the proliferation of digital formats and content is overwhelming. Perhaps the advent of community-defined ontologies (written in a standardized language such as OWL) will allow for more specialized retrieval tools and more appropriate organization of digital resources.

Projects such as Friend of a Friend and MusicBrainz use RDF tags to encapsulate relationships, but so far no applications utilizing the semantic web initiatives are especially compelling. An interesting and related project, ConceptNet (described in another Transliteracies report), also encodes semantic meaning.

Evaluation of Opportunities/Limitations for the Transliteracies Topic:

We already have an ideal tool with which to impart meaning: language. However, we have not yet created advanced software which can effectively use or amplify language. There is a very interesting opportunity to utilize the tools of the semantic web to encode the complex, multifaceted, intellectual relationships and meanings that are found in a novel or a well-written article. Despite the apparent complexity of the semantic web, it is still clearly a very simplistic model of meaning creation. Most examples of RDF tags are used to encode only the most basic level of categorization. Even the speculative scenarios of the possible usefulness of the semantic web are generally concerned with more efficient commerce or faster access to information. Meaning is contextual, layered, polysemous, and often contradictory. Literary analysis examines these contexts, teases out contradictions, and examines broader cultural implications. Is there a way to merge the tools of textual and literary analysis with the syntactic tools that the semantic web offers? Can experts of language participate in the creation and application of the next generation of perhaps one of the most transformational technologies in the history of civilization?


[1] OWL Features, http://www.w3.org/TR/2004/REC-owl-features-20040210
[2] OWL Use Cases and Requirements, http://www.w3.org/TR/2004/REC-webont-req-20040210
[3] RDF Primer, http://www.w3.org/TR/2003/WD-rdf-primer-20030123
[4] OWL Guide, http://www.w3.org/TR/2004/REC-owl-guide-20040210
[5] OWL Guide, http://www.w3.org/TR/2004/REC-owl-guide-20040210
[6] W3C Semantic Web Activity, http://www.w3.org/2001/sw/
[7] IETF Request For Comments Index, http://www.ietf.org/rfc.html
[8] Wikipedia: Semantic Web, http://en.wikipedia.org/wiki/Semantic_Web
[9] Zambonini, Dan. “The 7 (f)laws of the Semantic Web”.
[10] Spivak, Nova. “The Ontology Integration Problem”, http://novaspivack.typepad.com/nova_spivacks_weblog/2006/08/the_ontology_in.html
[11] Sheth, Amit. “Marrying Social Media with Semantic Media”, http://lsdis.cs.uga.edu/~amit/blog
[12] Nelson, Theodor. “Xanalogical Structure…”, http://xanadu.com.au/ted/XUsurvey/xuDation.html
[13] Whitehead, Jim. “Orality and Hypertext: An Interview with Ted Nelson”, http://www.ics.uci.edu/~ejw/csr/nelson_pg.html

Resources For Further Study:

  • Berners-Lee, Tim and Fischetti, Mark. “Weaving the Web: Origins and Future of the World Wide Web.” Orion Business. 1999.
  • Packer, Randall and Jordan, Ken. multiMEDIA: From Wagner to Virtual Reality. Norton. 2001
  • Engelbart, Douglas. “Augmenting Human Intellect: A Conceptual Framework”. Summary Report AFOSR-3233. Stanford Research Institute. (1962).
  • Bush, Vannevar. “As We May Think,” Atlantic Monthly. v. 176/1 (1945) 101-108.
  • The World Wide Web Consortium
  • Ted Nelson’s Transliterature
  • Wikipedia: OWL
  • Wikipedia: Memex
  • Wikipedia: Ontology
  • Ted Nelson — Hypertext, Xanadu, Web History
  • HyperScope
  • The Friend of a Friend Project
  • MusicBrainz
  • Dublin Core Metadata Initiative
  •   tl, 10.06.06

    2 Responses to Semantic Web

    1. permatasari says:

      There is an Ontogloss, ontology based annotation tool that uses pre-defined concepts in ontology to mark-up a document. Just share!
