Internet Archive

(created 2/12/06; version 1.0)

Original Object for Study description

Summary:
Launched in 1996, the Internet Archive is a San Francisco-based, non-profit organization that curates and maintains an accessible online archive of a vast number of multimedia objects published on the web since the popularization of the Internet in the mid-nineties.

A self-proclaimed repository of knowledge, the Internet Archive aims to create a reliable database of online works and to give, in essence, the web a stable “memory” that will endure beyond any individual web page’s date of expiration. The Internet Archive additionally seeks to keep this access open and free to the public. While the ten-year-old project remains a work in progress, its continuing development has already raised many intriguing challenges in relation to what some have argued is the transitory and ephemeral nature of online reading sources.

Description:
Currently housed in San Francisco’s Presidio, the Internet Archive was founded in 1996 by Brewster Kahle, an MIT graduate and developer of the WAIS (Wide Area Information Server) system, an early searching technology that allowed a user to search digitized texts that were not centrally located but were, rather, distributed across a computer network. While the WAIS system is no longer widely employed, the ability to perform vast searches and index findings efficiently across the web is at the heart of the Internet Archive project.

With the Internet Archive, Kahle envisioned a way to give the Internet a “memory” of its contents–a stable repository of information that would remain intact even if individual web authors or publishers decided to pull the plug on any one page. As the “About the Internet Archive” web page states:

“The Internet Archive is working to prevent the Internet–a new medium with major historical significance–and other ‘born-digital’ materials from disappearing into the past” (http://www.archive.org/about/about.php).

Although the project has not archived every web page ever authored, it takes selective “snapshots” of the entire web (sixty billion pages every two months), and its resources have grown enormously, even geometrically. While the IA in its earliest stages started with mere megabytes and gigabytes of data, it now boasts one full petabyte of searchable information, that is, ten to the fifteenth power (10^15) bytes, which is larger than the text collection maintained by the Library of Congress.
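To give a sense of the scale involved, the figures above can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it uses decimal (SI) units, and the sixty-day length of "two months" is an assumption made for round numbers, not a figure from the Archive.

```python
# Decimal (SI) storage units, in bytes.
MEGABYTE = 10 ** 6
GIGABYTE = 10 ** 9
PETABYTE = 10 ** 15

# How many gigabytes fit in the one petabyte the Archive reports.
gigabytes_per_petabyte = PETABYTE // GIGABYTE

# Rough crawl rate implied by "sixty billion pages every two months".
PAGES_PER_CRAWL = 60_000_000_000
DAYS_PER_CRAWL = 60  # assumption: two months taken as sixty days
pages_per_day = PAGES_PER_CRAWL // DAYS_PER_CRAWL

print(gigabytes_per_petabyte)  # 1000000
print(pages_per_day)           # 1000000000
```

On these assumptions, a petabyte is a million gigabytes, and the stated crawl rate works out to roughly a billion pages a day.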

In addition to maintaining vast amounts of information, the Internet Archive is committed to making its archived contents available to the public. Although there is a lag between the Internet Archive’s initial crawl and its subsequent indexing of what it collects, all of its indexed information is fully and freely accessible through a simple interface that allows users to search for earlier versions of websites that have either changed over time or become defunct. This interface is called the “WayBack Machine,” a name taken from “Peabody’s Improbable History,” a recurring short on the 1960s television program The Rocky and Bullwinkle Show. The WayBack Machine lets the user enter a URL into an empty field; this URL is sent to the Internet Archive’s database, which matches it against its archives and returns snapshots of any corresponding sites that the IA has indexed since 1996. Additionally, a user who has the necessary programming skills and wishes to access information that has not yet been indexed can search the Archive’s data in a more “hands-on” manner, without the benefit of the WayBack Machine interface.
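The lookup described above (a URL goes in, dated snapshots come back) can be illustrated with a toy model. The sketch below is not the Archive's actual software or data format: the ARCHIVE dictionary, the function names, and the sample pages are all hypothetical, and a small in-memory table stands in for the real index.

```python
from bisect import bisect_right

# Hypothetical in-memory stand-in for the archive's index: each URL maps
# to a list of (capture date, stored page) pairs, sorted by date.
ARCHIVE = {
    "http://example.com/": [
        ("1996-12-01", "<html>first version</html>"),
        ("2001-06-15", "<html>redesign</html>"),
        ("2005-11-30", "<html>latest capture</html>"),
    ],
}

def list_snapshots(url):
    """Return every capture date on record for a URL (empty if none)."""
    return [date for date, _ in ARCHIVE.get(url, [])]

def snapshot_as_of(url, date):
    """Return the most recent capture made on or before `date`, or None."""
    captures = ARCHIVE.get(url, [])
    dates = [d for d, _ in captures]
    # ISO-formatted dates sort correctly as plain strings.
    i = bisect_right(dates, date)
    return captures[i - 1] if i else None

print(list_snapshots("http://example.com/"))
print(snapshot_as_of("http://example.com/", "2003-01-01"))
```

Asking for the page as of early 2003 returns the 2001 capture, since that is the latest snapshot taken before the requested date; a URL that was never crawled simply returns nothing.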

Image of the WayBack Machine Interface

Research Context:
In terms of its technological capabilities, the Internet Archive is well situated within the research context of information storage and retrieval known as data mining. Not only does the Archive make use of “crawling” technology–i.e., data mining technology that sends out probing entities known as “crawlers” or “web bots” to search, index, and catalogue current information on the world wide web–it also employs this same technology to store a stable repository of websites that have changed over time, websites that have become outdated “cobwebs,” and even websites that have, for whatever reason, become defunct.
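The general idea of crawling can be sketched in miniature. The example below is a generic breadth-first crawl over a hypothetical four-page "web" held in memory; it makes no claim about how the Archive's own crawler is built, and the WEB table, the crawl function, and the example addresses are invented for illustration.

```python
from collections import deque

# Hypothetical miniature "web": each page lists the pages it links to.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seed, fetch_links, archive):
    """Breadth-first crawl from `seed`, recording each page exactly once."""
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        links = fetch_links(url)
        archive[url] = list(links)  # keep a copy of what was found at crawl time
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archive

snapshot = crawl("a.example/", lambda u: WEB.get(u, []), {})

# A page that later disappears from the live "web" survives in the snapshot.
del WEB["c.example/"]
print("c.example/" in WEB)       # False
print("c.example/" in snapshot)  # True
```

The last lines show the point made above about defunct sites: once a page has been recorded, removing it from the live web does not remove it from the stored copy.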

Because its technology foregrounds issues related to what has been frequently criticized as the immateriality of online works, the Internet Archive is of additional interest to the study of online reading practices.

Technical Analysis:
Like Google, Yahoo, and Excite, the Internet Archive makes use of an impressive array of data mining technologies to maintain its information. One notable distinction of the technology driving the Internet Archive, however, is its transparency. IA developers have made tangible, sustained efforts to explain and make available the technical specifications of their searching technology, making both introductory comments and more detailed information and schematics for programmers publicly available on their web site. Additionally, they maintain several forums devoted to discussions and questions about their specific technologies. One such technology is the Internet Archive crawler, “Heritrix,” which is “an open source archival quality web crawler” that searches the web in a variety of ways in its cataloguing efforts:

Heritrix…is an archaic word for heiress (woman who inherits). Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt. (http://crawler.archive.org).

Because of the efficacy of the Heritrix crawler, the Internet Archive has seen an exponential growth of its resources, such that it now has over a petabyte of information to contend with, all of which needs to be maintained, indexed, and stored. To address storage needs in particular, the Internet Archive has designed and trademarked a unique storage system known as the “petabox.”

Image of the “Petabox”

Evaluation of Opportunities/Limitations for the Transliteracies Project:
“What is a country without a memory of its cultural heritage? Internet libraries are the place to preserve the aspect of a country’s heritage that exists on the Internet” (From the Internet Archive’s “About Us” page).

The above quote from the Internet Archive’s “About Us” page expresses well why the Archive might be an object worthy of further study and investigation by the Transliteracies Project. While traditional libraries have served as valuable guardians of the cultural heritage contained in print, there has yet to be an effective and systematic approach to preserving cultural artifacts that have manifested–or, to use the words of Brewster Kahle, “been digitally born”–on the Internet.

Although Kahle’s and others’ efforts appear a laudable step in this direction, one wonders how selective, objective, and reliable this memory really is. For example, it is worth noting that while the initial purposes of the Internet Archive were to maintain a stable repository of information published on the web and grant free public access to this same information, the project has seen some notable, perhaps unforeseen, consequences in American legal proceedings.

In 2002, for example, the Internet Archive received pressure from lawyers for the Church of Scientology, who demanded that certain websites that were unfavorable to Church interests be removed from the WayBack Machine. The Internet Archive responded to the Church’s demands and removed the offending pages. Additionally, in late 2005, the Internet Archive removed free archived sound recordings of Grateful Dead concerts in response to pressure from several of the band’s members.

While the above two cases are examples of organizations seeking to expunge information from the Internet Archive’s public record, equally interesting is a 2004 court case (“Telewizja Polska USA, Inc. v. Echostar Satellite”), in which snapshots found through the WayBack Machine’s search function were admitted as evidence by the court.

Additionally, the regulation and removal of sensitive archival documents on this burgeoning archive may prove troubling in ways that are not as applicable to print repositories. For example, if something is admitted to the public record in print, the material nature of a print document helps to ensure that the item will endure over time. There does not seem to be such a guarantee in regard to the contents of the Internet Archive. In fact, the Archive has chosen to distance itself from overly controversial web sites. As the Archive states in the Frequently Asked Questions section of its website:

The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine. (From the FAQ page.)
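The exclusion mechanism the FAQ describes can be tried out with Python's standard urllib.robotparser module. This is a minimal sketch of how any well-behaved crawler might consult a robots.txt file, not the Archive's own code; the may_archive helper is hypothetical, and the "ia_archiver" user-agent name is an assumption that should be checked against the Archive's current documentation.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt asking one crawler to stay away from the whole site.
# "ia_archiver" is assumed here to be the name the Archive's crawler uses.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def may_archive(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_archive(ROBOTS_TXT, "ia_archiver", "http://example.com/page.html"))  # False
print(may_archive(ROBOTS_TXT, "otherbot", "http://example.com/page.html"))     # True
```

Note that the protocol is purely voluntary: the file states the site owner's wishes, and it is up to each crawler to honor them.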

In addition to its respect for the robots exclusion protocol, the Internet Archive will, as the Scientology case indicates, remove controversial web pages from its database at the behest of outside interests. If the Internet Archive is truly aiming to provide an objective and stable repository for the changing contents of pages published publicly on the world wide web, then should people really be able to opt out of its searches? Additionally, if the Internet Archive is really offering a stable and reliable “public record,” should anyone be able to demand the removal of public documents that they find objectionable?

Despite these questions, the assurance of a stable, online repository that makes a systematic index and keeps an honest record of the Internet’s historical contents might go a long way to assuage fears about the immaterial and ephemeral nature of online texts. How this might, in turn, affect future reading practices, an issue that is quite central to the Transliteracies Project, remains to be seen.

Resources for Further Study:

The Internet Archive’s Home Page

The WayBack Machine

Information about the Heritrix Crawler

CNET News Article about the Scientology Controversy

Point(s) for Expansion:
The Internet Archive’s contribution to the Open Content Alliance

tl, 02.12.06