Research Report by Lindsay Thomas
The goal of the Social Networks and Archival Contexts Project (SNAC) is to rethink the ways in which primary humanities resources are described and accessed. A collaborative project from the Institute for Advanced Technology in the Humanities at the University of Virginia, the School of Information at the University of California at Berkeley, and the California Digital Library, SNAC uses the new standard Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF) to “unlock” the descriptions of the creators of archives from the records they have created. The project aims to create open-source tools, not yet released, that allow archivists to separate the process of describing people from that of describing records and to build a prototype online platform, released in December 2010 and currently in alpha stage, that links descriptions of people to one another and to descriptions of a wide variety of resources. The project received $348,000 over two years starting in May 2010 from the National Endowment for the Humanities, and the project team is directed by Daniel Pitti and Worthy Martin and also includes Ray Larson, Brian Tingle, Adrian Turner, and Krishna Janakiraman [i].
Traditionally, archivists describe the creators of records with the description of the records themselves. Finding aids, which are descriptions of archival collections, include information on the provenance or the context within which a particular collection was created, and this contextual information centers on identifying and describing the creator of the records and the time and place in which the creator was active. Such biographical and historical information is documented in the finding aid through formal references to other individuals and groups associated with the creator or in the description of the records themselves. In this way, finding aids “contain the names of people, corporate bodies, and families that are connected in some manner to the creator, which makes them an excellent documentary source of information on the professional and social networks within which record creators were active” [ii]. The traditional way in which archivists link detailed descriptions of the creators of collections together with the description of the collections themselves makes this rich historical and biographical information difficult to access. The goal of SNAC is to separate these descriptions of people, families, and corporate bodies from finding aids and link them together in order to deepen our understanding of primary humanities sources and both the people these sources describe and the people who created these descriptions.
To this end, SNAC uses the newly adopted archival standard EAC-CPF, which is an XML schema for encoding contextual information about people, families, and corporate bodies in archival materials. EAC-CPF was created “to standardize the encoding of descriptions about agents to enable the sharing, discovery and display of this information in an electronic environment” [iii]. EAC-CPF supports the linking of information about one creator or agent to other creators or agents and descriptions of records and thus makes it possible for SNAC to separate information about the creators of records from the records themselves. One goal for SNAC is the creation of open-source tools for archivists that make such separation and extraction easy. At the time of this writing, these tools have not yet been released.
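To make the idea of a standalone description of a creator more concrete, the following is a minimal Python sketch of reading one such record. It is an illustration only, not SNAC's tooling: the sample XML borrows a few EAC-CPF element names but omits namespaces and most required elements, the `role` attribute and the function `parse_record` are assumptions made for this example, and the collection title is a placeholder.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record. Element names echo EAC-CPF, but the sample
# is not schema-valid; the "role" attribute is an assumption for illustration.
SAMPLE_RECORD = """
<eac-cpf>
  <cpfDescription>
    <identity>
      <entityType>person</entityType>
      <nameEntry><part>Edison, Thomas A. (Thomas Alva) 1847-1931</part></nameEntry>
    </identity>
    <relations>
      <cpfRelation role="associatedWith">
        <relationEntry>Benny, William A.</relationEntry>
      </cpfRelation>
      <resourceRelation role="creatorOf">
        <relationEntry>Placeholder collection title</relationEntry>
      </resourceRelation>
    </relations>
  </cpfDescription>
</eac-cpf>
""".strip()


def parse_record(xml_text):
    """Return the entity name, entity type, and relations of one record."""
    root = ET.fromstring(xml_text)
    name = root.findtext("./cpfDescription/identity/nameEntry/part")
    entity_type = root.findtext("./cpfDescription/identity/entityType")
    relations = []
    # Each child of <relations> links the entity to another entity or a resource.
    for rel in root.find("./cpfDescription/relations"):
        relations.append((rel.tag, rel.get("role"), rel.findtext("relationEntry")))
    return {"name": name, "entity_type": entity_type, "relations": relations}


record = parse_record(SAMPLE_RECORD)
for kind, role, target in record["relations"]:
    print(f"{record['name']} --{role}--> {target} ({kind})")
```

The point of the sketch is that the description of the person lives in its own record and refers outward to other people and to resources, rather than being embedded in the description of a single collection.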
Another goal is the development of a prototype access system that links descriptions of people to one another and to other resources. For the creation of this “historical resource and access system,” SNAC has derived or will derive EAC-CPF records from existing finding aids from the Library of Congress, the Online Archive of California, the Northwest Digital Archive, and Virginia Heritage [iv]. To date, SNAC has processed all of these except the Virginia Heritage finding aids. The EAC-CPF records derived from these finding aids are then compared to one another to combine records for the same agents and, additionally, are compared to authority records from the Library of Congress, the Getty Vocabulary Program, and OCLC Research. The prototype system was released in mid-December 2010 in alpha stage [v]. Users can search through all of the data, for people, for corporate bodies, and for families by clicking on the appropriate tab at the top of the prototype’s home page. Once users have selected the appropriate filter, they can search for specific names and/or keywords using the search bar or using the alphabetical index. Users can also view the “Top Occupations” and “Top Subjects” – determined by the number of times a particular occupation or subject is mentioned in connection with people, corporate bodies and families in the finding aids processed – for each class of data. Users can also discover the total number of entries in each class by clicking on the appropriate tab: for example, by clicking on the “All” tab, users see that the total data set includes 123,920 names of people, corporate bodies, and families.
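The sources consulted here do not spell out how SNAC decides that two derived records describe the same agent. As a rough illustration of the general idea only, the Python sketch below groups records whose names are identical after a simple normalization. The normalization rule, the function names `normalize_name` and `merge_records`, and the source labels are all assumptions made for this example; a real system would also need to handle variant spellings and authority-file matches.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = name.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def merge_records(records):
    """Group records that share a normalized name (a deliberately naive rule)."""
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    for rec in records:
        key = normalize_name(rec["name"])
        merged[key]["names"].add(rec["name"])
        merged[key]["sources"].add(rec["source"])
    return dict(merged)


# Hypothetical records derived from three different finding aids.
records = [
    {"name": "Edison, Thomas A. (Thomas Alva) 1847-1931", "source": "finding-aid-1"},
    {"name": "edison, thomas a (thomas alva), 1847-1931", "source": "finding-aid-2"},
    {"name": "Edison, Mrs. Thomas A.", "source": "finding-aid-3"},
]

merged = merge_records(records)
for key, entry in merged.items():
    print(key, "<-", sorted(entry["sources"]))
```

In this sketch the first two records collapse into one entry drawn from two finding aids, while the third remains separate.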
When searching for a specific person, for example, users can choose how to display the search results by hovering over the search bar. If users want to limit the search to only those records associated with the specific person’s identity, they can do this by clicking on the drop down menu that appears next to the “advanced…limit to section” search options and selecting “identity.” The terms listed in this drop down menu are tags determined by Encoded Archival Context (EAC) standards. Users can view a Google spreadsheet with descriptions of these terms by clicking on the word “section” next to the dropdown box.
A search for the name “Thomas Edison,” for example, results in 4,873 EAC-CPF records when the default search term, “cpfDescription,” is selected. This is the number of records found in all of the finding aids processed that mention the name Thomas Edison. Users can narrow these results by “Occupation” or “Subject,” which are listed to the left of the EAC-CPF search results. Limiting the results using the advanced search options drop down menu will also change the total number of results. Choosing the option “identity” in the drop down menu, for example, will only result in 3 records: two records for Thomas Edison himself and one for his wife (listed as “Edison, Mrs. Thomas A.”). The identity search tag, then, limits results to those only concerning the entity, in this case Thomas Edison, and those using the entity’s name. As we can see in the figure above, the advanced search tags limit the results in a variety of ways. Clicking on the second identity search result, “Edison, Thomas A. (Thomas Alva) 1847-1931” results in a page, catalogued according to the Anglo-American Cataloging Rules, Second Edition (AACR2), listing biographical information (although none exists for this record) and the related entries associated with the selected record. As we see in the figure below, users can choose to view alternative names for Thomas Edison, the source EAC-CPF XML code, a random record, or users can report a data issue to SNAC. Users can also choose to view the specific archival collections from which the entity record has been taken (“Archival Collections”), the creators of archives that contain the entity name (“People”), corporate bodies that reference the entity name (“Corporate Bodies”), the specific records that contain the specific entity name (“Resources”), and links to related authority records (“Linked Data”). In this way, SNAC’s EAC-CPF record for Thomas Edison is linked to other entities described in other EAC-CPF records, producing a social network of sorts. 
What’s more, the type of relationship between entities is listed next to the entity entry. As we see in the figure below, for example, “Benny, William A.” is described as “associatedWith” Edison. Although SNAC plans to “[experiment] with social graphs and visualizations that allow users to explore and navigate the social and professional relations among the described entities,” this option is not yet available [vi].
Users can also choose to search for “Thomas Edison” using the “Corporate Body” tag. A search for Edison in this category results in 1,539 EAC-CPF records using the default cpfDescription search tag. These results include records about the Edison Electric Light Company, the Thomas A. Edison Corporation, the Thomas Alva Edison Foundation, and about corporations that reference or were associated with Edison’s company, corporation and/or foundation. As with the “Person” tab, users can narrow the corporate body search results by subject. Selecting one result, for example “Thomas A. Edison, Inc.” results in a page that, as with the person tab, lists biographical information and related entries. This page also lists the subjects associated with this record on the left.
Finally, users can search for specific families using the “Family” tab. A search for Thomas Edison under family yields no results, but a search for “Abraham Lincoln,” for example, yields 11 EAC-CPF results. As with the person and corporate body tabs, users can narrow these results by occupation and subject. Selecting a record to view will also result in a page that lists biographical information (if available) and related entries.
SNAC consists of a set of tools and a platform for humanities archival research. As such, it is designed for a smaller audience, namely archivists and scholars, than RoSE, which has the potential for use at a variety of levels and in a variety of settings. Within its own context, SNAC has the potential to transform the ways in which archival research is done; harnessing the newly released EAC-CPF standards in order to separate descriptions of creators of archives from descriptions of archives themselves could add much to existing knowledge about archival collections and the contexts in which they were created. The well-defined goals outlined on the SNAC website and in the SNAC NEH Proposal are good models for RoSE as we consider the place of RoSE. The SNAC project team is very specific about what it wants to accomplish, the work SNAC does and the gap SNAC fills in scholarly communities. Additionally, the ways in which SNAC processes, compares, and matches finding aids to ensure broad coverage while at the same time eliminating duplicate records can serve as a model for dealing with large amounts of data. SNAC is able to efficiently cross-reference, combine, and relate its records to one another intelligibly and quickly. However, the social network aspect of SNAC seems, at this point, to be more of an afterthought than a main focus. In contrast to RoSE, which is a true social environment – and which allows for user participation – SNAC is more of a tool for scholars and archivists that also has the potential to be a social network.
SNAC uses Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF) for its descriptions of entities; EAC-CPF is “an international standard for authority control of corporate body, person, and family name entries and biographical or historical description expressed as an Extensible Markup Language (XML) schema” [vii]. EAC-CPF is designed to support many International Standard Archival Authority Record for Corporate Bodies, Persons, and Families (ISAAR) descriptive components, including identity, description, entries for related entities, entries for related resources by or about the described entity, entries for function authority control records, and information used in maintaining the record. The primary source of name and biographical information for the EAC-CPF records derived in SNAC is finding aids represented in Encoded Archival Description (EAD), which is “an international communication standard for archival description expressed as an XML schema” [viii]. To interrelate EAC-CPF entity descriptions with one another and with EAD-encoded finding aids, SNAC uses the Resource Description Framework (RDF). RDF, like XML, is a web standard that is widely supported with a variety of open-source tools. RDF facilitates the processing and linking of data and easily connects the SNAC access system to DBPedia, a knowledge base of structured information extracted from Wikipedia that contains over 312,000 people and 140,000 organizations [ix].
SNAC’s prototype access system is based on Extensible Text Framework (XTF), an open-source platform for publishing XML-encoded documents and data. SNAC also uses Extensible Stylesheet Language Transformations (XSLT) and XML Path Language (XPath) to extract information from EAD-encoded finding aids, Machine-Readable Cataloguing (MARC) authority data, and the Getty Vocabulary Program’s Union List of Artist Names (ULAN) data, and to process and render this information as EAC-CPF records.
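As a small illustration of path-based extraction from a finding aid, the Python sketch below pulls creator and access-point names out of a toy EAD-like document. It is not SNAC's XSLT pipeline: it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module, the sample document is a simplified, hypothetical fragment with placeholder content, and the function `extract_names` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical finding aid. The element names are taken from EAD,
# but the fragment is far from a complete or valid EAD document.
SAMPLE_FINDING_AID = """
<ead>
  <archdesc>
    <did>
      <unittitle>Placeholder collection title</unittitle>
      <origination>
        <persname>Edison, Thomas A. (Thomas Alva) 1847-1931</persname>
      </origination>
    </did>
    <controlaccess>
      <persname>Benny, William A.</persname>
      <corpname>Thomas A. Edison, Inc.</corpname>
    </controlaccess>
  </archdesc>
</ead>
""".strip()


def extract_names(ead_text):
    """Return (creators, other named entities) found in one finding aid."""
    root = ET.fromstring(ead_text)
    # Creators are recorded under <origination>.
    creators = [el.text.strip() for el in root.findall(".//origination/persname")]
    # Other named entities appear as access points under <controlaccess>.
    mentioned = [(el.tag, el.text.strip()) for el in root.findall(".//controlaccess/*")]
    return creators, mentioned


creators, mentioned = extract_names(SAMPLE_FINDING_AID)
print("creators:", creators)
print("mentioned:", mentioned)
```

Each extracted name could then seed its own entity record, with the finding aid kept as a related resource.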
Evaluation of Opportunities and Limitations for RoSE:
In addition to the project team’s clear grasp of their goals for SNAC and SNAC’s ability to efficiently process, compare, and match large amounts of data, SNAC is also a useful model for RoSE because of the source material it uses. RoSE has imported data from Project Gutenberg and Yago, both large and easily accessible knowledge bases. SNAC, on the other hand, because it is coming out of the archivist community, uses finding aids for its source material. Finding aids are a potentially rich source of data for RoSE, especially for biographical information on people. They contain a broad range of historical and biographical information, and thanks to the new EAC-CPF standards and projects like SNAC, this information is becoming more easily available. What’s more, importing data from scholarly finding aids into RoSE would provide a nice counter-balance – and, in some instances no doubt, a nice corrective – to the data imported from Yago, which harvests its information from Wikipedia.
Despite SNAC’s potential as a model in grant-writing and data-harvesting for RoSE, the two projects are quite different in many respects. One of the most interesting aspects of SNAC, and that which separates it from other archival tools, is its use of EAC-CPF to separate descriptions of individuals and groups of individuals from the records in which they are contained and to link these descriptions together. In this way, SNAC aims to create a social network of historical and living people and the groups to which they belong and/or belonged. This is also, of course, one of RoSE’s goals. However, at this early stage in the development of SNAC, this social network is more implicit than explicit. Much historical and biographical data has already been processed and linked together, but SNAC has not yet developed its visualization tools. Data visualization has to some extent been at the heart of the idea behind RoSE, and much of RoSE’s potential seems to lie in its ability to efficiently and quickly provide snapshots of the many connections between documents and people. SNAC, on the other hand, has focused first on linking entities together textually rather than visually, creating lists rather than diagrams of connections. This has the effect of suggesting rather than depicting social networks.
Both RoSE and SNAC are invested in describing the many types of relationships that link entities together (author, collaborator, husband, etc.). In RoSE, over 70 options are available for describing the kind of relationships between people alone. While users must pick from this wide variety of options for describing relationship types in RoSE, in SNAC, relationship types are described according to EAC-CPF standards, which allow for flexible but limited descriptions of relationship types [x]. Because of this, descriptions of the relationships between people are often limited to vaguer phrases like “associatedWith” and “correspondedWith,” and the relationship between a person and a document – usually a record – is often simply “creatorOf.” The goal for SNAC, then, is not to describe relationships between entities in much detail but rather to use a controlled vocabulary to connect entities to form a large but less detailed network. What’s more, although RoSE automatically harvests data from Yago and Project Gutenberg, RoSE also allows users themselves to input new data, add detail to existing data, and describe relationship types; in SNAC, all relationship types and data are given to the user. Thus, RoSE is both a user- and a machine-generated social network: users can insert themselves and their interests into various historical and contemporary worlds. SNAC, on the other hand, is a tool and a platform to be used for certain scholarly ends but not to be participated in.
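The trade-off described above, a small controlled vocabulary yielding a large but coarse network, can be sketched in a few lines of Python. This is a hedged illustration, not SNAC's data model: the vocabulary is limited to the three relationship labels quoted in this report, the function `build_network` is hypothetical, and the second and third triples are invented placeholders.

```python
from collections import defaultdict

# Only the three relationship labels quoted in this report; the full EAC-CPF
# vocabulary is documented in the tag library cited in note [x].
ALLOWED_RELATIONS = {"associatedWith", "correspondedWith", "creatorOf"}


def build_network(triples):
    """Build an adjacency list, rejecting labels outside the vocabulary."""
    network = defaultdict(list)
    for subject, relation, obj in triples:
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"unknown relation type: {relation}")
        network[subject].append((relation, obj))
    return dict(network)


# The first triple comes from the prototype example discussed earlier;
# the other two are placeholders for illustration.
triples = [
    ("Benny, William A.", "associatedWith", "Edison, Thomas A."),
    ("Edison, Thomas A.", "correspondedWith", "Placeholder correspondent"),
    ("Edison, Thomas A.", "creatorOf", "Placeholder collection title"),
]

network = build_network(triples)
for subject, edges in network.items():
    for relation, obj in edges:
        print(f"{subject} --{relation}--> {obj}")
```

A label such as “husband,” which RoSE might accept, would be rejected here, which is the sense in which a controlled vocabulary trades descriptive detail for consistency across a large network.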
Finally, because of this difference in how each system thinks about its users, RoSE seems more attuned to pedagogic and public use than SNAC. RoSE can be used in a variety of settings by a variety of people, and its Web 2.0 style is familiar, easy to use, and popular. SNAC is a powerful tool for a narrow audience; EAC-CPF standards are well known within certain groups, but these standards and the vocabulary SNAC uses are not easily translatable to a wider public or a less advanced audience of students. What’s more, RoSE’s prominent use of social network visualizations makes understanding the connections between people, documents, and groups quick and intuitive. SNAC’s emphasis, at this point, on textual rather than visual connections makes discovering these links more difficult.
“Social Networks and Archival Context Project.” Archivology. 21 Aug 2010. Web. 10 Dec 2010.
“The Social Networks and Archival Context Project.” UC Berkeley School of Information. 1 Oct 2010. Web. 10 Dec 2010.
[i] Berger, Sherri, “SNAC project will use archival authority records to expand access,” CDL: California Digital Library, Web, 10 Dec 2010.
[ii] “NEH Proposal,” SNAC: The Social Networks and Archival Context Project, Web, 10 Dec 2010.
[iii] “Welcome to the EAC-CPF Homepage,” Encoded Archival Context – Corporate Bodies, Persons, and Families, Society of American Archivists, Web, 10 Dec 2010.
[iv] “Prototype,” SNAC: The Social Networks and Archival Context Project, Web, 19 Dec 2010.
[v] SNAC’s prototype is currently housed here: http://socialarchive.iath.virginia.edu/xtf/search.
[vi] “NEH Proposal,” SNAC.
[vii] “NEH Proposal,” SNAC.
[viii] “NEH Proposal,” SNAC.
[x] See http://www3.iath.virginia.edu/eac/cpf/tagLibrary/cpfTagLibrary.html#cpfRelationType for further explanation of EAC-CPF tags and relationship types.