Blogdex is a research project from the MIT Media Laboratory that traces the diffusion of content, represented as hypertext links, through blogs over time.
Programs such as Blogdex offer a window into the networking structure of the blogging community, an opportunity to systematically analyze large textual datasets, and a way to think about meaning in the online environment.
Blogdex analyzes data from weblogs, or blogs, as they are commonly known. Blogs are websites, generally produced by a single individual, that contain regular postings on content of the blogger’s choice. In the standard blogging format, bloggers include hypertext links to other sources in their commentary. Blogs are generally updated frequently, often multiple times a day, and formatted so that the most recent post appears at the top of the screen, followed in reverse chronological order by earlier posts. Blogging programs allow readers to publish comments (including their own links in their comments) on the blog’s site in response to posts.
The blogosphere is vast and increasingly popular: at the 2004 U.S. national political conventions, bloggers received press access and their own press area. It is also often self-referential, with bloggers reading and posting to other blogs.
Begun in 2001 at the MIT Media Lab, Blogdex is a program that documents part of how the blogging network works. It maintains a database of blogs (bloggers are invited to add their own blogs to the database) and crawls each blog whenever it is updated, capturing all links included in the posts. Additionally, Blogdex tracks data for each blog in its database, including when the blog was added, the number of links pointing to it, and the number of links it has posted.
Blogdex offers two forms of data analysis. First, Blogdex allows researchers to consider the blogging database in terms of links. Blogdex tracks links in the blogs, ranking them by frequency of appearance and tracing their inclusion in blogs over time, capturing when and where they show up. Blogdex considers links a proxy for the content being discussed in a given post. In this way, the diffusion of links across blogs, over time, maps the diffusion of content through the blogosphere.
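This frequency-and-diffusion bookkeeping can be illustrated with a short sketch. Everything here is hypothetical: the data structures, function names, and in-memory storage are my own illustration, not Blogdex's actual implementation, which is not documented in this description.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical in-memory store of link "sightings"; Blogdex's actual
# storage and schema are not described in the source and will differ.
sightings = []  # list of (link_url, blog_id, timestamp) tuples

def record_link(link_url, blog_id, timestamp):
    """Record that a blog post included a given link at a given time."""
    sightings.append((link_url, blog_id, timestamp))

def rank_links():
    """Rank links by how many distinct blogs have posted them."""
    blogs_per_link = defaultdict(set)
    for link, blog, _ in sightings:
        blogs_per_link[link].add(blog)
    return sorted(blogs_per_link.items(),
                  key=lambda item: len(item[1]), reverse=True)

def diffusion_timeline(link_url):
    """Chronological trail of (timestamp, blog_id) showing where and
    when one link appeared, i.e. its diffusion through the blogosphere."""
    return sorted((ts, blog) for link, blog, ts in sightings if link == link_url)
```

The essential idea is that a link's popularity is the count of distinct blogs carrying it, while its diffusion is the time-ordered sequence of those appearances.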
Second, Blogdex allows researchers to look at the data in terms of individual blogs. Blogdex lists all links in a given blog, showing the popularity of each link in the larger blogging community. Again considering links a proxy for content, this data communicates the extent to which a particular blog covers unique content.
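One could operationalize this notion of "unique content" roughly as follows. The scoring rule and its rarity threshold are my own illustration, not Blogdex's published method:

```python
def uniqueness_score(blog_links, popularity):
    """Share of a blog's links that appear on few other blogs.

    blog_links: the links posted by one blog.
    popularity: mapping from link to the number of blogs posting it.
    A link counts as "rare" (hypothetical threshold) if at most
    2 blogs in the community carry it.
    """
    if not blog_links:
        return 0.0
    rare = sum(1 for link in blog_links if popularity.get(link, 0) <= 2)
    return rare / len(blog_links)
```

A blog whose links are mostly absent from the community-wide rankings would score near 1.0; a blog that reposts only the most popular links would score near 0.0.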
In an interesting twist, Blogdex has a blogging section of its own where it posts and receives comments on the program.
There is extensive work on social networks and the diffusion of information. Currently, there are two weaknesses in this field that programs like Blogdex address. First, much social network research has depended on self-reported ties, which can introduce reporting bias. Blogdex takes as its data content that is not self-reported, avoiding this issue.
Second, because the blogging community does not have an easily-accessible physical location or a definable population, it cannot be studied as a social network using traditional methods such as ethnography, surveys, or interviews. Blogdex does not require a confined community, physical proximity, or access to individuals. Using a method that does not depend on these components, Blogdex is able to capture data on the blogging social network.
Blogdex runs a crawler over all blogs in its database to capture any newly added links and applies a ranking algorithm to produce the ranked list of links. Its dynamic web pages are built with Apache::ASP, a Perl module for server-side scripting under Apache.
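The crawl-and-extract step can be sketched briefly, here in Python rather than Blogdex's Perl stack. The class and function names are illustrative only, and real blog crawling would also need fetching, politeness, and error handling that this sketch omits:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def new_links(html, seen):
    """Return links in this version of a page that were not captured
    on a prior crawl, and mark them as seen for the next pass."""
    parser = LinkExtractor()
    parser.feed(html)
    fresh = [link for link in parser.links if link not in seen]
    seen.update(fresh)
    return fresh
```

Run against each blog on every update, the `seen` set plays the role of the crawler's memory, so only links added since the last crawl are recorded.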
Evaluation of Opportunities/Limitations for the Transliteracies Topic:
Blogdex is an example of a program of interest to the Transliteracies project in several ways. With Blogdex, we can gain insight into the functioning of blogging communities, capture and interpret large quantities of textual data, and think about how we understand and measure meaning in the online context.
We can use Blogdex to understand the networks of bloggers. By tracing the ways information travels, that is, the paths it takes according to Blogdex, we can understand the social network construction of the online blogging world. These data can be compared with findings about offline networks to address whether and how the online environment functions differently from offline environments. Data of this sort can reveal the structure of online networks. Are they hierarchical or flat, static or dynamic, a single universal network or multiple discrete networks?
We can also look at the Blogdex data in terms of the content in the links. Link content identifies which issues have contemporary resonance, taking the pulse of a disaggregated community of bloggers and discovering what topics are salient and at what time.
Part of the reason we can do this stems from the capacity of Blogdex to analyze large quantities of textual information systematically. The blogosphere is constantly growing and prohibitively large for traditional social network analysis methods. Blogdex is a useful example of a computer-based approach to distilling the content of huge quantities of text.
Given the innovation and value of an approach that makes enormous datasets more manageable, it is important to query the assumptions underlying these analyses. Blogdex promises to capture the diffusion of content through the blogging community, where content is understood as represented by the links included in blog posts. While clearly an important step toward extracting meaning from online texts, taking links as a proxy for content is hardly exhaustive. Links, and their absence, can be interpreted in more than one way. For example, we cannot assume that the absence of a link means the blogger never discusses the issue that link covers. Likewise, Blogdex does not capture link sequencing, as when Blogger A links to a news story and Blogger B then links to A's post about that story. There are further limits to the depth of information we can glean from Blogdex data. We cannot, for instance, know the tone of the discussion surrounding a particular link: was the blogger supportive, indifferent, or opposed? We need to think about other ways to categorize, capture, and understand meaning in large text-based datasets.
Resources for Further Study:
Points for Expansion:
Other similar programs:
- Daypop searches and indexes the “living web” (sites that update at least once a day; currently approximately 59,000 sites).
- Technorati is a real-time search engine that tracks 32.7 million sites and 2.2 billion links in the blogosphere.
- Popdex crawls over 14,000 sites to determine the most popular links on the Internet.