Comparing Web Archives: EOT2008 and EOT2012 – When

In 2008 a group of institution comprised of the Internet Archive, Library of Congress, California Digital Library, University of North Texas, and Government Publishing Office worked together to collect the web presence of the federal government in a project that has come to be known as the End of Term Presidential Harvest 2008.

Working together this group established the scope of the project, developed a tool to collect nominations of URLs important to the community for harvesting, carried out a harvest of the federal web presence before the election, after the election, and after the inauguration of President Obama. This collection was harvested by the Internet Archive, Library of Congress, California Digital Library, and the UNT Libraries.  At the end of the EOT project the data harvested was shared between the partners with several institutions acquiring a copy of the complete EOT dataset for their local collections.

Moving forward four years the same group got together to organize the harvesting of the federal domain in 2012.  While originally scoped as a way of capturing the transition of the executive branch,  this EOT project also served as a way to systematically capture a large portion of the federal web on a four year calendar.  In addition to the 2008 partners,  Harvard joined in the project for 2012.

Again the team worked to identify in-scope content to collect, this time however the content included URLs from the social web including Twitter and Facebook for agencies, offices and individuals in the federal government.  Because there was not a change in office because of the 2012 election, there was just a set of crawls that occurred during the fall of 2012 and the winter of 2013.  Again this content was shared between the project partners interested in acquiring the archives for their own collections.

The End of Term group is a loosely organized group that comes together ever four years to conduct the harvesting of the federal web presence. As we ramp up for the end of the Obama administration the group has started to plan the EOT 2016 project with a goal to start crawling in September of 2016.  This time there will be a new president so the crawling will probably take the format of the 2008 crawls with a pre-election, post-election and post-inauguration set of crawls.

So far there hasn’t been much in the way of analysis to compare the EOT2008 and EOT2012 web archives.  There are a number of questions that have come up over the years that remain unanswered about the two collections.  This series of posts will hopefully take a stab at answering some of those questions and maybe provide better insight into the makeup of these two collections.  Finally there are hopefully a few things that can be learned from the different approaches used during the creation of these archives that might be helpful as we begin the EOT 2016 crawling.

Working with the EOT Data

The dataset that I am working with for these posts consists of the CDX files created for the EOT2008 and EOT2012 archive.  Each of the CDX files acts as an index to the raw archived content and contains a number of fields that can be useful for analysis.  All of the archive content is referenced in the CDX file.

If you haven’t looked at a CDX file in the past here is an example of a CDX file.

gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:martinelli,%20giovanni&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005312 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AMartinelli%2C+Giovanni&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:schumann-heink,%20ernestine&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005219 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3ASchumann-Heink%2C+Ernestine&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EL5OT5NAXGGV6VADBLNP2CBZSZ5MH6OT - - - 531160983 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:scotti,%20antonio&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005255 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AScotti%2C+Antonio&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 SEFDA5UNFREPA35QNNLI7DPNU3P4WDCO - - - 804325022 LOC-EOT2012-001-20121125003257404-04183-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:viafora,%20gina&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005309 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AViafora%2C+Gina&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EV6N3TMKIVWAHEHF54M2EMWVM5DP7REJ - - - 532966964 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:homer,%20louise&fq[1]=take_composer_name:campana,%20f.%20&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125070122 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AHomer%2C+Louise&fq%5B1%5D=take_composer_name%3ACampana%2C+F.+&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 FW2IGVNKIQGBUQILQGZFLXNEHL634OI6 - - - 661008391 LOC-EOT2012-001-20121125064213479-04227-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz

The CDX format is a space delimited file with the following fields

  • SURT formatted URI
  • Capture Time
  • Original URI
  • MIME Type
  • Response Code
  • Content Hash (SHA1)
  • Redirect URL
  • Meta tags (not populated)
  • Compressed length (sometimes populated)
  • Offset in WARC file
  • WARC File Name

The tools I’m working with to analyze the EOT datasets will consist of Python scripts that either extract specific data from the CDX files where it can be further sorted and counted, or they will be scripts that work on these sorted and counted versions of files.

I’m trying to post code and derived datasets in a Github repository called eot-cdx-analysis if you are interested in taking a look.  There is also a link to the original CDX datasets there as well.

How much

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  Unfortunately the CDX files that we are working with don’t have consistent size information that we can use for analysis but the rough sizes for each of the archives is EOT2008 at 16TB and EOT2012 at just over 41.6TB.

When

The first dimension I wanted to look at was when was the content harvested for each of the EOT rounds.  In both cases we all remember starting the harvesting “sometime in September” and then ending the crawls “sometime in March” of the following year.  How close were we to our memory?

For this I extracted the Capture Time field from the CDX file, converted that into a date yyyy–mm-dd was a decent bucket to group into and then sorted and counted each instance of a date.

EOT2008 Harvest Dates

EOT2008 Harvest Dates

This first chart shows the harvest dates contained in the EOT2008 CDX files.  Things got kicked off in September 2008 and apparently concluded all the way in OCT 2009.  There is another blip of activity in May of 2009.  This is probably something to go back and look at to help remember what exactly these two sets of crawling were that happened after March 2009 when we all seem to remember crawling stopping.

EOT2012 Harvest Dates

EOT2012 Harvest Dates

The EOT2012 crawling started off in mid-September and this time finished up in the first part of March 2013.  There is a more consistent shape to the crawling for this EOT with a pretty consistent set of crawling happening between mid-October and the end of January.

EOT2008 and EOT2012 Harvest Dates Compared

EOT2008 and EOT2012 Harvest Dates Compared

When you overlay the two charts you can see how the two compare.  Obviously the EOT2008 data continues quite a bit further than the EOT2012 but where they overlap you can see that there were different patterns to the collecting.

Closing

This is the first of a few posts related to web archiving and specifically to comparing the EOT2008 and EOT2012 archives.  We are approaching the time to start the EOT2016 crawls and it would be helpful to have more information about what we crawled in the two previous cycles.

In addition to just needing to do this work there will be a presentation on some of these findings as well as other types of analysis at the 2016 Web Archiving and Digital Libraries (WADL) workshop that is happening at the end of JCDL2016 this year in Newark, NJ.

If there are questions you have about the EOT2008 or EOT2012 archives please get in contact with me and we can see if we can answer them.

If you have questions or comments about this post,  please let me know via Twitter.