This is the fourth post in a series that looks at the End of Term Web Archives captured in 2008 and 2012. In previous posts I’ve looked at the when, what, and where of these archives. In doing so I pulled together the domain names from each of the archives to compare them.
My thought was that I could look at which domains had content in EOT2008 or EOT2012 and compare them to get a very high-level idea of what content was around in 2008 but was completely gone in 2012. Likewise, I could look at new content domains that appeared after 2008. For this post I’m limiting my view to just the domains that end in .gov or .mil because they are generally the focus of these web archiving projects.
Comparing EOT2008 and EOT2012
There are 1,647 unique domain names in the EOT2008 archive and 1,944 unique domain names in the EOT2012 archive, which is an increase of 18%. The two archives have 1,236 domain names in common. There are 411 domains that exist in EOT2008 but are not present in EOT2012, and 708 new domains in EOT2012 that didn’t exist in EOT2008.
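A comparison like this comes down to a few set operations. Here is a minimal sketch, assuming each archive's domains have already been collected into a Python set; the `compare_domains` helper and the toy domain names are hypothetical and are not the real lists behind the numbers above.

```python
def compare_domains(eot2008, eot2012):
    """Return (shared, only_2008, only_2012) as sets of domain names."""
    a, b = set(eot2008), set(eot2012)
    shared = a & b      # present in both archives
    only_2008 = a - b   # present in 2008, missing in 2012
    only_2012 = b - a   # new in 2012
    return shared, only_2008, only_2012


# Toy data for illustration only; these are not the real domain lists.
eot2008_domains = {"example-a.gov", "example-b.gov", "example-c.mil"}
eot2012_domains = {"example-b.gov", "example-c.mil", "example-d.gov", "example-e.mil"}

shared, only_2008, only_2012 = compare_domains(eot2008_domains, eot2012_domains)
print(len(shared), len(only_2008), len(only_2012))
```

The reported counts are consistent with this kind of breakdown: 1,647 minus the 1,236 shared domains leaves 411, and 1,944 minus 1,236 leaves 708.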
The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs. The 411 domains that are present in EOT2008 and missing in EOT2012 account for 3,784,308 URLs, which is just 2% of the EOT2008 total. The 708 domains that appear only in EOT2012 account for 5,562,840 URLs, or 3% of the EOT2012 total.
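The percentages here are simple ratios, and they can be checked from the figures quoted in this post. The sketch below assumes each share is measured against the total of the archive the domains belong to; the `share` helper and the variable names are my own.

```python
def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Figures quoted in this post.
eot2008_total = 160_212_141
eot2012_total = 194_066_940
only_2008_urls = 3_784_308   # URLs in the 411 domains seen only in EOT2008
only_2012_urls = 5_562_840   # URLs in the 708 domains seen only in EOT2012

print(f"only in EOT2008: {share(only_2008_urls, eot2008_total):.2f}%")
print(f"only in EOT2012: {share(only_2012_urls, eot2012_total):.2f}%")
print(f"growth in domain count: {share(1944 - 1647, 1647):.0f}%")
```

Rounded to whole numbers, these come out to the 2%, 3%, and 18% figures given above.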
The thirty domains with the most captured URLs that were present in the EOT2008 collection but not in EOT2012 are listed in the table below.
The thirty domains with the most URLs from EOT2012 that weren’t present in EOT2008.
Shared domains that changed
There are 1,236 domains that are present in both the EOT2008 and EOT2012 archives. I thought it would be interesting to compare those domains and see which ones changed the most. Below are the fifty shared domains that changed the most between EOT2008 and EOT2012.
|Domain|EOT2008|EOT2012|Change|Absolute Change|% Change|
Only 11 of the 50 (22%) resulted in more content harvested in EOT2012 than EOT2008.
Of the eleven domains that had more content harvested for them in EOT2012, five (navy.mil, osd.mil, vaccines.mil, defense.gov, and usajobs.gov) increased by over 1,000% in the amount of content. I don’t know if this is necessarily the result of an increase in attention to these sites, more content on the sites, or a different organization of the sites that made them easier to harvest. I suspect it is some combination of all three of those things.
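The Change, Absolute Change, and % Change columns can be derived from per-domain URL counts. The following is a rough sketch rather than the exact process used for this post: it assumes the counts are held in two dictionaries keyed by domain, and it assumes "changed the most" means the largest absolute change. The `change_table` function and the toy counts are hypothetical.

```python
def change_table(counts_2008, counts_2012, top_n=50):
    """Rank shared domains by absolute change in URL count.

    Each row is (domain, eot2008, eot2012, change, absolute_change, pct_change).
    """
    rows = []
    # Only domains present in both archives can be compared.
    for domain in counts_2008.keys() & counts_2012.keys():
        before, after = counts_2008[domain], counts_2012[domain]
        change = after - before
        pct_change = 100.0 * change / before  # relative to the 2008 count
        rows.append((domain, before, after, change, abs(change), pct_change))
    # Largest absolute change first; ties broken alphabetically by domain.
    rows.sort(key=lambda r: (-r[4], r[0]))
    return rows[:top_n]


# Toy counts for illustration only; these are not real EOT numbers.
counts_2008 = {"example-a.gov": 1000, "example-b.gov": 200, "example-c.mil": 50, "gone.gov": 10}
counts_2012 = {"example-a.gov": 100, "example-b.gov": 300, "example-c.mil": 600, "new.gov": 5}

rows = change_table(counts_2008, counts_2012)
grew = sum(1 for r in rows if r[3] > 0)  # domains with more content in 2012
for row in rows:
    print(row)
print(grew, "of", len(rows), "shared domains grew")
```

Ranking by percentage change instead would favor domains with very small 2008 counts, which is one reason to sort on the absolute number of URLs.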
It should be expected that domains will come into and go out of existence on a regular basis in a large web space like the federal government. One of the things that I think is rather challenging to identify is a list of domains that were present at one given time within an organization. For example, “what domains did the federal government have in 1998?” It seems like one way to come up with that answer is to use web archives. We see based on the analysis in this post that there are 411 domains that were present in 2008 that we weren’t able to capture in 2012. Take a look at that list of the top thirty. Do you recognize any of them? How many other initiatives, committees, agencies, task forces, and citizen education portals existed at one point that are now gone?
If you have questions or comments about this post, please let me know via Twitter.