This post carries on from where the previous post in this series ended.
As a quick recap, this series tries to better understand the EOT2008 and EOT2012 web archives. The goal is to see how they are similar, how they differ, and whether there is anything we can learn that will help with the upcoming EOT2016 project.
The CDX files we are using have a column that contains the Media Type (MIME type) for each URI in the WARC files. A list of the assigned Media Types is available from the Internet Assigned Numbers Authority (IANA) in its Media Type Registry.
This field is inherently “dirty” for a few reasons. It is populated from a field in the WARC record that comes directly from the web server that responded to the initial request. Usually these values are fairly accurate, but there are many cases where they are wrong or at least confusing. Often this is caused by a server administrator, programmer, or system architect who was trying to be clever or who simply misconfigured something.
I looked at the Media Types for the two EOT collections to see if there are any major differences between what we collected in the two EOT archives.
In the EOT2008 archive there are a total of 831 unique MIME/Media Types; in the EOT2012 archive there are a total of 1,208 unique values.
I took the top 20 MIME/Media Types for each archive and merged the two lists to see if there was any noticeable change in what we captured between the two archives. In addition to the raw counts, I also looked at what percentage of the archive a given Media Type represented. Finally, I noted the overall change in those two percentages.
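As a rough illustration of how unique-value counts like these could be produced, here is a minimal Python sketch that tallies the Media Type column of a CDX file. It assumes space-delimited CDX lines with the Media Type in the fourth field (the common `N b a m s k r M S V g` layout); the field position and the sample rows are assumptions for illustration, not details taken from the EOT files.

```python
from collections import Counter

# Assumption: space-delimited CDX lines where the Media Type is the
# fourth field (index 3), as in the common "N b a m s k r M S V g" layout.
MIME_FIELD = 3


def count_media_types(lines, mime_field=MIME_FIELD):
    """Return a Counter mapping each Media Type value to its number of CDX rows."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and the CDX header line (" CDX N b a m ...").
        if not line.strip() or line.lstrip().startswith("CDX "):
            continue
        fields = line.split()
        if len(fields) > mime_field:
            counts[fields[mime_field]] += 1
    return counts


# Made-up sample rows, only to show the shape of the input.
sample = [
    " CDX N b a m s k r M S V g",
    "gov,example)/ 20081101000000 http://example.gov/ text/html 200 AAAA - - 100 0 a.warc.gz",
    "gov,example)/a 20081101000001 http://example.gov/a text/html 200 BBBB - - 100 200 a.warc.gz",
    "gov,example)/b.pdf 20081101000002 http://example.gov/b.pdf application/pdf 200 CCCC - - 100 400 a.warc.gz",
]

counts = count_media_types(sample)
print(len(counts), "unique Media Types")
```

For a real file, any iterable of lines works, so an open file handle can be passed directly in place of `sample`.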
|Media Type||2008 Count||2008 % of Archive||2012 Count||2012 % of Archive||% Change||Change in % of Archive|
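The columns of this table can be derived from two per-archive tallies. The sketch below is one possible way to do it, assuming each archive's counts are held in a `Counter` and that "merging" the two top-20 lists means taking their union; the function name and the toy numbers are hypothetical.

```python
from collections import Counter


def compare_top_types(counts_a, counts_b, n=20):
    """Build comparison rows for the union of the top-n Media Types of two archives.

    Each row is (media_type, count_a, share_a, count_b, share_b,
    pct_change, share_change). Assumes both Counters are non-empty.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    top = {t for t, _ in counts_a.most_common(n)} | {t for t, _ in counts_b.most_common(n)}
    rows = []
    for media_type in top:
        a = counts_a.get(media_type, 0)
        b = counts_b.get(media_type, 0)
        share_a = 100.0 * a / total_a  # percent of the first archive
        share_b = 100.0 * b / total_b  # percent of the second archive
        # Percent change in raw count is undefined when the first count is 0.
        pct_change = 100.0 * (b - a) / a if a else None
        rows.append((media_type, a, share_a, b, share_b, pct_change, share_b - share_a))
    # Order rows by combined count, largest first.
    rows.sort(key=lambda r: r[1] + r[3], reverse=True)
    return rows


# Toy counts, not the real archive numbers.
toy_2008 = Counter({"text/html": 80, "image/jpeg": 20})
toy_2012 = Counter({"text/html": 150, "application/pdf": 50})

for row in compare_top_types(toy_2008, toy_2012):
    print(row)
```

Keeping the change in raw count separate from the change in share of the archive matters because the two archives differ in size: a type can grow in count while shrinking as a percentage.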
Because I like pictures, here is a chart of the percent change.
If we compare the Media Types between the two archives we find that the two archives share 527 Media Types. The EOT2008 archive has 304 Media Types that aren’t present in EOT2012 and EOT2012 has 681 Media Types that aren’t present in EOT2008.
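The overlap figures above are plain set arithmetic, and they are consistent with the totals: 527 + 304 = 831 and 527 + 681 = 1,208. A minimal sketch of that comparison, again assuming each archive's tallies are held in a `Counter` (the helper names and toy values are hypothetical), also covers ranking the types that appear in only one archive:

```python
from collections import Counter


def compare_type_sets(counts_a, counts_b):
    """Split Media Types into shared, only-in-a, and only-in-b sets."""
    types_a, types_b = set(counts_a), set(counts_b)
    return types_a & types_b, types_a - types_b, types_b - types_a


def top_exclusive(counts_a, counts_b, n=10):
    """Return the n most frequent Media Types in counts_a that are absent from counts_b."""
    only_a = Counter({t: c for t, c in counts_a.items() if t not in counts_b})
    return only_a.most_common(n)


# Toy counts, not the real archive numbers.
eot2008 = Counter({"text/html": 90, "image/gif": 7, "text/x-legacy": 3})
eot2012 = Counter({"text/html": 120, "warc/revisit": 40, "video/mp4": 2})

shared, only_2008, only_2012 = compare_type_sets(eot2008, eot2012)
print(len(shared), len(only_2008), len(only_2012))
print(top_exclusive(eot2012, eot2008, n=10))
```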
The ten most frequent Media Types by count found only in the EOT2008 archive are presented below.
The ten most frequent Media Types by count found only in the EOT2012 archive are presented below.
In the EOT2012 archive the team that captured content had fully moved to the WARC format for storing web archive content. The warc/revisit records are records for URLs whose content had not changed across more than one crawl. Instead of storing the content again, the warc/revisit record holds a reference to the previously captured content. That’s why there are so many records with this Media Type.
Below is a table showing the thirty most changed Media Types that are present in both the EOT2008 and EOT2012 archives. You can see both the change in overall numbers as well as the percentage change between the two archives.
|Media Type||EOT2008||EOT2012||Change||% Change|
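The Change and % Change columns can be computed for the shared types along these lines. This is a sketch only: it assumes "most changed" means the largest increase in raw count, which may not be the ordering used for the table, and the function name and toy values are hypothetical.

```python
def most_changed(counts_a, counts_b, n=30):
    """Return up to n rows (media_type, count_a, count_b, change, pct_change)
    for Media Types present in both archives, largest increase first.

    Assumes every stored count is at least 1, so the division is safe.
    """
    shared = set(counts_a) & set(counts_b)
    rows = []
    for media_type in shared:
        a, b = counts_a[media_type], counts_b[media_type]
        change = b - a
        rows.append((media_type, a, b, change, 100.0 * change / a))
    # Assumption: rank by absolute change in count, biggest growth first.
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:n]


# Toy counts, not the real archive numbers.
toy_2008 = {"text/html": 100, "application/postscript": 50, "image/gif": 10}
toy_2012 = {"text/html": 300, "application/postscript": 11, "video/mp4": 5}

for row in most_changed(toy_2008, toy_2012):
    print(row)
```

Sorting on the last element of each row instead would rank by percentage change, which tends to favor types that were rare in the first archive.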
The same data is presented below as a set of graphs. The first shows the change in the number of instances of a given Media Type between the two archives.
The second graph shows the percentage change between the two archives.
A number of Media Types declined in both count and percentage, but those declines are not as dramatic as the changes identified above. Of note, between 2008 and 2012 there was a 100% decline in content with a Media Type of application/x-cgi and a 78% decrease in files that were application/postscript.
If you have questions or comments about this post, please let me know via Twitter.