This post continues the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discussed when content was harvested and what kind of content was harvested and included in the archives.
In this post we will look at where content came from, specifically the data held in the top level domains, domain names and sub-domain names.
Top Level Domains
The first thing to look at is the top level domains for all of the URLs in the CDX files.
In the EOT2008 archive there are a total of 241 unique TLDs. In the EOT2012 archive there are a total of 251 unique TLDs. This is a modest increase of 4.15% from EOT2008 to EOT2012.
The EOT2008 and EOT2012 archives share 225 TLDs between the two archives. There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.
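For anyone who wants to try this kind of comparison, here is a minimal sketch of how the per-TLD counts and the shared/unique sets could be produced. It assumes the original URL has already been pulled out of each CDX line. The function names, the `NO_HOST` label and the sample URLs are made up for illustration and are not the code used for this post.

```python
from collections import Counter
from urllib.parse import urlsplit

# Label used in this sketch for rows where no hostname can be parsed
# (an assumption of the sketch, not something taken from the archives).
NO_HOST = "(no host)"


def tld_of(url):
    """Return the last dot-separated label of the URL's hostname."""
    try:
        host = urlsplit(url).hostname  # already lowercased by urllib
    except ValueError:
        host = None
    if not host:
        return NO_HOST
    return host.rstrip(".").rsplit(".", 1)[-1]


def tld_counts(urls):
    """Count URLs per TLD."""
    return Counter(tld_of(u) for u in urls)


# Made-up sample URLs standing in for the two archives.
eot2008 = tld_counts(
    ["http://www.example.gov/a", "http://example.com/", "dns:example.gov"]
)
eot2012 = tld_counts(["http://www.example.gov/b", "http://example.ly/x"])

shared = set(eot2008) & set(eot2012)
only_2008 = set(eot2008) - set(eot2012)
only_2012 = set(eot2012) - set(eot2008)
```

The keys of each `Counter` are the unique TLDs, so the three set expressions at the end give the shared and archive-only TLDs, and the counter values give the number of URLs from each TLD.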
TLDs unique to EOT2008
|Unique to 2008||URLs from TLD|
TLDs unique to EOT2012
|Unique to 2012||URLs from TLD|
I believe that the “null” TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX file that correspond to metadata records in the WARC/ARC files from 2008. I will have to do some digging to confirm.
Change in TLD
Next up we take a look at the 225 TLDs that are shared between the archives. First are the fifteen most changed, based on the increase or decrease in the number of URLs from each TLD.
|TLD||eot2008||eot2012||Change||Absolute Change||% change|
The change in the first two is interesting. There was an increase of over 37 million URLs (484%) for the com TLD between EOT2008 and EOT2012. There was also a decrease (-21%) of over 28 million URLs for the gov TLD. The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests, with an increase of over 12 million URLs.
You can see that .ly and .me increased by some serious percentage, 529,855% and 522,651% respectively.
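The columns in the table above are simple arithmetic on the two counts. Here is a rough sketch of how such rows could be built, assuming the per-TLD URL counts are already in two dictionaries. The function name and the counts below are invented for illustration and are not the real archive numbers.

```python
def change_rows(counts_a, counts_b):
    """Build (name, a, b, change, abs_change, pct_change) rows for shared keys."""
    rows = []
    for name in set(counts_a) & set(counts_b):
        a, b = counts_a[name], counts_b[name]
        change = b - a
        # Assumes a > 0, which holds when every key was seen at least once.
        pct = change * 100 / a
        rows.append((name, a, b, change, abs(change), pct))
    # Largest absolute change first.
    rows.sort(key=lambda r: r[4], reverse=True)
    return rows


# Made-up counts, not the real archive numbers.
eot2008 = {"com": 1000, "gov": 5000, "org": 200}
eot2012 = {"com": 6000, "gov": 4000, "org": 210}

for row in change_rows(eot2008, eot2012):
    print("|{}||{:,}||{:,}||{:+,}||{:,}||{:+,.0f}%|".format(*row))
```

Sorting on the last element of each row instead of the fifth would rank the same rows by percent change rather than by absolute change.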
Taking a look at just the percent change, here are the five most changed based on that percentage.
|TLD||eot2008||eot2012||Change||Absolute Change||% change|
I have a feeling that the majority of the ly, me, gl, and gd TLD content came in as redirect URLs from link shortening services.
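One way a hunch like this could be checked is to look at the HTTP status column of the CDX rows for those TLDs and see what share are 3xx responses. The sketch below assumes each row has already been reduced to a (tld, status) pair. That input shape, the function name and the sample rows are all assumptions for illustration. I have not run this against the archives.

```python
from collections import defaultdict


def redirect_share(rows):
    """Return {tld: fraction of rows whose HTTP status is 3xx}.

    `rows` is an iterable of (tld, status) pairs, where status is the
    response code as a string (an assumed input shape).
    """
    totals = defaultdict(int)
    redirects = defaultdict(int)
    for tld, status in rows:
        totals[tld] += 1
        if status.startswith("3"):
            redirects[tld] += 1
    return {tld: redirects[tld] / totals[tld] for tld in totals}


# Invented rows for illustration only.
sample = [("ly", "301"), ("ly", "302"), ("ly", "200"), ("gov", "200")]
shares = redirect_share(sample)
```

A share close to 1.0 for a TLD would be consistent with the link shortener explanation, though it would not prove it on its own.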
There are 87,889 unique domain names in the EOT2008 archive. This increases dramatically in the EOT2012 archive to 186,214, which is an increase of about 112% in the number of domain names.
There are 30,066 domain names that are shared between the two archives. There are 57,823 domain names that are unique to the EOT2008 archive and 156,148 domain names that are unique to the EOT2012 archive.
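The shared and unique counts should add up to each archive's total, which makes for a quick sanity check. Here is a small sketch using the domain name counts reported above. The function name is just illustrative.

```python
def overlap_is_consistent(total_a, total_b, shared, only_a, only_b):
    """True when shared + unique counts add up to each archive's total."""
    return shared + only_a == total_a and shared + only_b == total_b


# Domain name counts as reported in this post.
ok = overlap_is_consistent(
    total_a=87_889,
    total_b=186_214,
    shared=30_066,
    only_a=57_823,
    only_b=156_148,
)
print(ok)  # True
```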
Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.
Here is the same kind of table but this time for the EOT2012 dataset.
Those are pretty long tables but I think they start to point at some interesting things from this analysis, namely the domains that were present and harvested in 2008 but weren’t harvested in 2012. In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for “Federal Web” but got captured because of the gov TLD.
In the EOT2012 list you start to see artifacts from an increase in attention to social media site capture for the EOT2012 project. Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, pintrest.com probably came from that increased attention.
Here is a list of the twenty most changed domains from EOT2008 to EOT2012. This number is based on the absolute change in the number of URLs captured for each of the archives.
|Domain||EOT2008||EOT2012||Change||Absolute Change||% Change|
You see big increases in content from EOT2008 to EOT2012 for facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (+584%) and googleusercontent.com (+78,022,750%).
Other notable increases include dvidshub.net, the domain for a site called Defense Video & Imagery Distribution System, which increased by 511,514%, as well as navy.mil (3,739%), osd.mil (1,073%) and af.mil (795%). I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.
Another domain that stands out to me is granicus.com. I was unaware of it, but after a little looking it turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).
.gov and .mil subdomains
The last piece I wanted to look at related to domain names was to see what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls. This time I wanted to look at the subdomains.
I filtered my dataset a bit so that I was only looking at the .mil and .gov content.
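A filter like this can be as simple as checking how each hostname ends. Here is a minimal sketch, assuming that "subdomain" here means the full hostname (for example access.gpo.gov). The function name and sample hostnames are invented for illustration.

```python
def is_gov_or_mil(host):
    """True for hostnames whose top level domain is gov or mil."""
    host = host.lower().rstrip(".")
    return host.endswith((".gov", ".mil"))


# Invented hostnames for illustration.
hosts = [
    "www.example.gov",
    "example.mil",
    "example.com",
    "notgov.example.org",
    "EXAMPLE.GOV.",
]

# Normalize case and trailing dots so duplicates collapse into one entry.
kept = sorted({h.lower().rstrip(".") for h in hosts if is_gov_or_mil(h)})
```

Matching on the leading dot keeps hostnames like notgov.example.org out of the result, and collecting into a set gives the unique hostnames directly.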
In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains. This is an increase of 40% between the two archive projects.
The EOT2008 has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.
Subdomains that had the most content (based on URLs downloaded) and which are only present in EOT2008 are presented below. (Limited to the top 30)
Here is the same sort of data for the EOT2012 dataset
This last table is a little long, but I found the data pretty interesting to look at. The table below shows the biggest change for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I’ve included the top forty entries for that list.
|Subdomain/Domain||EOT2008||EOT2012||Change||Absolute Change||% Change|
I find this table interesting for a number of reasons. First, you see quite a bit more decline than I have seen in my other tables like this. In fact 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.
In looking at the list I can also see the transition of some of the sites within GPO, for example access.gpo.gov going down 90% in captured content, fdsys.gpo.gov going down by 94%, and bensguide.gpo.gov increasing by 1,883%.
I like to think that it helps to justify some of the work that the partners of the End of Term project are committing to the project when you see that there are large numbers of domains and subdomains that existed in 2008 but that weren’t crawled again in 2012 (and we can only assume they weren’t around in 2012).
There are a few more things I want to look at in this work so stay tuned.
If you have questions or comments about this post, please let me know via Twitter.