Comparing Web Archives: EOT2008 and EOT2012 – What disappeared

This is the fourth post in a series that looks at the End of Term Web Archives captured in 2008 and 2012.  In previous posts I’ve looked at the when, what, and where of these archives.  In doing so I pulled together the domain names from each of the archives to compare them.

My thought was that I could look at which domains had content in the EOT2008 or EOT2012 and compare these domains to get some very high level idea of what content was around in 2008 but was completely gone in 2012.  Likewise I could look at new content domains that appeared since 2008.  For this post I’m limiting my view to just the domains that end in .gov or .mil because they are generally the focus of these web archiving projects.

Comparing EOT2008 and EOT2012

There are 1,647 unique domain names in the EOT2008 archive and 1,944 unique domain names in the EOT2012 archive, which is an increase of 18%. Between the two archives there are 1,236 domain names that are common.  There are 411 domains that exist in the EOT2008 that are not present in EOT2012, and 708 new domains in EOT2012 that didn’t exist in EOT2008.
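To make that kind of comparison concrete, here is a minimal sketch of the set comparison, assuming two plain-text files with one domain name per line (the file names are hypothetical, not the actual files used for this analysis):

def load_domains(path):
    """Read one domain name per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

eot2008 = load_domains("eot2008-domains.txt")  # hypothetical file names
eot2012 = load_domains("eot2012-domains.txt")

common = eot2008 & eot2012   # domains present in both archives
gone = eot2008 - eot2012     # present in 2008, absent in 2012
new = eot2012 - eot2008      # present only in 2012

print(len(eot2008), len(eot2012), len(common), len(gone), len(new))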

Domains in EOT2008 and EOT2012

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  The URLs in the 411 domains that are present in EOT2008 but missing from EOT2012 add up to 3,784,308, which is just 2% of the total number of EOT2008 URLs.  The 708 domains that are only present in EOT2012 account for 5,562,840 URLs, or 3% of the EOT2012 archive.

The thirty domains with the most captured URLs that were present in the EOT2008 collection but not in EOT2012 are listed in the table below.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
arpa.gov 15,093
okgeosurvey1.gov 14,595
omhrc.gov 14,594
usafreedomcorp.gov 14,298
uscva.gov 13,627
odci.gov 12,920

The thirty domains with the most URLs from EOT2012 that weren’t present in EOT2008 are listed below.

Domain Count
militaryonesource.mil 859,843
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
saferproducts.gov 65,387
challenge.gov 63,808
healthdata.gov 63,105
marinecadastre.gov 62,882
fatherhood.gov 62,132
edpubs.gov 58,356
transportationresearch.gov 58,235
cbca.gov 56,043
usbonds.gov 55,102
usbond.gov 54,847
phe.gov 53,626
ussavingsbond.gov 53,563
scienceeducation.gov 53,468
mda.gov 53,010

Shared domains that changed

There are 1,236 domains that are present in both the EOT2008 and EOT2012 archives.  I thought it would be interesting to compare those domains and see which ones changed the most.  Below are the fifty shared domains that changed the most between EOT2008 and EOT2012.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
congress.gov 1,554,199 40,338 -1,513,861 1,513,861 -97%
noaa.gov 5,317,872 3,985,633 -1,332,239 1,332,239 -25%
epa.gov 1,628,517 327,810 -1,300,707 1,300,707 -80%
uscourts.gov 1,484,240 184,507 -1,299,733 1,299,733 -88%
dol.gov 1,387,724 88,557 -1,299,167 1,299,167 -94%
census.gov 1,604,505 328,014 -1,276,491 1,276,491 -80%
dot.gov 1,703,935 554,325 -1,149,610 1,149,610 -67%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
doe.gov 1,164,955 268,694 -896,261 896,261 -77%
vaccines.mil 5,665 856,188 850,523 850,523 15,014%
fdlp.gov 991,747 156,499 -835,248 835,248 -84%
uspto.gov 980,215 155,428 -824,787 824,787 -84%
bts.gov 921,756 130,730 -791,026 791,026 -86%
cdc.gov 1,014,213 264,500 -749,713 749,713 -74%
lbl.gov 743,472 4,080 -739,392 739,392 -99%
faa.gov 945,446 206,500 -738,946 738,946 -78%
treas.gov 838,243 99,411 -738,832 738,832 -88%
fema.gov 903,393 172,055 -731,338 731,338 -81%
clinicaltrials.gov 919,490 196,642 -722,848 722,848 -79%
army.mil 2,228,691 2,936,308 707,617 707,617 32%
nsf.gov 760,976 65,880 -695,096 695,096 -91%
prc.gov 740,176 75,682 -664,494 664,494 -90%
doc.gov 823,825 173,538 -650,287 650,287 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
nbii.gov 577,708 391 -577,317 577,317 -100%
defense.gov 687 575,776 575,089 575,089 83,710%
usajobs.gov 3,487 551,217 547,730 547,730 15,708%
sandia.gov 736,032 210,429 -525,603 525,603 -71%
nps.gov 706,323 191,102 -515,221 515,221 -73%
defenselink.mil 502,023 1,868 -500,155 500,155 -100%
fws.gov 625,180 132,402 -492,778 492,778 -79%
ssa.gov 609,784 125,781 -484,003 484,003 -79%
archives.gov 654,689 175,585 -479,104 479,104 -73%
fnal.gov 575,167 1,051,926 476,759 476,759 83%
change.gov 486,798 24,820 -461,978 461,978 -95%
buyusa.gov 490,179 37,053 -453,126 453,126 -92%

Only 11 of the 50 (22%) resulted in more content harvested in EOT2012 than in EOT2008.

Of the eleven domains that had more content harvested for them in EOT2012, five (navy.mil, osd.mil, vaccines.mil, defense.gov, and usajobs.gov) increased by over 1,000% in the amount of content.  I don’t know if this is necessarily the result of an increase in attention to these sites, more content on the sites, or a different organization of the sites that made them easier to harvest.  I suspect it is some combination of all three of those things.

Summary

It should be expected that domains will come into and go out of existence on a regular basis in a web space as large as the federal government.  One of the things that I think is rather challenging is identifying the list of domains that were present at a given time within an organization, for example “what domains did the federal government have in 1998?”  It seems like one way to come up with that answer is to use web archives. Based on the analysis in this post, there are 411 domains that were present in 2008 that we weren’t able to capture in 2012.  Take a look at that list of the top thirty: do you recognize any of them? How many other initiatives, committees, agencies, task forces, and citizen education portals existed at one point that are now gone?

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – Where

This post carries on in the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discuss when content was harvested and what kind of content was harvested and included in the archives.

In this post we will look at where content came from, specifically the data held in the top level domains, domain names and sub-domain names.

Top Level Domains

The first thing to look at is the top level domains for all of the URLs in the CDX files.

In the EOT2008 archive there are a total of 241 unique TLDs.  In the EOT2012 archive there are a total of 251 unique TLDs.  This is a modest increase of 4.15% from EOT2008 to EOT2012.

The EOT2008 and EOT2012 archives share 225 TLDs.  There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.

TLDs unique to EOT2008

Unique to 2008 URLs from TLD
null 18,772
www 583
yu 357
labs 20
webteam 16
cg 10
security 8
ssl 8
b 8
css 7
web 6
dev 4
education 4
misc 2
secure 2
campaigns 2

TLDs unique to EOT2012

Unique to 2012 URLs from TLD
whois 17,500
io 7,935
pn 987
sy 541
lr 478
so 418
nr 363
tf 291
xxx 258
re 186
xn--p1ai 171
bi 153
dm 120
tel 78
ck 65
ax 64
sx 54
tg 50
ki 48
gg 25
kn 25
gp 24
pm 20
fk 18
cf 7
wf 3

I believe that the “null” TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX file that correspond to metadata records in the warc/arcs from 2008.  I will have to do some digging to confirm.

Change in TLD

Next up we take a look at the 225 TLDs that are shared between the archives.  First up are the fifteen TLDs that changed the most based on the increase or decrease in the number of URLs from that TLD.

TLD eot2008 eot2012 Change Absolute Change % change
com 7,809,711 45,594,482 37,784,771 37,784,771 483.8%
gov 137,829,050 109,141,353 -28,687,697 28,687,697 -20.8%
mil 3,555,425 16,223,861 12,668,436 12,668,436 356.3%
net 653,187 9,269,406 8,616,219 8,616,219 1319.1%
edu 3,552,509 2,442,626 -1,109,883 1,109,883 -31.2%
int 135,939 685,168 549,229 549,229 404.0%
uk 70,262 594,020 523,758 523,758 745.4%
ly 95 503,457 503,362 503,362 529854.7%
org 5,108,645 5,588,750 480,105 480,105 9.4%
us 840,516 474,156 -366,360 366,360 -43.6%
co 2,839 211,131 208,292 208,292 7336.8%
be 4,019 203,178 199,159 199,159 4955.4%
jp 23,896 220,602 196,706 196,706 823.2%
me 35 182,963 182,928 182,928 522651.4%
tv 10,373 191,736 181,363 181,363 1748.4%

Interesting is the change in the first two.  There was an increase of over 37 million URLs (484%) for the com TLD between EOT2008 and EOT2012, and a decrease of over 28 million URLs (-21%) for the gov TLD.  The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests, an increase of over 12 million URLs.

You can see that .ly and .me increased by some serious percentage,  529,855% and 522,651% respectively.

Taking a look at just the percent of change, here are the five TLDs that changed the most based on that percentage.

TLD eot2008 eot2012 Change Absolute Change % change
ly 95 503,457 503,362 503,362 529854.7%
me 35 182,963 182,928 182,928 522651.4%
gl 129 49,733 49,604 49,604 38452.7%
gd 9 3,273 3,264 3,264 36266.7%
cat 43 11,703 11,660 11,660 27116.3%

I have a feeling that the majority of the ly, me, gl, and gd TLD content came in as redirect URLs from link-shortening services.

Domain Names

There are 87,889 unique domain names in the EOT2008 archive; this increases dramatically in the EOT2012 archive to 186,214, an increase of roughly 112% in the number of domain names.

There are 30,066 domain names that are shared between the two archives.  There are 57,823 domain names that are unique to the EOT2008 archive and 156,148 domain names that are unique to the EOT2012 archive.

Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
meish.org 147,261
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
secure-banking.com 33,059
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
federaljudgesassoc.org 15,126
arpa.gov 15,093
transportationfortomorrow.org 14,926
okgeosurvey1.gov 14,595

Here is the same kind of table but this time for the EOT2012 dataset.

Domain Count
militaryonesource.mil 859,843
yfrog.com 682,664
staticflickr.com 640,606
akamaihd.net 384,769
4sqi.net 350,707
foursquare.com 340,492
adf.ly 334,767
pinterest.com 244,293
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
t.co 175,033
youtu.be 172,301
sndcdn.com 161,039
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
nrsc.org 85,925
mzstatic.com 84,202
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
mynextmove.org 75,927
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
ntm-a.com 71,126

Those are pretty long tables, but I think they start to point at some interesting things from this analysis, starting with the domains that were present and harvested in 2008 but weren’t harvested in 2012.  In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for the “Federal Web” but got captured because of the gov TLD.

In the EOT2012 list you start to see artifacts from an increase in attention to social media site capture for the EOT2012 project.  Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, and pinterest.com probably came from that increased attention.

Here is a list of the twenty most changed domains from EOT2008 to EOT2012.  This number is based on the absolute change in the number of URLs captured for each of the archives.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
facebook.com 11,895 7,503,640 7,491,745 7,491,745 62,982%
dvidshub.net 1,097 5,612,410 5,611,313 5,611,313 511,514%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
flickr.com 157,155 2,286,890 2,129,735 2,129,735 1,355%
youtube.com 346,272 2,369,108 2,022,836 2,022,836 584%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
granicus.com 782 1,785,724 1,784,942 1,784,942 228,253%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
googleusercontent.com 2 1,560,457 1,560,455 1,560,455 78,022,750%

You see big increases in facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (584%) and googleusercontent.com (78,022,750%) in content from EOT2008 to EOT2012.

Other notable increases include dvidshub.net, the domain for a site called the Defense Video & Imagery Distribution System, which increased by 511,514%, as well as navy.mil (3,739%), osd.mil (1,073%), and af.mil (795%).  I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.

Another domain that stands out to me is granicus.com, which I was unaware of but which, after a little looking, turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).

.gov and .mil subdomains

The last piece I wanted to look at related to domain names was to see what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls.  This time I wanted to look at the subdomains.

I filtered my dataset a bit so that I was only looking at the .mil and .gov content.
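As a rough illustration of that filtering (my assumption about the approach, not the exact script used), the host name can be pulled from the original URL column and kept only when it ends in .gov or .mil:

from urllib.parse import urlsplit

def govmil_host(url):
    """Return the lower-cased host name for .gov and .mil URLs, otherwise None."""
    host = (urlsplit(url).hostname or "").lower()
    return host if host.endswith((".gov", ".mil")) else None

def govmil_subdomains(cdx_path):
    """Collect unique .gov/.mil host names, assuming the original URL is the third CDX column."""
    hosts = set()
    with open(cdx_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parts = line.split(" ")
            if len(parts) > 2:
                host = govmil_host(parts[2])
                if host:
                    hosts.add(host)
    return hosts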

In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains.  This is an increase of 40% between the two archive projects.

The EOT2008 has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.

Subdomains that had the most content (based on URLs downloaded) and which are only present in EOT2008 are presented below.  (Limited to the top 30)

EOT2008 Subdomain Count
gos2.geodata.gov 809,442
boucher.house.gov 772,759
kendrickmeek.house.gov 685,368
citizensbriefingbook.change.gov 446,632
stat-usa.gov 305,936
nifl.gov 285,833
scidac-new.ca.sandia.gov 247,451
tradestatsexpress.gov 243,729
hpcf.nersc.gov 221,626
gopher.info.usaid.gov 219,051
novel.nifl.gov 218,962
dli2.nsf.gov 206,932
contractorsupport.acf.hhs.gov 188,841
pnwin.nbii.gov 188,591
faq.acf.hhs.gov 184,212
ccdf.acf.hhs.gov 182,606
arnet.gov 174,018
regulations.acf.hhs.gov 171,762
acqnet.gov 171,493
dccourts.gov 161,289
employers.acf.hhs.gov 139,141
search.info.usaid.gov 137,816
web-services.gov 137,202
earth2.epa.gov 136,441
cjtf7.army.mil 134,507
ncweb-north.wr.usgs.gov 134,486
opre.acf.hhs.gov 133,689
childsupportenforcement.acf.hhs.gov 132,023
modis-250m.nascom.nasa.gov 128,810
casd.uscourts.gov 124,146

Here is the same sort of data for the EOT2012 dataset

EOT2012 Subdomain Count
militaryonesource.mil 698,035
uscodebeta.house.gov 387,080
democrats.foreignaffairs.house.gov 312,270
gulflink.fhpr.osd.mil 262,246
coons.senate.gov 257,721
democrats.energycommerce.house.gov 243,341
consumerfinance.gov 225,815
dcmo.defense.gov 217,255
nrd.gov 187,267
wh.gov 179,103
usaxs.xray.aps.anl.gov 178,298
democrats.budget.house.gov 175,109
democrats.edworkforce.house.gov 162,077
apps.militaryonesource.mil 157,144
naturalresources.house.gov 155,918
purl.fdlp.gov 154,718
media.dma.mil 137,581
algreen.house.gov 131,388
democrats.transportation.house.gov 129,345
democrats.naturalresources.house.gov 124,808
hanabusa.house.gov 123,794
pitts.house.gov 122,402
visclosky.house.gov 122,223
garamendi.house.gov 114,221
vault.fbi.gov 113,873
green.house.gov 113,040
sewell.house.gov 112,973
levin.house.gov 111,971
eia.gov 111,889
hahn.house.gov 111,024

This last table is a little long, but I found the data pretty interesting to look at.  The table below shows the biggest changes for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I’ve included the top forty entries for that list.

Subdomain/Domain EOT2008 EOT2012 Change Absolute Change % Change
listserv.access.gpo.gov 2,217,565 7,487 -2,210,078 2,210,078 -100%
carter.house.gov 1,898,462 29,680 -1,868,782 1,868,782 -98%
catalog.gpo.gov 1,868,504 34,040 -1,834,464 1,834,464 -98%
loc.gov 63,534 1,875,264 1,811,730 1,811,730 2,852%
gpo.gov 52,427 1,796,925 1,744,498 1,744,498 3,327%
bensguide.gpo.gov 90,280 1,790,017 1,699,737 1,699,737 1,883%
edocket.access.gpo.gov 1,644,578 7,822 -1,636,756 1,636,756 -100%
nws.noaa.gov 103,367 1,676,264 1,572,897 1,572,897 1,522%
navair.navy.mil 220 1,556,320 1,556,100 1,556,100 707,318%
congress.gov 1,525,467 356 -1,525,111 1,525,111 -100%
cha.house.gov 1,366,520 109,192 -1,257,328 1,257,328 -92%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
dol.gov 1,052,335 41,909 -1,010,426 1,010,426 -96%
resourcescommittee.house.gov 1,008,655 335 -1,008,320 1,008,320 -100%
calvert.house.gov 20,530 1,014,416 993,886 993,886 4,841%
fdlp.gov 989,415 1,554 -987,861 987,861 -100%
lcweb2.loc.gov 466,623 1,451,708 985,085 985,085 211%
cramer.house.gov 1,011,872 60,879 -950,993 950,993 -94%
ed.gov 1,141,069 241,165 -899,904 899,904 -79%
vaccines.mil 5,638 856,113 850,475 850,475 15,085%
clinicaltrials.gov 919,362 193,158 -726,204 726,204 -79%
army.mil 4,831 725,934 721,103 721,103 14,927%
boehner.house.gov 7,472 695,625 688,153 688,153 9,210%
nces.ed.gov 702,644 31,922 -670,722 670,722 -95%
prc.gov 739,849 75,682 -664,167 664,167 -90%
navy.mil 1,481 654,254 652,773 652,773 44,077%
house.gov 818,095 172,066 -646,029 646,029 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
fema.gov 636,005 53,321 -582,684 582,684 -92%
frwebgate.access.gpo.gov 621,361 55,097 -566,264 566,264 -91%
siadapp.dmdc.osd.mil 43 559,076 559,033 559,033 1,300,077%
fdsys.gpo.gov 548,618 28 -548,590 548,590 -100%
tiger.census.gov 549,046 750 -548,296 548,296 -100%
rs6.loc.gov 550,489 6,695 -543,794 543,794 -99%
bennelson.senate.gov 16,203 553,698 537,495 537,495 3,317%
crapo.senate.gov 28,569 540,928 512,359 512,359 1,793%
eia.doe.gov 508,675 1,629 -507,046 507,046 -100%
epa.gov 623,457 117,794 -505,663 505,663 -81%
defenselink.mil 502,006 1,866 -500,140 500,140 -100%
access.gpo.gov 472,373 3,110 -469,263 469,263 -99%

I find this table interesting for a number of reasons.  First, you see quite a bit more decline than I have seen in my other tables like this.  In fact 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.

In looking at the list I can also see the transition of some of the sites within GPO, for example access.gpo.gov going down 99% in captured content, fdsys.gpo.gov going down by nearly 100%, and bensguide.gpo.gov increasing by 1,883%.

Wrapping Up

When you see that there are large numbers of domains and subdomains that existed in 2008 but weren’t crawled again in 2012 (and we can only assume are no longer around), I like to think it helps to justify some of the work that the partners of the End of Term project are committing to the project.

There are a few more things I want to look at in this work so stay tuned.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – What

This post carries on from where the previous post in this series ended.

A very quick recap,  this series is trying to better understand the EOT2008 and the EOT2012 web archives.  The goal is to see how they are similar, how they are different, and if there is anything that can be learned that will help us with the upcoming EOT2016 project.

What

The CDX files we are using have a column that contains the Media Type (MIME Type) for the different URIs in the WARC files.  A list of the assigned Media Types is available from the Internet Assigned Numbers Authority (IANA) in their Media Type Registry.

This is a field that is inherently “dirty” for a few reasons.  The field is populated from a field in the WARC record that comes directly from the web server that responded to the initial request.  Usually these values are fairly accurate, but there are many times when they are either wrong or at least confusing.  Often this is caused by a server administrator, programmer, or system architect who is trying to be clever, or who has simply misconfigured something.

I looked at the Media Types for the two EOT collections to see if there are any major differences between what we collected in the two EOT archives.

In the EOT2008 archive there are a total of 831 unique Mime/Media Types,  in the EOT2012 there are a total of 1,208 unique type values.

I took the top 20 Mime/Media Types for each of the archives and pushed them together to see if there was any noticeable change in what we captured between the two archives.  In addition to just the raw counts I also looked at what percentage of the archive a given Media Type represented.  Finally I noted the overall change in those two percentages.

Media Type 2008 Count % of Archive 2012 Count % of Archive % Change Change in % of Archive
text/html 105,592,852 65.9% 116,238,952 59.9% 10.1% -6.0%
image/jpeg 13,667,545 8.5% 24,339,398 12.5% 78.1% 4.0%
image/gif 13,033,116 8.1% 8,408,906 4.3% -35.5% -3.8%
application/pdf 10,281,663 6.4% 7,097,717 3.7% -31.0% -2.8%
4,494,674 2.8% 613,187 0.3% -86.4% -2.5%
text/plain 3,907,202 2.4% 3,899,652 2.0% -0.2% -0.4%
image/png 2,067,480 1.3% 7,356,407 3.8% 255.8% 2.5%
text/css 841,105 0.5% 1,973,508 1.0% 134.6% 0.5%
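A minimal sketch of how these per-archive counts and percentages might be pulled from the CDX files, assuming the media type is the fourth space-delimited column (the file names here are hypothetical):

import collections

def media_type_counts(cdx_path):
    """Tally the media type column of a CDX file."""
    counts = collections.Counter()
    with open(cdx_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parts = line.split(" ")
            if len(parts) > 3:
                counts[parts[3]] += 1
    return counts

eot2008 = media_type_counts("eot2008.cdx")  # hypothetical file names
eot2012 = media_type_counts("eot2012.cdx")
total_2008, total_2012 = sum(eot2008.values()), sum(eot2012.values())

for mtype, count in eot2008.most_common(20):
    share_2008 = 100.0 * count / total_2008
    share_2012 = 100.0 * eot2012[mtype] / total_2012
    print(mtype, count, eot2012[mtype], f"{share_2012 - share_2008:+.1f}%")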

Because I like pictures here is a chart of the percent change.

Change in Media Type

If we compare the Media Types between the two archives we find that the two archives share 527 Media Types.  The EOT2008 archive has 304 Media Types that aren’t present in EOT2012 and EOT2012 has 681 Media Types that aren’t present in EOT2008.

The ten most frequent Media Types by count found only in the EOT2008 archive are presented below.

Media Type Count
no-type 405,188
text/x-vcal 17,368
.wk1 8,761
x-text/tabular 5,312
application/x-wp 5,158
* 4,318
x-application/pdf 3,660
application/x-gunzip 3,374
image/x-fits 3,340
WINDOWS-1252 2,304

The ten most frequent Media Types by count found only in the EOT2012 archive are presented below.

Media Type Count
warc/revisit 12,190,512
application/http 1,050,895
application/x-mpegURL 23,793
img/jpeg 10,466
audio/x-flac 7,251
application/x-font-ttf 7,015
application/x-font-woff 6,852
application/docx 3,473
font/ttf 3,323
application/calendar 2,419

In the EOT2012 archive the team that captured content had fully moved to the WARC format for storing Web archive content.  The warc/revisit records are records for URLs whose content had not changed across more than one crawl.  Instead of storing the content again, the warc/revisit record contains a reference to the previously captured content.  That’s why there are so many of these Media Types.

Below is a table showing the thirty most changed Media Types that are present in both the EOT2008 and EOT2012 archives.  You can see both the change in overall numbers as well as the percentage change between the two archives.

Media Type EOT2008 EOT2012 Change % Change
image/jpeg 13,667,545 24,339,398 10,671,853 78.1%
text/html 105,592,852 116,238,952 10,646,100 10.1%
image/png 2,067,480 7,356,407 5,288,927 255.8%
image/gif 13,033,116 8,408,906 -4,624,210 -35.5%
4,494,674 613,187 -3,881,487 -86.4%
application/pdf 10,281,663 7,097,717 -3,183,946 -31.0%
application/javascript 39,019 1,511,594 1,472,575 3774.0%
text/css 841,105 1,973,508 1,132,403 134.6%
text/xml 344,748 1,433,159 1,088,411 315.7%
unk 4,326 818,619 814,293 18823.2%
application/rss+xml 64,280 731,253 666,973 1037.6%
application/x-javascript 622,958 1,232,306 609,348 97.8%
application/vnd.ms-excel 734,077 212,605 -521,472 -71.0%
text/javascript 69,340 481,701 412,361 594.7%
video/x-ms-asf 26,978 372,565 345,587 1281.0%
application/msword 563,161 236,716 -326,445 -58.0%
application/x-shockwave-flash 192,018 479,011 286,993 149.5%
application/octet-stream 419,187 191,421 -227,766 -54.3%
application/zip 312,872 92,318 -220,554 -70.5%
application/json 1,268 217,742 216,474 17072.1%
video/x-flv 1,448 180,222 178,774 12346.3%
image/jpg 26,421 172,863 146,442 554.3%
application/postscript 181,795 39,832 -141,963 -78.1%
image/x-icon 45,294 164,673 119,379 263.6%
chemical/x-mopac-input 110,324 1,035 -109,289 -99.1%
application/atom+xml 165,821 269,219 103,398 62.4%
application/xml 145,141 246,857 101,716 70.1%
application/x-cgi 100,813 51 -100,762 -99.9%
audio/mpeg 95,613 179,045 83,432 87.3%
video/mp4 1,887 73,475 71,588 3793.7%

Presented as a set of graphs,  first showing the change in number of instances of a given Media Type between the two archives.

30 Media Types that changed the most

The second graph is the percentage change between the two archives.

% Change in top 30 media types shared between archives

Things that stand out are the growth of application/javascript between 2008 and 2012, up 3,774%, and application/json, which was up over 17,000%.  Two formats used to deliver video grew as well, with video/x-flv and video/mp4 increasing 12,346% and 3,794% respectively.

There were a number of Media Types that declined in number and percentage, but the drops are not as dramatic as the increases identified above.  Of note is that between 2008 and 2012 there was a decline of nearly 100% in content with a Media Type of application/x-cgi and a 78% decrease in files that were application/postscript.

Working with the Media Types found in large web archives is a bit messy.  While there are standard ways of presenting Media Types to browsers, there are also non-standard, experimental, and inaccurate instances of Media Types in these archives.  It does appear that we can see the introduction of some newer technologies between the two archives, such as the adoption of JSON and JavaScript-based sites as well as new formats of video on the web.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – When

In 2008 a group of institutions comprising the Internet Archive, Library of Congress, California Digital Library, University of North Texas, and Government Publishing Office worked together to collect the web presence of the federal government in a project that has come to be known as the End of Term Presidential Harvest 2008.

Working together this group established the scope of the project, developed a tool to collect nominations of URLs important to the community for harvesting, and carried out a harvest of the federal web presence before the election, after the election, and after the inauguration of President Obama. This collection was harvested by the Internet Archive, Library of Congress, California Digital Library, and the UNT Libraries.  At the end of the EOT project the data harvested was shared between the partners, with several institutions acquiring a copy of the complete EOT dataset for their local collections.

Moving forward four years the same group got together to organize the harvesting of the federal domain in 2012.  While originally scoped as a way of capturing the transition of the executive branch,  this EOT project also served as a way to systematically capture a large portion of the federal web on a four year calendar.  In addition to the 2008 partners,  Harvard joined in the project for 2012.

Again the team worked to identify in-scope content to collect; this time, however, the content included URLs from the social web, including Twitter and Facebook accounts for agencies, offices, and individuals in the federal government.  Because there was no change in administration after the 2012 election, there was just one set of crawls that occurred during the fall of 2012 and the winter of 2013.  Again this content was shared between the project partners interested in acquiring the archives for their own collections.

The End of Term group is a loosely organized group that comes together every four years to conduct the harvesting of the federal web presence. As we ramp up for the end of the Obama administration, the group has started to plan the EOT2016 project with a goal of starting to crawl in September of 2016.  This time there will be a new president, so the crawling will probably take the format of the 2008 crawls, with pre-election, post-election, and post-inauguration sets of crawls.

So far there hasn’t been much in the way of analysis to compare the EOT2008 and EOT2012 web archives.  There are a number of questions that have come up over the years that remain unanswered about the two collections.  This series of posts will hopefully take a stab at answering some of those questions and maybe provide better insight into the makeup of these two collections.  Finally there are hopefully a few things that can be learned from the different approaches used during the creation of these archives that might be helpful as we begin the EOT 2016 crawling.

Working with the EOT Data

The dataset that I am working with for these posts consists of the CDX files created for the EOT2008 and EOT2012 archive.  Each of the CDX files acts as an index to the raw archived content and contains a number of fields that can be useful for analysis.  All of the archive content is referenced in the CDX file.

If you haven’t looked at a CDX file in the past here is an example of a CDX file.

gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:martinelli,%20giovanni&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005312 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AMartinelli%2C+Giovanni&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:schumann-heink,%20ernestine&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005219 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3ASchumann-Heink%2C+Ernestine&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EL5OT5NAXGGV6VADBLNP2CBZSZ5MH6OT - - - 531160983 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:scotti,%20antonio&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005255 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AScotti%2C+Antonio&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 SEFDA5UNFREPA35QNNLI7DPNU3P4WDCO - - - 804325022 LOC-EOT2012-001-20121125003257404-04183-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:viafora,%20gina&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005309 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AViafora%2C+Gina&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EV6N3TMKIVWAHEHF54M2EMWVM5DP7REJ - - - 532966964 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:homer,%20louise&fq[1]=take_composer_name:campana,%20f.%20&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125070122 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AHomer%2C+Louise&fq%5B1%5D=take_composer_name%3ACampana%2C+F.+&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 FW2IGVNKIQGBUQILQGZFLXNEHL634OI6 - - - 661008391 LOC-EOT2012-001-20121125064213479-04227-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz

The CDX format is a space-delimited file with the following fields (a small parsing sketch follows the list):

  • SURT formatted URI
  • Capture Time
  • Original URI
  • MIME Type
  • Response Code
  • Content Hash (SHA1)
  • Redirect URL
  • Meta tags (not populated)
  • Compressed length (sometimes populated)
  • Offset in WARC file
  • WARC File Name
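Here is a small parsing sketch based on the field order above; treat it as an assumption about the layout rather than a definitive parser, since CDX files can vary in which columns they include.

CDX_FIELDS = [
    "surt", "timestamp", "original_url", "mime_type", "response_code",
    "digest", "redirect", "meta_tags", "length", "offset", "warc_file",
]

def parse_cdx_line(line):
    """Split one space-delimited CDX line into a dict keyed by the field names above."""
    parts = line.rstrip("\n").split(" ")
    return dict(zip(CDX_FIELDS, parts))

# Example: record = parse_cdx_line(raw_line); record["mime_type"] -> "text/html"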

The tools I’m working with to analyze the EOT datasets will consist of Python scripts that either extract specific data from the CDX files where it can be further sorted and counted, or they will be scripts that work on these sorted and counted versions of files.

I’m trying to post code and derived datasets in a Github repository called eot-cdx-analysis if you are interested in taking a look.  There is also a link to the original CDX datasets there as well.

How much

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  Unfortunately the CDX files that we are working with don’t have consistent size information that we can use for analysis but the rough sizes for each of the archives is EOT2008 at 16TB and EOT2012 at just over 41.6TB.

When

The first dimension I wanted to look at was when was the content harvested for each of the EOT rounds.  In both cases we all remember starting the harvesting “sometime in September” and then ending the crawls “sometime in March” of the following year.  How close were we to our memory?

For this I extracted the Capture Time field from the CDX file, converted it into a date (yyyy-mm-dd seemed a decent bucket to group into), and then sorted and counted each instance of a date.
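Here is a minimal sketch of that extraction and counting, assuming an uncompressed CDX file where the capture time is the second column:

import collections
import sys

def harvest_dates(cdx_path):
    """Count captures per day by bucketing the 14-digit timestamp into yyyy-mm-dd."""
    counts = collections.Counter()
    with open(cdx_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parts = line.split(" ")
            if len(parts) > 1 and len(parts[1]) >= 8:
                ts = parts[1]  # e.g. 20121125005312
                counts[f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"] += 1
    return counts

if __name__ == "__main__":
    for day, total in sorted(harvest_dates(sys.argv[1]).items()):
        print(day, total)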

EOT2008 Harvest Dates

This first chart shows the harvest dates contained in the EOT2008 CDX files.  Things got kicked off in September 2008 and apparently concluded all the way in October 2009.  There is another blip of activity in May of 2009.  This is probably something to go back and look at to help remember exactly what these two sets of crawling were that happened after March 2009, when we all seem to remember crawling stopping.

EOT2012 Harvest Dates

The EOT2012 crawling started off in mid-September and this time finished up in the first part of March 2013.  There is a more consistent shape to the crawling for this EOT with a pretty consistent set of crawling happening between mid-October and the end of January.

EOT2008 and EOT2012 Harvest Dates Compared

When you overlay the two charts you can see how the two compare.  Obviously the EOT2008 data continues quite a bit further than the EOT2012 but where they overlap you can see that there were different patterns to the collecting.

Closing

This is the first of a few posts related to web archiving and specifically to comparing the EOT2008 and EOT2012 archives.  We are approaching the time to start the EOT2016 crawls and it would be helpful to have more information about what we crawled in the two previous cycles.

In addition to just needing to do this work there will be a presentation on some of these findings as well as other types of analysis at the 2016 Web Archiving and Digital Libraries (WADL) workshop that is happening at the end of JCDL2016 this year in Newark, NJ.

If there are questions you have about the EOT2008 or EOT2012 archives please get in contact with me and we can see if we can answer them.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Fields: Language used in descriptions.

This is the last post in a series of posts related to the Description field found in the Digital Public Library of America.  I’ve been working with a collection of 11,654,800 metadata records for which I’ve created a dataset of 17,884,946 description fields.

This past Christmas I received a copy of Thing Explainer by Randall Munroe.  If you aren’t familiar with this book, Randall uses only the ten hundred most used words (thousand isn’t one of them) to describe very complicated concepts and technologies.

After seeing this book I started to wonder how much of the metadata we create for our digital objects uses just the 1,000 most frequent words.  Frequently used words, as well as less complex words (words with fewer syllables), are often used in the calculation of the reading level of various texts, so that also got me thinking about the reading level required to understand some of our metadata records.

Along that train of thought, one of the things that we hear from aggregations of cultural heritage materials is that K-12 users are a target audience and that many of the resources we digitize are selected with them in mind.  With that being said, how often do we take them into account when we create our descriptive metadata?

When I was indexing the description fields I calculated three metrics related to this (a rough sketch of these calculations follows the list).

  1. What percentage of the tokens are in the 1,000 most frequently used English words
  2. What percentage of the tokens are in the 5,000 most frequently used English words
  3. What percentage of the tokens are words in a standard English dictionary.
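Below is a rough sketch of how those three percentages might be computed for a single description.  The word lists are tiny stand-ins and the tokenization is my assumption, not the exact approach used for the indexing.

import re

def vocab_shares(description, top_1000, top_5000, dictionary):
    """Fraction of word tokens found in each word list."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", description)]
    if not tokens:
        return 0.0, 0.0, 0.0
    def share(words):
        return sum(1 for t in tokens if t in words) / len(tokens)
    return share(top_1000), share(top_5000), share(dictionary)

# Stand-in word lists purely for illustration.
top_1000 = {"a", "corner", "view", "of", "the", "city"}
top_5000 = top_1000 | {"hall"}
dictionary = top_5000 | {"santa", "monica", "photography"}
print(vocab_shares("A corner view of the Santa Monica City Hall.", top_1000, top_5000, dictionary))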

From there I was curious about how the different providers compared to each other.

Average for 1,000, 5,000 and English Dictionary

1,000 most Frequent English Words

The first thing we will look at is the average amount of a description composed of words from the list of the 1,000 most frequently used English words.

Average percentage of description consisting of 1,000 most frequent English words.

For me the providers/hubs that stand out are of course bhl, which makes very little use of the 1,000-word vocabulary, followed by smithsonian, gpo, hathitrust, and uiuc.  On the other end of the scale is virginia, which has an average of 70%.

5,000 most Frequent English Words

Next up is the average percentage of the descriptions that consist of words from the 5,000 most frequently used English words.

Average percentage of description consisting of 5,000 most frequent English words.

This graph ends up looking very much like the 1,000-word graph, just a bit higher percentage-wise.  This is of course because the 5,000-word list includes the 1,000-word list.  You do see a few changes in the ordering though; for example gpo switches places with hathitrust in this graph compared to the 1,000-word graph above.

English Dictionary Words

Next is the average percentage of descriptions that consist of words from a standard English dictionary.  Again this dictionary includes the 1,000- and 5,000-word lists, so the percentages will be even higher.

Average percentage of description consisting of English dictionary words.

You see that the virginia hub has almost 100% of their descriptions consisting of English dictionary words.  The hubs that are the lowest in their use of English words for descriptions are bhl, smithsonian, and nypl.

The graph below has 1,000, 5,000, and English Dictionary words grouped together for each provider/hub so you can see at a glance how they stack up.

1,000, 5,000 most frequent English words and English dictionary words by Provider

Stacked Percent 1,000, 5,000, English Dictionary

Next we will look at the percentages per provider/hub if we group the percentage utilization into 25% buckets.  This gives a more granular view of the data than just the averages presented above.

Percentage of descriptions by provider that use 1,000 most frequent English words.

Percentage of descriptions by provider that use 5,000 most frequent English words.

Percentage of descriptions by provider that use English dictionary words.

Closing

I don’t think it is that much of a stretch to draw parallels between the language used in our descriptions and the intended audience of our metadata records. How often are we writing metadata records for ourselves instead of our users?  A great example that comes to mind is “recto” and “verso”, which we often use instead of “front” and “back” of items. In the dataset I’ve been using there are 56,640 descriptions with the term “verso” and 5,938 with the term “recto”.

I think we should be taking into account our various audiences when we are creating metadata records.  I know this sounds like a very obvious suggestion but I don’t think we really do that when we are creating our descriptive metadata records.  Is there a target reading level for metadata records? Should there be?

Looking at the description fields in the DPLA dataset has been interesting.  The kind of analysis that I’ve done so far can be seen as kind of a distant reading of these fields. Big round numbers that are pretty squishy and only show the general shape of the field.  To dive in and do a close reading of the metadata records is probably needed to better understand what is going on in these records.

Based on my experience of mapping descriptive metadata into the Dublin Core metadata fields, I have a feeling that the description field is generally a dumping ground for information that many of us might not consider “description”.  I sometimes wonder if we would do our users a greater service by adding a true “note” field to our metadata models so that we have a proper location for “notes and other stuff” instead of muddying up a field that should have an obvious purpose.

That’s about it for this work with descriptions,  or at least it is until I find some interest in really diving deeper into the data.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Fields: More statistics (so many graphs)

In the past few posts we looked at the length of the description fields in the DPLA dataset as a whole and at the provider/hub level.

The length of the description field isn’t the only value that was indexed for this work.  In fact I indexed a variety of different values for each of the descriptions in the dataset.

Below are the fields I currently am working with.

Field Indexed Value Example
dpla_id 11fb82a0f458b69cf2e7658d8269f179
id 11fb82a0f458b69cf2e7658d8269f179_01
provider_s usc
desc_order_i 1
description_t A corner view of the Santa Monica City Hall.; Streetscape. Horizontal photography.
desc_length_i 82
tokens_ss “A”, “corner”, “view”, “of”, “the”, “Santa”, “Monica”, “City”, “Hall”, “Streetscape”, “Horizontal”, “photography”
token_count_i 12
average_token_length_f 5.5833335
percent_int_f 0
percent_punct_f 0.048780486
percent_letters_f 0.81707317
percent_printable_f 1
percent_special_char_f 0
token_capitalized_f 0.5833333
token_lowercased_f 0.41666666
percent_1000_f 0.5
non_1000_words_ss “santa”, “monica”, “hall”, “streetscape”, “horizontal”, “photography”
percent_5000_f 0.6666667
non_5000_words_ss “santa”, “monica”, “streetscape”, “horizontal”
percent_en_dict_f 0.8333333
non_english_words_ss “monica”, “streetscape”
percent_stopwords_f 0.25
has_url_b FALSE

This post will try and pull together some of the data from the different fields listed above and present them in a way that we will hopefully be able to use to derive some meaning from.

More Description Length Discussion

In the previous posts I’ve primarily focused on the length of the description fields.  There are two other fields that I’ve indexed that are related to the length of the description fields.  These two fields include the number of tokens in a description and the average token length of fields.

I’ve included those values below, with two mean values: one for all of the descriptions in the dataset (17,884,946 descriptions) and one for just the descriptions that are 1 character in length or more (13,771,105 descriptions).

Field Mean – Total Mean – 1+ length
desc_length_i 83.321 108.211
token_count_i 13.346 17.333
average_token_length_f 3.866 5.020
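As a small sketch of how these length-related values could be derived from a single description (whitespace tokenization is my assumption; the indexing may have split tokens differently):

def length_metrics(description):
    """Return description length, token count, and average token length."""
    tokens = description.split()
    token_count = len(tokens)
    average_token_length = sum(len(t) for t in tokens) / token_count if token_count else 0.0
    return len(description), token_count, average_token_length

print(length_metrics("A corner view of the Santa Monica City Hall."))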

The graphs below are based on the numbers of just descriptions that are 1+ length or more.

This first graph is being reused from a previous post that shows the average length of description by Provider/Hub.  David Rumsey and the Getty are the two that average over 250 characters per description.

Average Description Length by Hub

It shouldn’t surprise you that David Rumsey and the Getty are two of the Providers/Hubs with the highest average token counts, since longer descriptions generally produce more tokens. There are a few that don’t follow this pattern though: USC, which has an average description length of just over 50 characters, comes in as the third highest in average token count at over 40 tokens per description.  A few other providers/hubs also look a bit different than their average description length would suggest.

Average Token Count by Provider

Below is a graph of the average token length by provider.  The lower the number, the lower the average length of a token.  The mean for the entire DPLA dataset for descriptions of length 1+ is just over 5 characters.

Average Token Length by Provider

That’s all I have to say about the various statistics related to length for this post.  I swear! Next we move on to some of the other metrics that I calculated when indexing things.

Other Metrics for the Description Field

Throughout this analysis I had the question of how to take into account the millions of records in the dataset that had no description present.  I couldn’t just throw away that fact, but I didn’t know exactly what to do with those records.  So below I present the means for many of the fields I indexed, both for all of the descriptions and for just the descriptions that are one or more characters in length.  The graphs that follow the table below are all based on the subset of descriptions that are greater than or equal to one character in length.

Field Mean – Total Mean – 1+ length
percent_int_f 12.368% 16.063%
percent_punct_f 4.420% 5.741%
percent_letters_f 50.730% 65.885%
percent_printable_f 76.869% 99.832%
percent_special_char_f 0.129% 0.168%
token_capitalized_f 26.603% 34.550%
token_lowercased_f 32.112% 41.705%
percent_1000_f 19.516% 25.345%
percent_5000_f 31.591% 41.028%
percent_en_dict_f 49.539% 64.338%
percent_stopwords_f 12.749% 16.557%

Stopwords

Stopwords are words that occur very commonly in natural language.  I used a list of 127 stopwords for this work to help understand what percentage of a description (based on tokens) is made up of stopwords.  While stopwords generally carry little meaning for natural language, they are a good indicator of natural language,  so providers/hubs that have a higher percentage of stopwords would probably have more descriptions that resemble natural language.

Percent Stopwords by Provider

Punctuation

I was curious about how much punctuation was present in a description on average.  I used the following characters as my set of “punctuation characters”

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I found the number of characters in a description that were made up of these characters vs other characters and then divided the number of punctuation characters by the total description length to get the percentage of the description that is punctuation.
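Here is a minimal sketch of that calculation, along with the analogous integer and letter percentages discussed below, using the same character sets listed in this post:

import string

PUNCTUATION = set(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
DIGITS = set(string.digits)            # 0123456789
LETTERS = set(string.ascii_letters)    # a-z and A-Z

def char_class_shares(description):
    """Fraction of characters that are punctuation, digits, and ASCII letters."""
    if not description:
        return 0.0, 0.0, 0.0
    total = len(description)
    punct = sum(1 for c in description if c in PUNCTUATION)
    digits = sum(1 for c in description if c in DIGITS)
    letters = sum(1 for c in description if c in LETTERS)
    return punct / total, digits / total, letters / total

print(char_class_shares("A corner view of the Santa Monica City Hall."))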

Percent Punctuation by Provider

Punctuation is common in natural language but it occurs relatively infrequently. For example, that last sentence was eighty characters long and only one of them was punctuation (the period at the end of the sentence). That comes to a percent_punctuation of only 1.25%.  In the graph above you will see that the bhl provider/hub has over 50% of its descriptions with 25-49% punctuation.  That’s very high when compared to the other hubs and to the average of about 5% overall for the DPLA dataset. Digital Commonwealth has a percentage of descriptions that are 50-74% punctuation, which is pretty interesting as well.

Integers

Next up in our list of things to look at is the percentage of the description field that consists of integers.  For review,  integers are digits,  like the following.

0123456789

I used the same process for the percent integer as I did for the percent punctuation mentioned above.

Percent Integer by Provider

You can see that there are several providers/hubs that have quite a high percentage integer for their descriptions.  These providers/hubs are the bhl and the smithsonian.  The smithsonian has over 70% of its descriptions with percent integers of over 70%.

Letters

Once we’ve looked at punctuation and integers, that leaves really just letters of the alphabet to make up the rest of a description field.

That’s exactly what we will look at next. For this I used the following characters to define letters.

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

I didn’t perform any case folding so letters with diacritics wouldn’t be counted as letters in this analysis,  but we will look at those a little bit later.

Percent Letter by Provider

For percent letters you would expect a very high percentage of the descriptions to themselves contain a high percentage of letters.  Generally this appears to be true, but there are some odd providers/hubs, again mainly bhl and the smithsonian, though nypl, kdl, and gpo also seem to have a different distribution of letters than others in the dataset.

Special Characters

The next thing to look at is the percentage of “special characters” used in a description.  For this I used the following definition: if a character is not present in the following list of characters (which also includes whitespace characters), then it is considered to be a “special character”.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~  
Percent Special Character by Provider

A note on reading the graph above: keep in mind that the y-axis only covers 95-100%, so while USC looks different here, only 3% of its descriptions have 50-100% of the description made up of special characters.  These are most likely a set of descriptions with metadata created in a non-English language.

URLs

The final graph I want to look at in this post is the percentage of descriptions for a provider/hub that has a URL present in its description.  I used the presence of either http:// or https:// in the description to define if it does or doesn’t have a URL present.

Percent URL by Provider

The majority of providers/hubs don’t have URLs in their descriptions, with a few obvious exceptions.  The providers/hubs washington, mwdl, harvard, gpo, and david_rumsey do have a reasonable number of descriptions with URLs, with washington leading at almost 20% of its descriptions having a URL present.

Again this analysis is just looking at what high-level information about the descriptions can tell us.  The only metric we’ve looked at that actually goes into the content of the description field to pull out a little bit of meaning is the percent stopwords.  I have one more post in this series before we wrap things up, and then we will leave descriptions in the DPLA alone for a bit.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Descriptive Metadata Lengths: By Provider/Hub

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum description length, the max length, the number of descriptions (remember values can be multi-valued, so there are more descriptions than records for a provider/hub), the sum (all of the lengths added together), the mean of the lengths, and finally the standard deviation.

provider min max count sum mean stddev
artstor 0 6,868 128,922 9,413,898 73.02 178.31
bhl 0 100 123,472 775,600 6.28 8.48
cdl 0 6,714 563,964 65,221,428 115.65 211.47
david_rumsey 0 5,269 166,313 74,401,401 447.36 861.92
digital-commonwealth 0 23,455 455,387 40,724,507 89.43 214.09
digitalnc 1 9,785 241,275 45,759,118 189.66 262.89
esdn 0 9,136 197,396 23,620,299 119.66 170.67
georgia 0 12,546 875,158 135,691,768 155.05 210.85
getty 0 2,699 264,268 80,243,547 303.64 273.36
gpo 0 1,969 690,353 33,007,265 47.81 58.20
harvard 0 2,277 23,646 2,424,583 102.54 194.02
hathitrust 0 7,276 4,080,049 174,039,559 42.66 88.03
indiana 0 4,477 73,385 6,893,350 93.93 189.30
internet_archive 0 7,685 523,530 41,713,913 79.68 174.94
kdl 0 974 144,202 390,829 2.71 24.95
mdl 0 40,598 483,086 105,858,580 219.13 345.47
missouri-hub 0 130,592 169,378 35,593,253 210.14 2325.08
mwdl 0 126,427 1,195,928 174,126,243 145.60 905.51
nara 0 2,000 700,948 1,425,165 2.03 28.13
nypl 0 2,633 1,170,357 48,750,103 41.65 161.88
scdl 0 3,362 159,681 18,422,935 115.37 164.74
smithsonian 0 6,076 2,808,334 139,062,761 49.52 137.37
the_portal_to_texas_history 0 5,066 1,271,503 132,235,329 104.00 95.95
tn 0 46,312 151,334 30,513,013 201.63 248.79
uiuc 0 4,942 63,412 3,782,743 59.65 172.44
undefined_provider 0 469 11,436 2,373 0.21 6.09
usc 0 29,861 1,076,031 60,538,490 56.26 193.20
virginia 0 268 30,174 301,042 9.98 17.91
washington 0 1,000 42,024 5,258,527 125.13 177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.
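For reference, here is a hedged sketch of how a summary table like the one above could be produced with pandas, assuming a CSV with one row per description and hypothetical provider and desc_length columns:

import pandas as pd

# Hypothetical input: one row per description with its provider/hub and length.
lengths = pd.read_csv("description_lengths.csv")  # assumed columns: provider, desc_length

summary = (
    lengths.groupby("provider")["desc_length"]
    .agg(["min", "max", "count", "sum", "mean", "std"])
    .sort_values("mean", ascending=False)
)
print(summary)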

Average Description Length

The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

For me I see that there are several bars that are very small on this graph, specifically for the providers bhl, kdl, nara, undefined_provider, and virginia.  I also noticed that david_rumsey has the highest average description length at roughly 450 characters.  Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn, which are at about 200 characters on average.

One thing to keep in mind from the previous post is that the average description length for the whole DPLA was 83.32 characters, so many of the hubs are over that, and some significantly over.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to take a look at the standard deviation in addition to just the average; that way you get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

There are a few providers/hubs that I think stand out from the others by looking at the chart. First david_rumsey has a stddev just short of double its average length.  The mwdl and the missouri-hub have a very high stddev compared to their average. For this dataset, it appears that these partners have a huge range in their lengths of descriptions compared to others.

There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

Longest Description by Hub

Remember the chart before this one that showed the average and the stddev next to each other for each provider/hub, where we saw a pretty large stddev for missouri-hub and mwdl? You can see why that is in the chart above.  Both of these hubs have descriptions of over 120,000 characters.

There are six providers/hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc.  I could be wrong, but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full text of the resource making its way into the metadata record.  We should remember, “metadata is data about data”… not the actual data.

Total Description Length of Descriptions by Provider/Hub

Total Description Length of All Descriptions by Hub

Just for fun I was curious how the total length of the description fields per provider/hub would look on a graph; those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content; in fact the most is contributed by mwdl.  If you look into the sourcing of these records you can understand why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries using formats like Dublin Core have. The provider/hub mwdl is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history, which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try to determine, on a provider/hub basis, what “too long” might mean for a given provider using some method of outlier detection.  I’ve done the work for this but don’t know enough about the mathy parts to know whether it is relevant to this dataset or not.
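For what it’s worth, here is a rough sketch of one such method, a Tukey-style IQR fence over a single provider’s description lengths; the example numbers are made up and the 1.5 multiplier is just the textbook default.

def iqr_upper_fence(lengths, k=1.5):
    """Return an upper fence (Q3 + k * IQR) for a list of description lengths.

    Descriptions longer than the fence would be flagged as "too long" for
    that provider/hub.  This is only one possible notion of an outlier.
    """
    values = sorted(lengths)

    def quantile(q):
        # simple linear interpolation between closest ranks
        idx = q * (len(values) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(values) - 1)
        frac = idx - lo
        return values[lo] * (1 - frac) + values[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    return q3 + k * (q3 - q1)

# made-up lengths for a single provider/hub
lengths = [0, 12, 45, 80, 95, 110, 130, 5000]
fence = iqr_upper_fence(lengths)
too_long = [length for length in lengths if length > fence]
print(fence, too_long)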

I have about a dozen more metrics that I want to look at for these records, so I’m going to have to figure out a way to move through them a bit quicker; otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Field Analysis: Yes there really are 44 “page” long description fields.

In my previous post I mentioned that I was starting to take a look at the descriptive metadata fields in the metadata collected and hosted by the Digital Public Library of America.  That last post focused on records, how many records had description fields present, and how many were missing.  I also broke those numbers into the Provider/Hub groupings present in the DPLA dataset to see if there were any patterns.

Moving on, the next thing I wanted to start looking at was data related to each instance of the description field.  I parsed each of the description fields, calculated a variety of statistics for each one, and then loaded that into my current data analysis tool, Solr, which acts as my data store and my full-text index.
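As a rough sketch of what that load step might have looked like — assuming the line-delimited JSON produced by the jq command in the previous post, a local Solr core I’m calling dpla_descriptions, and the desc_length_i field described below; the names and batch size are mine, not necessarily what was actually used:

import json
import requests

SOLR_URL = "http://localhost:8983/solr/dpla_descriptions"   # assumed core name

def docs_from_line(line):
    """Turn one line of the jq output into one Solr doc per description instance."""
    record = json.loads(line)
    descriptions = record.get("descriptions") or [""]        # no description counts as length 0
    for i, description in enumerate(descriptions):
        text = description or ""
        yield {
            "id": "%s_%s" % (record["id"], i),
            "provider_s": record["provider"],
            "description_t": text,
            "desc_length_i": len(text),                      # character count used below
        }

batch = []
with open("descriptions.jsonl") as handle:                   # hypothetical file of jq output
    for line in handle:
        batch.extend(docs_from_line(line))
        if len(batch) >= 10000:
            requests.post(SOLR_URL + "/update", json=batch).raise_for_status()
            batch = []
if batch:
    requests.post(SOLR_URL + "/update", json=batch).raise_for_status()
requests.get(SOLR_URL + "/update", params={"commit": "true"}).raise_for_status()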

After about seven hours of processing I ended up with 17,884,946 description fields from the 11,654,800 records in the dataset.  You will notice that we have more descriptions than we do records, this is because a record can have more than one instance of a description field.

Lets take a look at a few of the high-level metrics.

Cardinality

I first wanted to find out the cardinality of the lengths of the description fields.  When I indexed each of the descriptions, I counted the number of characters in the description and saved that as an integer in a field called desc_length_i in the Solr index.  Once it was indexed, it was easy to retrieve the number of unique length values present.  There are 5,287 unique description lengths in the 17,884,946 descriptions that we are analyzing.  This isn’t too surprising or meaningful by itself, just a bit of description of the dataset.
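One way to pull that distinct count out of Solr is the unique() aggregation in the JSON Facet API; a minimal sketch, reusing the assumed core and field names from above:

import json
import requests

SOLR_URL = "http://localhost:8983/solr/dpla_descriptions"   # assumed core name

params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "json.facet": json.dumps({"unique_lengths": "unique(desc_length_i)"}),
}
response = requests.get(SOLR_URL + "/select", params=params).json()
print(response["facets"]["unique_lengths"])   # number of distinct description lengths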

I tried to make a few graphs to show the lengths and how many descriptions had what length.  Here is what I came up with.

Length of Descriptions in dataset

You can just barely see the blue line; the problem is that there are over 4 million zero-length descriptions while the longer lengths are just single instances.

Here is a second try using a log scale for the x axis

Length of Descriptions in dataset (x axis log)

This reads a little better, I think; you can see that there is a dive down from the zero lengths and then, at about 10 characters, a spike back up.

One more graph to see what we can see,  this time a log-log plot of the data.

Length of Descriptions in dataset (log-log)

Average Description Lengths

Now that we are finished with the cardinality of the lengths, next up is figuring out the average description length for the entire dataset.  This time the Solr StatsComponent is used, which makes getting these statistics a breeze.  Here is a small table showing the output from Solr.

min max count missing sum sumOfSquares mean stddev
0 130,592 17,884,946 0 1,490,191,622 2,621,904,732,670 83.32 373.71

Here we see that the minimum length for a description is zero characters (a record without a description present has a length of zero for that field in this model).  The longest description in the dataset is 130,592 characters long.  The total number of characters present in the dataset is nearly one and a half billion.  Finally, the number we were after: the average length of a description turns out to be 83.32 characters.
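The table above comes from a single StatsComponent request; roughly something like this, again with the assumed core and field names:

import requests

SOLR_URL = "http://localhost:8983/solr/dpla_descriptions"   # assumed core name

params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "stats": "true",
    "stats.field": "desc_length_i",
}
response = requests.get(SOLR_URL + "/select", params=params).json()
stats = response["stats"]["stats_fields"]["desc_length_i"]

# the same columns shown in the table above
for key in ("min", "max", "count", "missing", "sum", "sumOfSquares", "mean", "stddev"):
    print(key, stats[key])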

For those that might be curious what 84 characters (I rounded up instead of down) of description looks like,  here is an example.

Aerial photograph of area near Los Angeles Memorial Coliseum, Los Angeles, CA, 1963.

So not a horrible looking length for a description.  It feels like it is just about one sentence long with 13 “words” in this sentence.

Long descriptions

Jumping back a bit to look at the length of the longest description field: that description is 130,592 characters long.  If you assume that the average single-spaced page is 3,000 characters long, this description field is 43.5 pages long.  The reader of this post that has spent time with aggregated metadata will probably say “looks like someone put the full text of the item into the record”.  If you’ve spent some serious (or maybe not that serious) time in the metadata mines (trenches?) you would probably mumble something like “ContentDM grumble grumble” and you would be right on both counts.  Here is the record on the DPLA site with the 130,592 character long description – http://dp.la/item/40a4f5069e6bf02c3faa5a445656ea61

The next thing I was curious about was the number of descriptions that were “long”.  To answer this I am going to take a little bit of back-of-the-envelope freedom in deciding what “long” is for a description field in a metadata record.  (In future blog posts I might be able to answer this with different analysis of the data, but this will hopefully do for today.)  For now I’m going to arbitrarily decide that anything over 325 characters in length is “too long”.

Descriptions: Too Long and Not Too Long

Looking at that pie chart, 5.8% of the descriptions are “too long” based on my ad-hoc metric from above.  This 5.8% of the descriptions makes up 708,050,671 characters, or 48% of the 1,490,191,622 characters in the entire dataset.  I bet if you looked a little harder you would find that the description field gets very close to the 80/20 rule, with 20% of the descriptions accounting for 80% of the overall description length.
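Those two percentages fall out of a pair of StatsComponent requests, one over everything and one filtered to the “too long” descriptions, with “over 325 characters” expressed as the range [326 TO *].  A sketch with the same assumed names:

import requests

SOLR_URL = "http://localhost:8983/solr/dpla_descriptions"   # assumed core name

def length_stats(fq=None):
    """Run a StatsComponent query over desc_length_i, optionally filtered."""
    params = {"q": "*:*", "rows": 0, "wt": "json",
              "stats": "true", "stats.field": "desc_length_i"}
    if fq:
        params["fq"] = fq
    response = requests.get(SOLR_URL + "/select", params=params).json()
    return response["stats"]["stats_fields"]["desc_length_i"]

everything = length_stats()
too_long = length_stats("desc_length_i:[326 TO *]")   # "over 325 characters"

print("%.1f%% of descriptions are too long" %
      (100.0 * too_long["count"] / everything["count"]))
print("%.1f%% of all description characters" %
      (100.0 * too_long["sum"] / everything["sum"]))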

Short descriptions

Now that we’ve worked with long descriptions, the next thing we should look at is the number of descriptions that are “short”.

There are 4,113,841 records that don’t have a description in the DPLA dataset.  This means that for this analysis 4,113,841 (23%) of the descriptions have a length of 0.  There are 2,041,527 (11%) descriptions that are between 1 and 10 characters long. Below is the breakdown of these ten counts; you can see that there is a surprising number (777,887) of descriptions that have a single character as their descriptive contribution to the dataset.

Descriptions 10 characters or less

There is also an interesting spike at ten characters in length where suddenly we jump to over 500,000 descriptions in the DPLA.
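The per-length counts behind this chart can be pulled with a handful of facet queries, one for each length from 0 through 10; a sketch with the same assumed names:

import requests

SOLR_URL = "http://localhost:8983/solr/dpla_descriptions"   # assumed core name

params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "facet": "true",
    "facet.query": ["desc_length_i:%d" % n for n in range(0, 11)],
}
response = requests.get(SOLR_URL + "/select", params=params).json()

# facet_queries is keyed by the query string, e.g. "desc_length_i:1"
for query, count in response["facet_counts"]["facet_queries"].items():
    print(query, count)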

So what?

Now we have the average length of a description in the DPLA dataset, the number of descriptions that we consider “long”, and the number that we consider “short”.  I think the very next question that gets asked is “so what?”

I think there are four big reasons that I’m working on this kind of project with the DPLA data.

One is that the DPLA is the largest aggregation of descriptive metadata in the US for digital resources in cultural heritage institutions. This is important because you get to take a look at a wide variety of data input rules, practices, and conversions from local systems to an aggregated metadata system.

Secondly, this data is licensed CC0 and available in a bulk data format, so it is easy to grab the data and start working with it.

Thirdly, there haven’t been that many studies on descriptive metadata like this that I’m aware of. OCLC publishes analysis of their MARC catalog data from time to time, and the research that was happening at the UIUC GSLIS with IMLS-funded metadata isn’t going on anymore (great work to look at, by the way), so there really aren’t that many discussions about using large-scale aggregations of metadata to understand the practices in place in cultural heritage institutions across the US.  I am pretty sure that there is work being carried out across the Atlantic with the Europeana datasets that are available.

Finally I think that this work can lead to metadata quality assurance practices and indicators for metadata creators and aggregators about what may be wrong with their metadata (a message saying “your description is over a page long, what’s up with that?”).

I don’t think there are many answers so far in this work, but I feel that these explorations are moving us in the direction of a better understanding of our descriptive metadata world in the context of these large aggregations of metadata.

If you have questions or comments about this post,  please let me know via Twitter.

Beginning to look at the description field in the DPLA

Last year I took a look at the subject field and the date fields in the Digital Public Library of America (DPLA).  This time around I wanted to begin looking at the description field and see what I could see.

Before diving into the analysis, I think it is important to take a look at a few things.  First off, when you reference the DPLA Metadata Application Profile v4, you may notice that the description field is not a required field; in fact the field doesn’t show up in APPENDIX B: REQUIRED, REQUIRED IF AVAILABLE, AND RECOMMENDED PROPERTIES.  From that you can assume that this field is very optional.  Also, the description field, when present, is often used to communicate a variety of information to the user.  The DPLA data has examples that are clearly rights statements, notes, physical descriptions of the item, content descriptions of the item, and in some instances a place to store identifiers or names. Of all of the fields that one will come into contact with in the DPLA dataset, I would imagine that the description field is probably one of the ones with the highest variability of content.  So with that giant caveat, let’s get started.

So on to the data.

The DPLA makes available a data dump of the metadata in their system.  Last year I was analyzing just over 8 million records; this year the collection has grown to more than 11 million records (11,654,800 in the dataset I’m using).

The first thing that I had to accomplish was to pull out just the descriptions from the full json dataset that I downloaded.  I was interested in three values for each record, specifically the Provider or “Hub”, the DPLA identifier for the item and finally the description fields.  I finally took the time to look at jq, which made this pretty easy.

For those that are interested here is what I came up with to extract the data I wanted.

zcat all.json.gz | jq -n --stream --compact-output '. | fromstream(1|truncate_stream(inputs)) | {"provider": (._source.provider["@id"]), "id": (._source.id), "descriptions": ._source.sourceResource.description?}'

This results in output that looks like this.

{"provider":"http://dp.la/api/contributor/cdl","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}
{"provider":"http://dp.la/api/contributor/cdl","id":"bca3f20535ed74edb20df6c738184a84","descriptions":["Lang: Maire, graveur."]}
{"provider":"http://dp.la/api/contributor/cdl","id":"76ceb3f9105098f69809b47aacd4e4e0","descriptions":null}
{"provider":"http://dp.la/api/contributor/cdl","id":"88c69f6d29b5dd37e912f7f0660c67c6","descriptions":null}

From there my plan was to write some short python scripts that can read a line, convert it from json into a python object and then do programmy stuff with it.
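Here is a minimal sketch of that kind of script: it reads the line-delimited output of the jq command above and tallies, per provider, how many records have at least one description and how many have none (the file name is just an example).

import json
from collections import defaultdict

counts = defaultdict(lambda: {"records": 0, "with_desc": 0, "without_desc": 0})

with open("descriptions.jsonl") as handle:               # output of the jq command above
    for line in handle:
        record = json.loads(line)
        # e.g. "http://dp.la/api/contributor/cdl" -> "cdl"
        provider = record["provider"].rsplit("/", 1)[-1]
        descriptions = record["descriptions"] or []
        counts[provider]["records"] += 1
        if descriptions:
            counts[provider]["with_desc"] += 1
        else:
            counts[provider]["without_desc"] += 1

for provider, c in sorted(counts.items()):
    pct = 100.0 * c["with_desc"] / c["records"]
    print("%s\t%d\t%d\t%.2f%%" % (provider, c["records"], c["with_desc"], pct))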

Who has what?

After parsing the data a bit I wanted to remind myself of the spread of the data in the DPLA collection.  There is a page on the DPLA’s site http://dp.la/partners/ that shows you how many records have been contributed by which Hub in the network.  This is helpful but I wanted to draw a bar graph to give a visual representation of this data.

DPLA Partner Records

As has been the case since it was added, Hathitrust is the biggest provider of records to the DPLA with over 2.4 million records.  Pretty amazing!

There are three other Hubs/Providers that contribute over 1 million records each: the Smithsonian, New York Public Library, and the University of Southern California Libraries. Down from there, three more contribute over half a million records: Mountain West Digital Library, National Archives and Records Administration (NARA), and The Portal to Texas History.

There are 11,422 records (coded as undefined_provider) that are not currently associated with a Hub/Provider, probably a data conversion error somewhere in the record ingest pipeline.

Which have descriptions

After that reminder about the size and shape of the Hubs/Providers in the DPLA dataset, we can dive right into the data and quickly see how well the description field is represented.

We can start off with another graph.

Percent of Hubs/Providers with and without descriptions

You can see that some of the Hubs/Providers have very few records (< 2%) with descriptions (Kentucky Digital Library, NARA) while others have a very high percentage (> 95%) of records with description fields present (David Rumsey, Digital Commonwealth, Digital Library of Georgia, J. Paul Getty Trust, Government Publishing Office, The Portal to Texas History, Tennessee Digital Library, and the University of Illinois at Urbana-Champaign).

Below is a full breakdown for each Hub/Provider showing how many and what percentage of the records have zero descriptions, or one or more descriptions.

Provider Records 0 Descriptions 1+ Descriptions 0 Descriptions % 1+ Descriptions %
artstor 107,665 40,851 66,814 37.94% 62.06%
bhl 123,472 64,928 58,544 52.59% 47.41%
cdl 312,573 80,450 232,123 25.74% 74.26%
david_rumsey 65,244 168 65,076 0.26% 99.74%
digital-commonwealth 222,102 8,932 213,170 4.02% 95.98%
digitalnc 281,087 70,583 210,504 25.11% 74.89%
esdn 197,396 48,660 148,736 24.65% 75.35%
georgia 373,083 9,344 363,739 2.50% 97.50%
getty 95,908 229 95,679 0.24% 99.76%
gpo 158,228 207 158,021 0.13% 99.87%
harvard 14,112 3,106 11,006 22.01% 77.99%
hathitrust 2,474,530 1,068,159 1,406,371 43.17% 56.83%
indiana 62,695 18,819 43,876 30.02% 69.98%
internet_archive 212,902 40,877 172,025 19.20% 80.80%
kdl 144,202 142,268 1,934 98.66% 1.34%
mdl 483,086 44,989 438,097 9.31% 90.69%
missouri-hub 144,424 17,808 126,616 12.33% 87.67%
mwdl 932,808 57,899 874,909 6.21% 93.79%
nara 700,948 692,759 8,189 98.83% 1.17%
nypl 1,170,436 775,361 395,075 66.25% 33.75%
scdl 159,092 33,036 126,056 20.77% 79.23%
smithsonian 1,250,705 68,871 1,181,834 5.51% 94.49%
the_portal_to_texas_history 649,276 125 649,151 0.02% 99.98%
tn 151,334 2,463 148,871 1.63% 98.37%
uiuc 18,231 127 18,104 0.70% 99.30%
undefined_provider 11,422 11,410 12 99.89% 0.11%
usc 1,065,641 852,076 213,565 79.96% 20.04%
virginia 30,174 21,081 9,093 69.86% 30.14%
washington 42,024 8,838 33,186 21.03% 78.97%

With so many of the Hubs/Providers having a high percentage of records with descriptions, I was curious about the overall numbers for the DPLA as a whole.  Below is a pie chart that shows what I found.

DPLA records with and without descriptions

Almost 2/3 of the records in the DPLA have at least one description field.  This is more than I would have expected for an un-required, un-recommended field, but I think it is probably a good thing.

Descriptions per record

The final thing I wanted to look at in this post was the average number of description fields for each of the Hubs/Providers.  This time we will start off with the data table below.

Provider Records min median max mean stddev
artstor 107,665 0 1 5 0.82 0.84
bhl 123,472 0 0 1 0.47 0.50
cdl 312,573 0 1 10 1.55 1.46
david_rumsey 65,244 0 3 4 2.55 0.80
digital-commonwealth 222,102 0 2 17 2.01 1.15
digitalnc 281,087 0 1 19 0.86 0.67
esdn 197,396 0 1 1 0.75 0.43
georgia 373,083 0 2 98 2.32 1.56
getty 95,908 0 2 25 2.75 2.59
gpo 158,228 0 4 65 4.37 2.53
harvard 14,112 0 1 11 1.46 1.24
hathitrust 2,474,530 0 1 77 1.22 1.57
indiana 62,695 0 1 98 0.91 1.21
internet_archive 212,902 0 2 35 2.27 2.29
kdl 144,202 0 0 1 0.01 0.12
mdl 483,086 0 1 1 0.91 0.29
missouri-hub 144,424 0 1 16 1.05 0.70
mwdl 932,808 0 1 15 1.22 0.86
nara 700,948 0 0 1 0.01 0.11
nypl 1,170,436 0 0 2 0.34 0.47
scdl 159,092 0 1 16 0.80 0.41
smithsonian 1,250,705 0 2 179 2.19 1.94
the_portal_to_texas_history 649,276 0 2 3 1.96 0.20
tn 151,334 0 1 1 0.98 0.13
uiuc 18,231 0 3 25 3.47 2.13
undefined_provider 11,422 0 0 4 0.00 0.08
usc 1,065,641 0 0 6 0.21 0.43
virginia 30,174 0 0 1 0.30 0.46
washington 42,024 0 1 1 0.79 0.41

This time with an image

Average number of descriptions per record

You can see that there are several Hubs/Providers that have multiple descriptions per record, with the Government Publishing Office coming in at 4.37 descriptions per record.

I found it interesting that when you exclude the two Hubs/Providers that don’t really do descriptions (KDL and NARA), there are two that have a very low standard deviation from their mean: Tennessee Digital Library at 0.13 and The Portal to Texas History at 0.20.  They don’t drift much from their almost one description per record for Tennessee and almost two descriptions per record for Texas. It makes me think that this is probably a set of records that each of those Hubs/Providers would like to have identified so they could go in and add a few descriptions.

Closing

Well, that wraps up this post, which I hope is the first in a series of posts about the description field in the DPLA dataset.  In subsequent posts we will move away from record-level analysis and get down to the field level to do some analysis of the descriptions themselves.  I have a number of predictions, but I will hold onto those for now.

If you have questions or comments about this post,  please let me know via Twitter.

How many of the EOT2008 PDF files were harvested in EOT2012

In my last post I started looking at some of the data from the End of Term 2012 Web Archive snapshot that we have at the UNT Libraries.  For more information about EOT2012 take a look at that previous post.

EOT2008 PDFs

From the EOT2008 Web archive I had extracted the 4,489,675 unique (by hash) PDF files that were present and carried out a bit of analysis on them as a whole to see if there was anything interesting I could tease out.  I presented the results of that investigation at an IS&T Archiving conference a few years back.  The text from the proceedings for that submission is here and the slides presented are here.

Moving forward several years,  I was curious to see how many of those nearly 4.5 million PDFs were still around in 2012 when we crawled the federal Web again as part of the EOT2012 project.

I used the same hash dataset from the previous post to do this work, which made things very easy.  I first pulled the hash values for the 4,489,675 PDF files from EOT2008.  Next I loaded all of the hash values from the EOT2012 crawls. The final step was to iterate through each of the PDF file hashes and do a lookup to see if that content hash is present in the EOT2012 hash dataset.  Pretty straightforward.
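In Python this boils down to a set-membership test; a minimal sketch, assuming one hash per line in each input file (the file names are made up):

# Load every content hash seen in the EOT2012 crawls into a set,
# then check each EOT2008 PDF hash against it.
with open("eot2012_hashes.txt") as handle:
    eot2012_hashes = {line.strip() for line in handle}

found = missing = 0
with open("eot2008_pdf_hashes.txt") as handle:
    for line in handle:
        if line.strip() in eot2012_hashes:
            found += 1
        else:
            missing += 1

total = found + missing
print("Found:   %d (%.0f%%)" % (found, 100.0 * found / total))
print("Missing: %d (%.0f%%)" % (missing, 100.0 * missing / total))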

Findings

After the numbers finished running,  it looks like we have the following.

  PDFs Percentage
Found 774,375 17%
Missing 3,715,300 83%
Total 4,489,675 100%

Put into a pie chart where red equals bad.

EOT2008 PDFs in EOT2012 Archive

So 83% of the PDF files that were present in 2008 are not present in the EOT2012 Archive.

With a little work it wouldn’t be hard to see how many of these PDFs are still present on the web today at the same URL as in 2008.  I would imagine it is a much smaller number than the 17%.
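A quick and dirty version of that check might look like the sketch below: take the 2008 URL for each of those PDFs and see whether it still responds today.  This assumes a file of the 2008 URLs that I’m not actually working from here, and a polite version would add rate limiting and retries.

import requests

still_there = gone = 0
with open("eot2008_pdf_urls.txt") as handle:          # hypothetical list of 2008 URLs
    for line in handle:
        url = line.strip()
        try:
            response = requests.head(url, timeout=10, allow_redirects=True)
            ok = response.status_code == 200
        except requests.RequestException:
            ok = False
        if ok:
            still_there += 1
        else:
            gone += 1

print(still_there, gone)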

A thing to note about this is that because I am using content hashes and not URLs, it is possible that an EOT2008 PDF is available at a different URL entirely in 2012 when it was harvested again. So the URL might not be available but the content could be available at another location.

If you have questions or comments about this post,  please let me know via Twitter.