First step analysis of Library of Congress Name Authority File

For a class this last semester I spent a bit of time working with the Library of Congress Name Authority File (LC-NAF) that is available here in a number of downloadable formats.

After downloading the file and extracting only the parts I was interested in, I was left with 7,861,721 names to play around with.

The resulting dataset has three columns: the unique identifier for a name, the category of either PersonalName or CorporateName, and finally the authoritative string for the given name.

Here is an example set of entries in the dataset.

<http://id.loc.gov/authorities/names/no2015159973> PersonalName Thomas, Mike, 1944-
<http://id.loc.gov/authorities/names/n00004656> PersonalName Gutman, Sharon A.
<http://id.loc.gov/authorities/names/no99024929> PersonalName Hornby, Lester G. (Lester George), 1882-1956
<http://id.loc.gov/authorities/names/n86050616> PersonalName Borisi\uFE20u\uFE21k, G. N. (Galina Nikolaevna)
<http://id.loc.gov/authorities/names/no2011132525> PersonalName Cope, Samantha
<http://id.loc.gov/authorities/names/nr92002092> PersonalName Okuda, Jun
<http://id.loc.gov/authorities/names/n2008028760> PersonalName Brandon, Wendy
<http://id.loc.gov/authorities/names/no2008088468> PersonalName Gminder, Andreas
<http://id.loc.gov/authorities/names/nb2013005548> CorporateName Archivo Hist\u00F3rico Provincial de Granada
<http://id.loc.gov/authorities/names/n84081250> PersonalName Mermier, Pierre-Marie, 1790-1862
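
A minimal sketch of how one of these entries could be split into its three columns, assuming the fields are separated by whitespace as they appear above. The function name parse_entry is hypothetical and is not part of the original processing.

```python
def parse_entry(line):
    """Split one dataset line into (identifier, category, name).

    Assumes whitespace-separated fields. The name is everything after
    the second field, so spaces and commas inside it are preserved.
    """
    uri, category, name = line.strip().split(None, 2)
    # Drop the angle brackets that wrap the identifier.
    return uri.strip("<>"), category, name


entry = parse_entry(
    "<http://id.loc.gov/authorities/names/n00004656> PersonalName Gutman, Sharon A."
)
print(entry)
```

Limiting the split to two breaks is what keeps a name like "Gutman, Sharon A." in one piece.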

I was interested in how Personal and Corporate names differ across the whole LC-NAF file and wanted to see if there were any patterns that I could tease out. The final goal is to see if I could train a classifier to automatically classify a name string into either the PersonalName or CorporateName class.

But more on that later.

Personal or Corporate Name

The first thing to take a look at in the dataset is the split between PersonalName and CorporateName strings.

LC-NAF Personal / Corporate Name Distribution

As you can see the majority of names in the LC-NAF are personal names with 6,361,899 (81%) and just 1,499,822 (19%) being corporate names.

Commas

One of the common formatting rules in library land is to invert names so that they are in the format of Last, First.  This is useful when sorting names as it will group names together by family name instead of ordering them by the first name.  Because of this common rule I expected that the majority of the personal names would have a comma.  I wasn’t sure what number of the corporate names would have a comma in them.

Distribution of Commas in Name Strings

In looking at the graph above you can see that it is true that the majority of personal names have commas 6,280,219 (99%) with a much smaller set of corporate names 213,580 (14%) having a comma present.
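
Counts like these can be produced with a small tally per category. The sketch below assumes the dataset has already been reduced to (category, name) pairs; count_names_containing is a hypothetical helper name, and the four sample records are taken from the example entries earlier in this post.

```python
from collections import Counter


def count_names_containing(records, char):
    """Return {category: (names containing char, total names)}."""
    totals = Counter()
    hits = Counter()
    for category, name in records:
        totals[category] += 1
        if char in name:
            hits[category] += 1
    return {category: (hits[category], totals[category]) for category in totals}


sample = [
    ("PersonalName", "Thomas, Mike, 1944-"),
    ("PersonalName", "Cope, Samantha"),
    ("PersonalName", "Okuda, Jun"),
    ("CorporateName", "Archivo Hist\u00f3rico Provincial de Granada"),
]
print(count_names_containing(sample, ","))
```

The same helper works for the other punctuation characters by changing the second argument.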

Periods

I next took a look at periods in the name string.  I wasn’t sure exactly what I would find in doing this so my only prediction was that there would be fewer name strings that have periods present.

Distribution of Periods in Name Strings

This time we see a bit different graph.  Personal names have 1,587,999 (25%) instances with periods while corporate names have 675,166 (45%) instances with periods.

Hyphens

Next up to look at are hyphens that occur in name strings.

Distribution of Hyphens in Name Strings

There are 138,524 (9%) corporate names and 2,070,261 (33%) personal names with hyphens present in the name string.

I know that there are many name strings in the LC-NAF that have dates in the format of yyyy-yyyy, yyyy-, or -yyyy. Let’s see how many name strings have a hyphen when we remove those.

Date and Non-Date Hyphens

This time we look at the instances that just have hyphens and divide them into two categories: “Date Hyphens” and “Non-Date Hyphens”.  You can see that most of the corporate name strings have hyphens that are not found in relation to dates.  The personal names on the other hand have the majority of hyphens occurring in date strings.
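
One way to separate the two kinds of hyphens is a regular expression for the three date shapes mentioned above. This is a sketch under that assumption, not necessarily how the graph was produced. The name hyphen_kinds is hypothetical, and the pattern would also treat any other four-digit number next to a hyphen as a date.

```python
import re

# Matches the three date shapes: yyyy-yyyy, yyyy-, and -yyyy.
DATE_HYPHEN = re.compile(r"\d{4}-\d{4}|\d{4}-|-\d{4}")


def hyphen_kinds(name):
    """Return (has_date_hyphen, has_non_date_hyphen) for a name string."""
    has_date = DATE_HYPHEN.search(name) is not None
    # Remove the date portions, then see whether any hyphen is left over.
    has_non_date = "-" in DATE_HYPHEN.sub("", name)
    return has_date, has_non_date


for name in ("Thomas, Mike, 1944-", "Mermier, Pierre-Marie, 1790-1862"):
    print(name, hyphen_kinds(name))
```

A name such as "Mermier, Pierre-Marie, 1790-1862" has both kinds, so the two flags are reported separately rather than as a single label.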

Parentheses

The final punctuation characters we will look at are parentheses.

Distribution of Parentheses in Name Strings

We see that most names overall don’t have parentheses in them.  There are 472,254 (31%) corporate name strings in the dataset with parentheses and 541,087 (9%) personal name strings that have parentheses.

This post is the first in a short series that takes a look at the LC Name Authority File to get a better understanding of how names in library metadata have been constructed over the years.

If you have questions or comments about this post,  please let me know via Twitter.

Removing leading or trailing white rows from images

At the library we are working on a project to digitize television news scripts from KXAS, the NBC affiliate from Fort Worth.  These scripts were read on the air during the broadcast and are a great entry point into the vast collection of film and tape that is housed at the UNT Libraries.

To date we’ve digitized and made available over 13,000 of these scripts.

In looking at workflows we noticed that sometimes the scanners and scanning software would leave several rows of white pixels at the leading or trailing end of the image.

It is kind of hard to see that because this page has a white background so I’ll include a closeup for you.  I put a black border around the image to help the white stand out a bit.

Detail of leading white edge

One problem with these white rows is that they happen some of the time but not all of the time.  Another problem is that the number of white lines isn’t uniform; it varies from image to image when it occurs. The final problem is that it is not consistently at the top or at the bottom of the image. It could be at the top, the bottom, or both.

Probably the best solution to this problem is going to be getting different control software for the scanners that we are using.  But that won’t help the tens of thousands of these images that we have already scanned and that we need to process.

Trimming white line

Manual

There are a number of ways that we can approach this task.  First we can do what we are currently doing which is to have our imaging students open each image and manually crop them if needed.  This is very time consuming.

Photoshop

There is a tool in Photoshop that can sometimes be useful for this kind of work.  It is the “Trim” tool.  Here is the dialog box you get when you select this tool.

Photoshop Trim Dialog Box

This allows you to select if you want to remove from the top or bottom (or left or right).  The tool wants you to select a place on the image to grab a color sample and then it will try and trim off rows of the image that match that color.

Unfortunately this wasn’t an ideal solution because you still had to know if you needed to crop from the top or bottom.

Imagemagick

Imagemagick tools have an option called “trim” that does a very similar thing to the Photoshop Trim tool.  It is well described on this page.

By default the trim option here will remove edges around the whole image that match a pixel value.  You are able to adjust the specificity of the pixel color if you add a little blur but it isn’t an ideal solution either.

A little Python

My next thing to look at was to use a bit of Python to identify the number of rows in an image that are white.

With this script you feed it an image filename and it will return the number of rows from the top of the image that are at least 90% white.

The script will convert the incoming image into a grayscale image, and then line by line count the number of pixels that have a pixel value greater than 225 (so a little white all the way to white white).  It will then count a line as “white” if more than 90% of the pixels on that line have a value of greater than 225.

Once the script reaches a row that isn’t white, it ends and returns the number of white lines it has found.  If the first row of the image is not a white row it will immediately return with a value of 0.
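
The script itself isn’t reproduced here, so what follows is a minimal sketch of the logic just described, not the original code. The function names are hypothetical, and loading the image assumes the Pillow library is installed. The counting function works on plain rows of grayscale values so it can be tried without any image at all.

```python
def count_leading_white_rows(rows, threshold=225, fraction=0.9):
    """Count rows from the top where more than `fraction` of the pixels
    have a grayscale value greater than `threshold`."""
    count = 0
    for row in rows:
        white = sum(1 for value in row if value > threshold)
        if white > fraction * len(row):
            count += 1
        else:
            # Stop at the first row that isn't white.
            break
    return count


def rows_from_image(filename):
    """Load an image as a list of rows of grayscale values (needs Pillow)."""
    from PIL import Image  # imported here so the counting logic has no dependency

    with Image.open(filename) as img:
        gray = img.convert("L")  # "L" is 8-bit grayscale, one byte per pixel
        width, height = gray.size
        data = gray.tobytes()
    return [data[y * width:(y + 1) * width] for y in range(height)]


# Two white rows, then a dark row, then another white row that is never reached.
print(count_leading_white_rows([[255] * 10, [240] * 10, [0] * 10, [255] * 10]))
```

For a real file the two pieces combine as count_leading_white_rows(rows_from_image(filename)).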

The next thing is to go back to Imagemagick but this time use the -chop flag to remove the number of rows from the image that the previous script specified.

mogrify -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

We tell mogrify to chop off the first fifteen rows of the image with the 0x15 value.  The value is given as width x height, so this means remove zero columns and fifteen rows of pixels, starting from the top of the image.

Here is what the final image looks like without the leading white edge.

Corrected image

In order to count the rows from the bottom you have to adjust the script in one place.  Basically you reverse the order of the rows in the image so you work from the bottom first.  This allows you to apply the same logic to finding white rows as we did before.
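
The adjustment can be sketched as a single flag that reverses the rows before counting. As before, this is an illustration with a hypothetical function name, working on rows of grayscale values instead of a real image file.

```python
def count_white_rows(rows, from_bottom=False, threshold=225, fraction=0.9):
    """Count consecutive white rows from the top, or from the bottom when
    `from_bottom` is set, using the same 90% rule in both directions."""
    if from_bottom:
        # The one change: walk the rows in reverse order.
        rows = reversed(list(rows))
    count = 0
    for row in rows:
        white = sum(1 for value in row if value > threshold)
        if white > fraction * len(row):
            count += 1
        else:
            break
    return count


page = [[255] * 10, [0] * 10, [10] * 10, [250] * 10, [255] * 10]
print(count_white_rows(page), count_white_rows(page, from_bottom=True))
```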

You have to adjust the Imagemagick command as well so that you are chopping the rows from the bottom of the image and not the top.  You do this by specifying -gravity in the command.

mogrify -gravity bottom -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

With a little bit of bash scripting these scripts can be used to process a whole folder full of images and instructions can be given to only process images that have rows that need to be removed.
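
The batch step is described above in terms of bash; the sketch below does the same job in Python instead, so it is an alternative and not the script used for the project. The helper names are hypothetical. It assumes mogrify is on the PATH and that the row counts have already been gathered. For the bottom edge it passes South, which is the compass-style name ImageMagick documents for -gravity.

```python
import subprocess


def chop_command(filename, rows, from_bottom=False):
    """Build the mogrify argument list, or None when nothing needs removing."""
    if rows <= 0:
        return None
    command = ["mogrify"]
    if from_bottom:
        # South is ImageMagick's gravity name for the bottom edge.
        command += ["-gravity", "South"]
    command += ["-chop", "0x{}".format(rows), filename]
    return command


def trim_images(row_counts, from_bottom=False, run=subprocess.check_call):
    """Run mogrify for each (filename, rows) pair that needs trimming.

    Returns the list of filenames that were processed.
    """
    processed = []
    for filename, rows in row_counts:
        command = chop_command(filename, rows, from_bottom)
        if command is not None:
            run(command)
            processed.append(filename)
    return processed


print(chop_command("UNTA_AR0787-010-1959-06-14-07_01.tif", 15))
```

Images with a count of zero are skipped, so only files that actually have white rows are rewritten. Because mogrify overwrites files in place, it is worth trying this on copies first.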

This combination of a small Python script to gather image information and then passing that info on to Imagemagick has been very useful for this project and there are a number of other ways that this same pattern can be used for processing images in a digital library workflow.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – Curator Intent

This is another post in a series that I’ve been doing to compare the End of Term Web Archives from 2008 and 2012.  If you look back a few posts in this blog you will see some other analysis that I’ve done with the datasets so far.

One thing that I am interested in understanding is how well the group that conducted the EOT crawls did in relation to what I’m calling “curator intent”.  For both EOT archives, suggested seeds were collected using instances of the URL Nomination Tool hosted by the UNT Libraries.  Bulk lists of seed URLs collected by various institutions and individuals were combined with individual nominations made by users of the nomination tool.  The resulting lists were used as seed lists for the crawlers that harvested the EOT archives.  In 2008 there were four institutions that crawled content: the Internet Archive (IA), Library of Congress (LOC), California Digital Library (CDL), and the UNT Libraries (UNT).  In 2012 CDL was not able to do any crawling so just IA, LOC and UNT crawled.  UNT and LOC had limited scope in what they were interested in crawling while CDL and IA took the entire seed list and used that to feed their crawlers.  The crawlers were scoped very wide so that they would get as much content as they could: the nomination seeds were used as starting places and we allowed the crawlers to go to all subdomains and paths on those sites as well as to areas that the sites linked to on other domains.

During the capture period there wasn’t consistent quality control performed for the crawls; we accepted what we could get and went on with our business.

Looking back at the crawling that we did I was curious about two things.

  1. How many of the domain names from the nomination tool were not present in the EOT archive.
  2. How many domains from .gov and .mil were captured but not explicitly nominated.

EOT2008 Nominated vs Captured Domains.

In the 2008 nominated URL list from the URL Nomination Tool there were a total of 1,252 domains with 1,194 being either .gov or .mil.  In the EOT2008 archive there were a total of 87,889 domains and 1,647 of those were either .gov or .mil.

There are 943 .gov or .mil domains that are present in both the 2008 nomination list and the EOT2008 archive.  There are 251 .gov or .mil domains from the nomination list that were not present in the EOT2008 archive. There are 704 .gov or .mil domains that are present in the EOT2008 archive but that aren’t present in the 2008 nomination list.
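
This three-way breakdown maps directly onto set operations. The sketch below assumes each list has already been reduced to a collection of domain names. The function name is hypothetical, and example.gov is a made-up placeholder standing in for a nominated domain that was never captured.

```python
def compare_domains(nominated, captured):
    """Split two collections of domain names into the three groups discussed above."""
    nominated, captured = set(nominated), set(captured)
    return {
        "both": nominated & captured,
        "nominated_not_captured": nominated - captured,
        "captured_not_nominated": captured - nominated,
    }


groups = compare_domains(
    nominated=["loc.gov", "nasa.gov", "example.gov"],  # example.gov is a placeholder
    captured=["loc.gov", "nasa.gov", "womenshealth.gov"],
)
for label, domains in groups.items():
    print(label, sorted(domains))
```

The counts reported above are consistent with this kind of breakdown: 943 + 251 = 1,194 nominated and 943 + 704 = 1,647 captured.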

Below is a chart showing the nominated vs captured for the .gov and .mil

2008 .gov and .mil Nominated and Archived

Of those 704 domains that were captured but never nominated, here are the thirty most prolific.

Domain URLs
womenshealth.gov 168,559
dccourts.gov 161,289
acquisition.gov 102,568
america.gov 89,610
cfo.gov 83,846
kingcounty.gov 61,069
pa.gov 42,955
dc.gov 28,839
inl.gov 23,881
nationalservice.gov 22,096
defenseimagery.mil 21,922
recovery.gov 17,601
wa.gov 14,259
louisiana.gov 12,942
mo.gov 12,570
ky.gov 11,668
delaware.gov 10,124
michigan.gov 9,322
invasivespeciesinfo.gov 8,566
virginia.gov 8,520
alabama.gov 6,709
ct.gov 6,498
idaho.gov 6,046
ri.gov 5,810
kansas.gov 5,672
vermont.gov 5,504
arkansas.gov 5,424
wi.gov 4,938
illinois.gov 4,322
maine.gov 3,956

I see quite a few state and local governments that have a .gov domain which was out of scope of the EOT project but there are also a number of legitimate domains in the list that were never nominated.

EOT2012 Nominated vs Captured Domains.

In the 2012 nominated URL list from the URL Nomination Tool there were a total of 1,674 domains with 1,551 of those being .gov or .mil domains.  In the EOT2012 archive there were a total of 186,214 domains and 1,944 of those were either .gov or .mil.

There are 1,343 .gov or .mil domains that are present in both the 2012 nomination list and the EOT2012 archive.  There are 208 .gov or .mil domains from the nomination list that were not present in the EOT2012 archive. There are 601 .gov or .mil domains that are present in the EOT2012 archive but that aren’t present in the 2012 nomination list.

Below is a chart showing the nominated vs captured for the .gov and .mil

2012 .gov and .mil Domains Nominated and Archived

Of those 601 domains that were captured but never nominated, here are the thirty most prolific.

Domain URLs
gao.gov 952,654
vaccines.mil 856,188
esgr.mil 212,741
fdlp.gov 156,499
copyright.gov 70,281
congress.gov 40,338
openworld.gov 31,929
americaslibrary.gov 18,415
digitalpreservation.gov 17,327
majorityleader.gov 15,931
sanjoseca.gov 10,830
utah.gov 9,387
dc.gov 9,063
nyc.gov 8,707
ng.mil 8,199
ny.gov 8,185
wa.gov 8,126
in.gov 8,011
vermont.gov 7,683
maryland.gov 7,612
medicalmuseum.mil 7,135
usbg.gov 6,724
virginia.gov 6,437
wv.gov 6,188
compliance.gov 6,181
mo.gov 6,030
idaho.gov 5,880
nv.gov 5,709
ct.gov 5,628
ne.gov 5,414

Again there are a number of state and local government domains present in the list but up at the top we see quite a few URLs harvested from domains that are federal in nature and would fit into the collection scope for the EOT project.

How did we do?

The way that seed lists for the nomination tool were collected for the EOT2008 and EOT2012 nomination lists introduced a bit of dirty data.  We would need to look a little deeper to see what the issues were with these. Some things that come to mind are that we got seeds from domains that existed prior to 2008 or 2012 but that didn’t exist when we were harvesting.  Also there could have been typos in the URLs that were nominated so we never grabbed the suggested content.  We might want to introduce a validation process for the nomination tool that lets us know what the status of a URL in a project is at a given point so that we can at least have some sort of record.

 

 

 

Comparing Web Archives: EOT2008 and EOT2012 – What disappeared

This is the fourth post in a series that looks at the End of Term Web Archives captured in 2008 and 2012.  In previous posts I’ve looked at the when, what, and where of these archives.  In doing so I pulled together the domain names from each of the archives to compare them.

My thought was that I could look at which domains had content in the EOT2008 or EOT2012 and compare these domains to get some very high level idea of what content was around in 2008 but was completely gone in 2012.  Likewise I could look at new content domains that appeared since 2008.  For this post I’m limiting my view to just the domains that end in .gov or .mil because they are generally the focus of these web archiving projects.

Comparing EOT2008 and EOT2012

There are 1,647 unique domain names in the EOT2008 archive and 1,944 unique domain names in the EOT2012 archive, which is an increase of 18%. Between the two archives there are 1,236 domain names that are common.  There are 411 domains that exist in the EOT2008 that are not present in EOT2012, and 708 new domains in EOT2012 that didn’t exist in EOT2008.

Domains in EOT2008 and EOT2012

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  When you look at the URLs in the 411 domains that are present in EOT2008 and missing in EOT2012 you get 3,784,308 which is just 2% of the total number of URLs.  When you look at the EOT2012 domains that were only present in 2012 compared to 2008 you see 5,562,840 URLs (3%) that were harvested from domains that only existed in the EOT2012 archive.

The thirty domains with the most URLs captured for them that were present in the EOT2008 collection that weren’t present in EOT2012 are listed in the table below.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
arpa.gov 15,093
okgeosurvey1.gov 14,595
omhrc.gov 14,594
usafreedomcorp.gov 14,298
uscva.gov 13,627
odci.gov 12,920

The thirty domains with the most URLs from EOT2012 that weren’t present in EOT2008.

Domain Count
militaryonesource.mil 859,843
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
saferproducts.gov 65,387
challenge.gov 63,808
healthdata.gov 63,105
marinecadastre.gov 62,882
fatherhood.gov 62,132
edpubs.gov 58,356
transportationresearch.gov 58,235
cbca.gov 56,043
usbonds.gov 55,102
usbond.gov 54,847
phe.gov 53,626
ussavingsbond.gov 53,563
scienceeducation.gov 53,468
mda.gov 53,010

Shared domains that changed

There were a number of domains (1,236) that are present in both the EOT2008 and EOT2012 archives.  I thought it would be interesting to compare those domains and see which ones changed the most.  Below are the fifty shared domains that changed the most between EOT2008 and EOT2012.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
congress.gov 1,554,199 40,338 -1,513,861 1,513,861 -97%
noaa.gov 5,317,872 3,985,633 -1,332,239 1,332,239 -25%
epa.gov 1,628,517 327,810 -1,300,707 1,300,707 -80%
uscourts.gov 1,484,240 184,507 -1,299,733 1,299,733 -88%
dol.gov 1,387,724 88,557 -1,299,167 1,299,167 -94%
census.gov 1,604,505 328,014 -1,276,491 1,276,491 -80%
dot.gov 1,703,935 554,325 -1,149,610 1,149,610 -67%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
doe.gov 1,164,955 268,694 -896,261 896,261 -77%
vaccines.mil 5,665 856,188 850,523 850,523 15,014%
fdlp.gov 991,747 156,499 -835,248 835,248 -84%
uspto.gov 980,215 155,428 -824,787 824,787 -84%
bts.gov 921,756 130,730 -791,026 791,026 -86%
cdc.gov 1,014,213 264,500 -749,713 749,713 -74%
lbl.gov 743,472 4,080 -739,392 739,392 -99%
faa.gov 945,446 206,500 -738,946 738,946 -78%
treas.gov 838,243 99,411 -738,832 738,832 -88%
fema.gov 903,393 172,055 -731,338 731,338 -81%
clinicaltrials.gov 919,490 196,642 -722,848 722,848 -79%
army.mil 2,228,691 2,936,308 707,617 707,617 32%
nsf.gov 760,976 65,880 -695,096 695,096 -91%
prc.gov 740,176 75,682 -664,494 664,494 -90%
doc.gov 823,825 173,538 -650,287 650,287 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
nbii.gov 577,708 391 -577,317 577,317 -100%
defense.gov 687 575,776 575,089 575,089 83,710%
usajobs.gov 3,487 551,217 547,730 547,730 15,708%
sandia.gov 736,032 210,429 -525,603 525,603 -71%
nps.gov 706,323 191,102 -515,221 515,221 -73%
defenselink.mil 502,023 1,868 -500,155 500,155 -100%
fws.gov 625,180 132,402 -492,778 492,778 -79%
ssa.gov 609,784 125,781 -484,003 484,003 -79%
archives.gov 654,689 175,585 -479,104 479,104 -73%
fnal.gov 575,167 1,051,926 476,759 476,759 83%
change.gov 486,798 24,820 -461,978 461,978 -95%
buyusa.gov 490,179 37,053 -453,126 453,126 -92%

Only 11 of the 50 (22%) resulted in more content harvested in EOT2012 than EOT2008.

Of the eleven domains that had more content harvested for them in EOT2012, there were five (navy.mil, osd.mil, vaccines.mil, defense.gov, and usajobs.gov) that increased by over 1,000% in the amount of content.  I don’t know if this is necessarily a result of an increase in attention to these sites, more content on the sites, or a different organization of the sites that made them easier to harvest.  I suspect it is some combination of all three of those things.
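
For reference, the three change columns in the table above can be derived from the two URL counts as in this sketch. The function name is hypothetical, and rounding the percentage to a whole number is assumed from how the table is printed.

```python
def change_columns(count_2008, count_2012):
    """Return (change, absolute change, percent change) for one domain.

    count_2008 must be non-zero, which holds for domains present in both archives.
    """
    change = count_2012 - count_2008
    percent = round(100 * change / count_2008)
    return change, abs(change), percent


# Counts for house.gov and gpo.gov taken from the table.
print(change_columns(13694187, 35894356))
print(change_columns(8705511, 3888645))
```

The table is ordered by the absolute change, which is why large decreases such as gpo.gov sit next to large increases such as senate.gov.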

Summary

It should be expected that there are going to be domains that come into and go out of existence on a regular basis in a large web space like the federal government.  One of the things that I think is rather challenging to identify is a list of domains that were present at one given time within an organization.  For example “what domains did the federal government have in 1998?”.  It seems like a way to come up with that answer is to use web archives. We see based on the analysis in this post that there are 411 domains that were present in 2008 that we weren’t able to capture in 2012.  Take a look at that list of the top thirty.  Did you recognize any of those? How many other initiatives, committees, agencies, task forces, and citizen education portals existed at one point that are now gone?

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – Where

This post carries on in the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discuss when content was harvested and what kind of content was harvested and included in the archives.

In this post we will look at where content came from, specifically the data held in the top level domains, domain names and sub-domain names.
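
As a rough illustration of the three levels, the sketch below pulls a top level domain, a domain name, and a full hostname out of a URL. It assumes complete URLs with a scheme, and it takes the domain to be simply the last two labels of the hostname, which would not handle suffixes like co.uk correctly. The function name is hypothetical and this is not necessarily how the CDX data was processed.

```python
from urllib.parse import urlsplit


def host_parts(url):
    """Return (tld, domain, subdomain) for a URL.

    The domain is the last two labels of the hostname and the subdomain
    is the full hostname, a naive rule that ignores multi-part suffixes.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    tld = labels[-1]
    domain = ".".join(labels[-2:])
    return tld, domain, host


print(host_parts("http://boucher.house.gov/index.html"))
```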

Top Level Domains

The first thing to look at is the top level domains for all of the URLs in the CDX files.

In the EOT2008 archive there are a total of 241 unique TLDs.  In the EOT2012 archive there are a total of 251 unique TLDs.  This is a modest increase of 4.15% from EOT2008 to EOT2012.

The EOT2008 and EOT2012 archives share 225 TLDs between the two archives.  There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.

TLDs unique to EOT2008

Unique to 2008 URLs from TLD
null 18,772
www 583
yu 357
labs 20
webteam 16
cg 10
security 8
ssl 8
b 8
css 7
web 6
dev 4
education 4
misc 2
secure 2
campaigns 2

TLDs unique to EOT2012

Unique to 2012 URLs from TLD
whois 17,500
io 7,935
pn 987
sy 541
lr 478
so 418
nr 363
tf 291
xxx 258
re 186
xn--p1ai 171
bi 153
dm 120
tel 78
ck 65
ax 64
sx 54
tg 50
ki 48
gg 25
kn 25
gp 24
pm 20
fk 18
cf 7
wf 3

I believe that the “null” TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX file that correspond to metadata records in the warc/arcs from 2008.  I will have to do some digging to confirm.

Change in TLD

Next up we take a look at the 225 TLDs that are shared between the archives. First up are the fifteen most changed based on the increase or decrease in the number of URLs from that TLD

TLD eot2008 eot2012 Change Absolute Change % change
com 7,809,711 45,594,482 37,784,771 37,784,771 483.8%
gov 137,829,050 109,141,353 -28,687,697 28,687,697 -20.8%
mil 3,555,425 16,223,861 12,668,436 12,668,436 356.3%
net 653,187 9,269,406 8,616,219 8,616,219 1319.1%
edu 3,552,509 2,442,626 -1,109,883 1,109,883 -31.2%
int 135,939 685,168 549,229 549,229 404.0%
uk 70,262 594,020 523,758 523,758 745.4%
ly 95 503,457 503,362 503,362 529854.7%
org 5,108,645 5,588,750 480,105 480,105 9.4%
us 840,516 474,156 -366,360 366,360 -43.6%
co 2,839 211,131 208,292 208,292 7336.8%
be 4,019 203,178 199,159 199,159 4955.4%
jp 23,896 220,602 196,706 196,706 823.2%
me 35 182,963 182,928 182,928 522651.4%
tv 10,373 191,736 181,363 181,363 1748.4%

Interesting is the change in the first two.  There was an increase of over 37 million URLs (484%) for the com TLD between EOT2008 and EOT2012.  There was also a decrease (-21%) of over 28 million URLs for the gov TLD.  The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests with an increase of over 12 million URLs.

You can see that .ly and .me increased by some serious percentage,  529,855% and 522,651% respectively.

Taking a look at just the percent of change, here are the five most changed based on that percentage

TLD eot2008 eot2012 Change Absolute Change % change
ly 95 503,457 503,362 503,362 529854.7%
me 35 182,963 182,928 182,928 522651.4%
gl 129 49,733 49,604 49,604 38452.7%
gd 9 3,273 3,264 3,264 36266.7%
cat 43 11,703 11,660 11,660 27116.3%

I have a feeling that the majority of the ly, me, gl, and gd TLD content came in as redirect URLs from link shortening services.

Domain Names

There are 87,889 unique domain names in the EOT2008 archive.  This increases dramatically in the EOT2012 archive to 186,214, which is an increase of 112% in the number of domain names.

There are 30,066 domain names that are shared between the two archives.  There are 57,823 domain names that are unique to the EOT2008 archive and 156,148 domain names that are unique to the EOT2012 archive.

Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
meish.org 147,261
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
secure-banking.com 33,059
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
federaljudgesassoc.org 15,126
arpa.gov 15,093
transportationfortomorrow.org 14,926
okgeosurvey1.gov 14,595

Here is the same kind of table but this time for the EOT2012 dataset.

Domain Count
militaryonesource.mil 859,843
yfrog.com 682,664
staticflickr.com 640,606
akamaihd.net 384,769
4sqi.net 350,707
foursquare.com 340,492
adf.ly 334,767
pinterest.com 244,293
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
t.co 175,033
youtu.be 172,301
sndcdn.com 161,039
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
nrsc.org 85,925
mzstatic.com 84,202
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
mynextmove.org 75,927
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
ntm-a.com 71,126

Those are pretty long tables but I think they start to point at some interesting things from this analysis.  First are the domains that were present and harvested in 2008 but that weren’t harvested in 2012.  In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for “Federal Web” but got captured because of the gov TLD.

In the EOT2012 list you start to see artifacts from an increase in attention to social media site capture for the EOT2012 project.  Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, and pinterest.com probably came from that increased attention.

Here is a list of the twenty most changed domains from EOT2008 to EOT2012.  This number is based on the absolute change in the number of URLs captured for each of the archives.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
facebook.com 11,895 7,503,640 7,491,745 7,491,745 62,982%
dvidshub.net 1,097 5,612,410 5,611,313 5,611,313 511,514%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
flickr.com 157,155 2,286,890 2,129,735 2,129,735 1,355%
youtube.com 346,272 2,369,108 2,022,836 2,022,836 584%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
granicus.com 782 1,785,724 1,784,942 1,784,942 228,253%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
googleusercontent.com 2 1,560,457 1,560,455 1,560,455 78,022,750%

You see big increases in facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (584%) and googleusercontent.com (78,022,750%) in content from EOT2008 to EOT2012.

Other increases that are notable include dvidshub.net which is the domain for a site called Defense Video & Imagery Distribution System that increased by 511,514%, navy.mil (3,739%), osd.mil (1,073%), af.mil (795%).  I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.

Another domain that stands out to me is granicus.com which I was unaware of but after a little looking turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).

.gov and .mil subdomains

The last piece I wanted to look at related to domain names was to see what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls.  This time I wanted to look at the subdomains.

I filtered my dataset a bit so that I was only looking at the .mil and .gov content.

In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains.  This is an increase of 40% between the two archive projects.

The EOT2008 has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.

Subdomains that had the most content (based on URLs downloaded) and which are only present in EOT2008 are presented below.  (Limited to the top 30)

EOT2008 Subdomain Count
gos2.geodata.gov 809,442
boucher.house.gov 772,759
kendrickmeek.house.gov 685,368
citizensbriefingbook.change.gov 446,632
stat-usa.gov 305,936
nifl.gov 285,833
scidac-new.ca.sandia.gov 247,451
tradestatsexpress.gov 243,729
hpcf.nersc.gov 221,626
gopher.info.usaid.gov 219,051
novel.nifl.gov 218,962
dli2.nsf.gov 206,932
contractorsupport.acf.hhs.gov 188,841
pnwin.nbii.gov 188,591
faq.acf.hhs.gov 184,212
ccdf.acf.hhs.gov 182,606
arnet.gov 174,018
regulations.acf.hhs.gov 171,762
acqnet.gov 171,493
dccourts.gov 161,289
employers.acf.hhs.gov 139,141
search.info.usaid.gov 137,816
web-services.gov 137,202
earth2.epa.gov 136,441
cjtf7.army.mil 134,507
ncweb-north.wr.usgs.gov 134,486
opre.acf.hhs.gov 133,689
childsupportenforcement.acf.hhs.gov 132,023
modis-250m.nascom.nasa.gov 128,810
casd.uscourts.gov 124,146

Here is the same sort of data for the EOT2012 dataset.

EOT2012 Subdomain Count
militaryonesource.mil 698,035
uscodebeta.house.gov 387,080
democrats.foreignaffairs.house.gov 312,270
gulflink.fhpr.osd.mil 262,246
coons.senate.gov 257,721
democrats.energycommerce.house.gov 243,341
consumerfinance.gov 225,815
dcmo.defense.gov 217,255
nrd.gov 187,267
wh.gov 179,103
usaxs.xray.aps.anl.gov 178,298
democrats.budget.house.gov 175,109
democrats.edworkforce.house.gov 162,077
apps.militaryonesource.mil 157,144
naturalresources.house.gov 155,918
purl.fdlp.gov 154,718
media.dma.mil 137,581
algreen.house.gov 131,388
democrats.transportation.house.gov 129,345
democrats.naturalresources.house.gov 124,808
hanabusa.house.gov 123,794
pitts.house.gov 122,402
visclosky.house.gov 122,223
garamendi.house.gov 114,221
vault.fbi.gov 113,873
green.house.gov 113,040
sewell.house.gov 112,973
levin.house.gov 111,971
eia.gov 111,889
hahn.house.gov 111,024

This last table is a little long, but I found the data pretty interesting to look at. The table below shows the biggest changes for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I’ve included the top forty entries for that list.

Subdomain/Domain EOT2008 EOT2012 Change Absolute Change % Change
listserv.access.gpo.gov 2,217,565 7,487 -2,210,078 2,210,078 -100%
carter.house.gov 1,898,462 29,680 -1,868,782 1,868,782 -98%
catalog.gpo.gov 1,868,504 34,040 -1,834,464 1,834,464 -98%
loc.gov 63,534 1,875,264 1,811,730 1,811,730 2,852%
gpo.gov 52,427 1,796,925 1,744,498 1,744,498 3,327%
bensguide.gpo.gov 90,280 1,790,017 1,699,737 1,699,737 1,883%
edocket.access.gpo.gov 1,644,578 7,822 -1,636,756 1,636,756 -100%
nws.noaa.gov 103,367 1,676,264 1,572,897 1,572,897 1,522%
navair.navy.mil 220 1,556,320 1,556,100 1,556,100 707,318%
congress.gov 1,525,467 356 -1,525,111 1,525,111 -100%
cha.house.gov 1,366,520 109,192 -1,257,328 1,257,328 -92%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
dol.gov 1,052,335 41,909 -1,010,426 1,010,426 -96%
resourcescommittee.house.gov 1,008,655 335 -1,008,320 1,008,320 -100%
calvert.house.gov 20,530 1,014,416 993,886 993,886 4,841%
fdlp.gov 989,415 1,554 -987,861 987,861 -100%
lcweb2.loc.gov 466,623 1,451,708 985,085 985,085 211%
cramer.house.gov 1,011,872 60,879 -950,993 950,993 -94%
ed.gov 1,141,069 241,165 -899,904 899,904 -79%
vaccines.mil 5,638 856,113 850,475 850,475 15,085%
clinicaltrials.gov 919,362 193,158 -726,204 726,204 -79%
army.mil 4,831 725,934 721,103 721,103 14,927%
boehner.house.gov 7,472 695,625 688,153 688,153 9,210%
nces.ed.gov 702,644 31,922 -670,722 670,722 -95%
prc.gov 739,849 75,682 -664,167 664,167 -90%
navy.mil 1,481 654,254 652,773 652,773 44,077%
house.gov 818,095 172,066 -646,029 646,029 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
fema.gov 636,005 53,321 -582,684 582,684 -92%
frwebgate.access.gpo.gov 621,361 55,097 -566,264 566,264 -91%
siadapp.dmdc.osd.mil 43 559,076 559,033 559,033 1,300,077%
fdsys.gpo.gov 548,618 28 -548,590 548,590 -100%
tiger.census.gov 549,046 750 -548,296 548,296 -100%
rs6.loc.gov 550,489 6,695 -543,794 543,794 -99%
bennelson.senate.gov 16,203 553,698 537,495 537,495 3,317%
crapo.senate.gov 28,569 540,928 512,359 512,359 1,793%
eia.doe.gov 508,675 1,629 -507,046 507,046 -100%
epa.gov 623,457 117,794 -505,663 505,663 -81%
defenselink.mil 502,006 1,866 -500,140 500,140 -100%
access.gpo.gov 472,373 3,110 -469,263 469,263 -99%

I find this table interesting for a number of reasons. First, you see quite a bit more decline than I have seen in my other tables like this. In fact 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.

Looking at the list, I can also see the transition of some of the sites within GPO: for example, access.gpo.gov went down 99% in captured content and fdsys.gpo.gov went down by nearly 100%, while bensguide.gpo.gov increased by 1,883%.
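
The Change, Absolute Change and % Change columns in the table above are simple arithmetic on the two counts. A minimal sketch, with a hypothetical function name, using two rows from the table:

```python
def change_stats(count_2008, count_2012):
    """Return (change, absolute change, percent change) between two counts."""
    change = count_2012 - count_2008
    # Percent change is relative to the 2008 count; guard against dividing by zero.
    pct = change / count_2008 * 100 if count_2008 else float("inf")
    return change, abs(change), pct


# Two rows taken from the table above.
rows = [("loc.gov", 63534, 1875264), ("catalog.gpo.gov", 1868504, 34040)]
for name, c08, c12 in rows:
    change, absolute, pct = change_stats(c08, c12)
    print(f"{name} {change:,} {absolute:,} {pct:,.0f}%")
```

Sorting on the absolute change rather than the signed change is what lets big increases and big decreases sit side by side in the same list.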

Wrapping Up

I like to think that it helps to justify some of the work that the partners of the End of Term project are committing to the project when you see that there are large numbers of domains and subdomains that existed in 2008 but that weren’t crawled again in 2012 (and we can only assume they weren’t around in 2012).

There are a few more things I want to look at in this work so stay tuned.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – What

This post carries on from where the previous post in this series ended.

A very quick recap: this series is trying to better understand the EOT2008 and the EOT2012 web archives. The goal is to see how they are similar, how they are different, and if there is anything that can be learned that will help us with the upcoming EOT2016 project.

What

The CDX files we are using have a column that contains the Media Type (MIME Type) for the different URIs in the WARC files. A list of the assigned Media Types is available from the Internet Assigned Numbers Authority (IANA) in their Media Type Registry.

This is a field that is inherently “dirty” for a few reasons. It is populated from a field in the WARC record that comes directly from the web server that responded to the initial request. Usually these values are fairly accurate, but there are many times where they are either wrong or at the least confusing. Often this is caused by a server administrator, programmer, or system architect who was trying to be clever or just misconfigured something.

I looked at the Media Types for the two EOT collections to see if there are any major differences between what we collected in the two EOT archives.

In the EOT2008 archive there are a total of 831 unique Mime/Media Types, and in the EOT2012 archive there are a total of 1,208 unique type values.

I took the top 20 Mime/Media Types for each of the archives and pushed them together to see if there was any noticeable change in what we captured between the two archives.  In addition to just the raw counts I also looked at what percentage of the archive a given Media Type represented.  Finally I noted the overall change in those two percentages.

Media Type 2008 Count % of Archive 2012 Count % of Archive % Change Change in % of Archive
text/html 105,592,852 65.9% 116,238,952 59.9% 10.1% -6.0%
image/jpeg 13,667,545 8.5% 24,339,398 12.5% 78.1% 4.0%
image/gif 13,033,116 8.1% 8,408,906 4.3% -35.5% -3.8%
application/pdf 10,281,663 6.4% 7,097,717 3.7% -31.0% -2.8%
4,494,674 2.8% 613,187 0.3% -86.4% -2.5%
text/plain 3,907,202 2.4% 3,899,652 2.0% -0.2% -0.4%
image/png 2,067,480 1.3% 7,356,407 3.8% 255.8% 2.5%
text/css 841,105 0.5% 1,973,508 1.0% 134.6% 0.5%
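
The “% of Archive” and “Change in % of Archive” columns can be reproduced from the raw counts. This is a small sketch that assumes the totals are the overall URI counts reported for the two archives (160,212,141 for EOT2008 and 194,066,940 for EOT2012); the function and constant names are made up for illustration.

```python
EOT2008_TOTAL = 160_212_141  # total URIs in EOT2008, as reported in this series
EOT2012_TOTAL = 194_066_940  # total URIs in EOT2012, as reported in this series


def share_change(count_2008, count_2012):
    """Return each count's share of its archive (percent) and the change in share."""
    share_2008 = count_2008 / EOT2008_TOTAL * 100
    share_2012 = count_2012 / EOT2012_TOTAL * 100
    return share_2008, share_2012, share_2012 - share_2008


# The text/html row from the table above.
s08, s12, diff = share_change(105_592_852, 116_238_952)
print(f"{s08:.1f}% {s12:.1f}% {diff:+.1f}")
```

Note that the last column is a difference in percentage points, which is why text/html can grow by 10.1% in raw count while its share of the archive still drops.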

Because I like pictures here is a chart of the percent change.

Change in Media Type

If we compare the Media Types between the two archives we find that the two archives share 527 Media Types.  The EOT2008 archive has 304 Media Types that aren’t present in EOT2012 and EOT2012 has 681 Media Types that aren’t present in EOT2008.

The ten most frequent Media Types by count found only in the EOT2008 archive are presented below.

Media Type Count
no-type 405,188
text/x-vcal 17,368
.wk1 8,761
x-text/tabular 5,312
application/x-wp 5,158
* 4,318
x-application/pdf 3,660
application/x-gunzip 3,374
image/x-fits 3,340
WINDOWS-1252 2,304

The ten most frequent Media Types by count found only in the EOT2012 archive are presented below.

Media Type Count
warc/revisit 12,190,512
application/http 1,050,895
application/x-mpegURL 23,793
img/jpeg 10,466
audio/x-flac 7,251
application/x-font-ttf 7,015
application/x-font-woff 6,852
application/docx 3,473
font/ttf 3,323
application/calendar 2,419

By the time of the EOT2012 archive, the team that captured content had fully moved to the WARC format for storing web archive content. The warc/revisit records are records for URLs whose content had not changed across more than one crawl. Instead of storing the content again, the warc/revisit record holds a reference to the previously captured content. That’s why there are so many of these Media Types.

Below is a table showing the thirty most changed Media Types that are present in both the EOT2008 and EOT2012 archives.  You can see both the change in overall numbers as well as the percentage change between the two archives.

Media Type EOT2008 EOT2012 Change % Change
image/jpeg 13,667,545 24,339,398 10,671,853 78.1%
text/html 105,592,852 116,238,952 10,646,100 10.1%
image/png 2,067,480 7,356,407 5,288,927 255.8%
image/gif 13,033,116 8,408,906 -4,624,210 -35.5%
4,494,674 613,187 -3,881,487 -86.4%
application/pdf 10,281,663 7,097,717 -3,183,946 -31.0%
application/javascript 39,019 1,511,594 1,472,575 3774.0%
text/css 841,105 1,973,508 1,132,403 134.6%
text/xml 344,748 1,433,159 1,088,411 315.7%
unk 4,326 818,619 814,293 18823.2%
application/rss+xml 64,280 731,253 666,973 1037.6%
application/x-javascript 622,958 1,232,306 609,348 97.8%
application/vnd.ms-excel 734,077 212,605 -521,472 -71.0%
text/javascript 69,340 481,701 412,361 594.7%
video/x-ms-asf 26,978 372,565 345,587 1281.0%
application/msword 563,161 236,716 -326,445 -58.0%
application/x-shockwave-flash 192,018 479,011 286,993 149.5%
application/octet-stream 419,187 191,421 -227,766 -54.3%
application/zip 312,872 92,318 -220,554 -70.5%
application/json 1,268 217,742 216,474 17072.1%
video/x-flv 1,448 180,222 178,774 12346.3%
image/jpg 26,421 172,863 146,442 554.3%
application/postscript 181,795 39,832 -141,963 -78.1%
image/x-icon 45,294 164,673 119,379 263.6%
chemical/x-mopac-input 110,324 1,035 -109,289 -99.1%
application/atom+xml 165,821 269,219 103,398 62.4%
application/xml 145,141 246,857 101,716 70.1%
application/x-cgi 100,813 51 -100,762 -99.9%
audio/mpeg 95,613 179,045 83,432 87.3%
video/mp4 1,887 73,475 71,588 3793.7%

Here is the same data presented as a set of graphs, the first showing the change in the number of instances of a given Media Type between the two archives.

30 Media Types that changed the most

The second graph is the percentage change between the two archives.

% Change in top 30 mimetypes shared between archives

Things that stand out are the growth of application/javascript between 2008 and 2012, up 3,774%, and of application/json, which was up over 17,000%. Two formats used to deliver video grew as well, with video/x-flv and video/mp4 increasing 12,346% and 3,794% respectively.

There were a number of Media Types that declined in number and percentage, but they are not as dramatic as those identified above. Of note is that between 2008 and 2012 there was a decline of nearly 100% (99.9%) in content with a Media Type of application/x-cgi and a 78% decrease in files that were application/postscript.

Working with the Media Types found in large web archives is a bit messy. While there are standard ways of presenting Media Types to browsers, there are also non-standard, experimental and inaccurate instances of Media Types that will exist in these archives. It does appear that we can see the introduction of some of the newer technologies between the two archives, such as the adoption of JSON and JavaScript based sites as well as new video formats on the web.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – When

In 2008 a group of institutions made up of the Internet Archive, Library of Congress, California Digital Library, University of North Texas, and Government Publishing Office worked together to collect the web presence of the federal government in a project that has come to be known as the End of Term Presidential Harvest 2008.

Working together, this group established the scope of the project, developed a tool to collect nominations of URLs important to the community for harvesting, and carried out a harvest of the federal web presence before the election, after the election, and after the inauguration of President Obama. This collection was harvested by the Internet Archive, Library of Congress, California Digital Library, and the UNT Libraries. At the end of the EOT project the data harvested was shared between the partners, with several institutions acquiring a copy of the complete EOT dataset for their local collections.

Moving forward four years, the same group got together to organize the harvesting of the federal domain in 2012. While originally scoped as a way of capturing the transition of the executive branch, this EOT project also served as a way to systematically capture a large portion of the federal web on a four-year calendar. In addition to the 2008 partners, Harvard joined the project for 2012.

Again the team worked to identify in-scope content to collect. This time, however, the content included URLs from the social web, including Twitter and Facebook, for agencies, offices and individuals in the federal government. Because the 2012 election did not bring a change in office, there was just one set of crawls that occurred during the fall of 2012 and the winter of 2013. Again this content was shared between the project partners interested in acquiring the archives for their own collections.

The End of Term group is a loosely organized group that comes together every four years to conduct the harvesting of the federal web presence. As we ramp up for the end of the Obama administration, the group has started to plan the EOT 2016 project with a goal to start crawling in September of 2016. This time there will be a new president, so the crawling will probably take the format of the 2008 crawls with a pre-election, post-election and post-inauguration set of crawls.

So far there hasn’t been much in the way of analysis to compare the EOT2008 and EOT2012 web archives.  There are a number of questions that have come up over the years that remain unanswered about the two collections.  This series of posts will hopefully take a stab at answering some of those questions and maybe provide better insight into the makeup of these two collections.  Finally there are hopefully a few things that can be learned from the different approaches used during the creation of these archives that might be helpful as we begin the EOT 2016 crawling.

Working with the EOT Data

The dataset that I am working with for these posts consists of the CDX files created for the EOT2008 and EOT2012 archives. Each of the CDX files acts as an index to the raw archived content and contains a number of fields that can be useful for analysis. All of the archive content is referenced in the CDX file.

If you haven’t looked at a CDX file in the past, here is an example of a few lines from one.

gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:martinelli,%20giovanni&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005312 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AMartinelli%2C+Giovanni&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:schumann-heink,%20ernestine&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005219 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3ASchumann-Heink%2C+Ernestine&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EL5OT5NAXGGV6VADBLNP2CBZSZ5MH6OT - - - 531160983 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:scotti,%20antonio&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005255 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AScotti%2C+Antonio&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 SEFDA5UNFREPA35QNNLI7DPNU3P4WDCO - - - 804325022 LOC-EOT2012-001-20121125003257404-04183-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:viafora,%20gina&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005309 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AFarrar%2C+Geraldine&fq%5B1%5D=take_vocal_id%3AViafora%2C+Gina&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 EV6N3TMKIVWAHEHF54M2EMWVM5DP7REJ - - - 532966964 LOC-EOT2012-001-20121125003355718-04184-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:homer,%20louise&fq[1]=take_composer_name:campana,%20f.%20&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125070122 http://www.loc.gov/jukebox/search/results?count=20&fq%5B0%5D=take_vocal_id%3AHomer%2C+Louise&fq%5B1%5D=take_composer_name%3ACampana%2C+F.+&page=1&q=geraldine+farrar&referrer=%2Fjukebox%2F text/html 200 FW2IGVNKIQGBUQILQGZFLXNEHL634OI6 - - - 661008391 LOC-EOT2012-001-20121125064213479-04227-15895~wbgrp-crawl012.us.archive.org~8443.warc.gz

The CDX format is a space-delimited file with the following fields:

  • SURT formatted URI
  • Capture Time
  • Original URI
  • MIME Type
  • Response Code
  • Content Hash (SHA1)
  • Redirect URL
  • Meta tags (not populated)
  • Compressed length (sometimes populated)
  • Offset in WARC file
  • WARC File Name
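
A minimal sketch of reading one of these lines in Python, assuming the eleven space-delimited fields listed above. The field names are illustrative labels chosen for this example (not an official schema), and the sample line is a shortened, made-up record in the same shape as the examples above.

```python
from collections import namedtuple

# Illustrative labels for the eleven columns listed above.
CdxRecord = namedtuple("CdxRecord", [
    "surt", "capture_time", "original_uri", "mime_type", "response_code",
    "content_hash", "redirect_url", "meta_tags", "compressed_length",
    "offset", "warc_filename",
])


def parse_cdx_line(line):
    """Split one space-delimited CDX line into a CdxRecord, or return None."""
    parts = line.strip().split(" ")
    if len(parts) != len(CdxRecord._fields):
        return None  # header line or malformed record
    return CdxRecord(*parts)


# Shortened, made-up line in the same shape as the examples above.
sample = ("gov,loc)/jukebox/ 20121125005312 http://www.loc.gov/jukebox/ "
          "text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532 "
          "example.warc.gz")
record = parse_cdx_line(sample)
print(record.capture_time, record.mime_type, record.offset)
```

Returning None for lines with an unexpected number of fields is a design choice for this sketch; a real script might log or count those lines instead.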

The tools I’m working with to analyze the EOT datasets consist of Python scripts that either extract specific data from the CDX files so that it can be further sorted and counted, or that work on these sorted and counted versions of the files.

I’m trying to post code and derived datasets in a GitHub repository called eot-cdx-analysis if you are interested in taking a look. There is also a link to the original CDX datasets there.

How much

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs. Unfortunately the CDX files that we are working with don’t have consistent size information that we can use for analysis, but the rough sizes for the archives are 16TB for EOT2008 and just over 41.6TB for EOT2012.

When

The first dimension I wanted to look at was when the content was harvested for each of the EOT rounds. In both cases we all remember starting the harvesting “sometime in September” and then ending the crawls “sometime in March” of the following year. How close were we to our memory?

For this I extracted the Capture Time field from the CDX file and converted it into a date (yyyy-mm-dd was a decent bucket to group into), then sorted and counted each instance of a date.
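
The bucketing step can be sketched in a few lines of Python. This assumes the Capture Time values are 14-digit timestamps in yyyymmddhhmmss form, as in the example CDX lines above; the function name is hypothetical.

```python
from collections import Counter


def count_by_date(capture_times):
    """Count captures per day from 14-digit CDX timestamps (yyyymmddhhmmss)."""
    counts = Counter()
    for ts in capture_times:
        # Skip anything that does not start with eight digits.
        if len(ts) >= 8 and ts[:8].isdigit():
            counts[f"{ts[:4]}-{ts[4:6]}-{ts[6:8]}"] += 1
    return counts


# Timestamps taken from the example CDX lines earlier in this post.
times = ["20121125005312", "20121125005219", "20121125070122"]
for day, n in sorted(count_by_date(times).items()):
    print(day, n)
```

Because yyyy-mm-dd strings sort chronologically, a plain sort of the keys is enough to get the counts in date order for charting.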

EOT2008 Harvest Dates

This first chart shows the harvest dates contained in the EOT2008 CDX files. Things got kicked off in September 2008 and apparently concluded all the way in October 2009. There is another blip of activity in May of 2009. This is probably something to go back and look at to help remember what exactly these two sets of crawling were that happened after March 2009, when we all seem to remember crawling stopping.

EOT2012 Harvest Dates

The EOT2012 crawling started off in mid-September and this time finished up in the first part of March 2013. There is a more consistent shape to the crawling for this EOT, with a fairly steady set of crawling happening between mid-October and the end of January.

EOT2008 and EOT2012 Harvest Dates Compared

When you overlay the two charts you can see how the two compare.  Obviously the EOT2008 data continues quite a bit further than the EOT2012 but where they overlap you can see that there were different patterns to the collecting.

Closing

This is the first of a few posts related to web archiving and specifically to comparing the EOT2008 and EOT2012 archives.  We are approaching the time to start the EOT2016 crawls and it would be helpful to have more information about what we crawled in the two previous cycles.

In addition to just needing to do this work there will be a presentation on some of these findings as well as other types of analysis at the 2016 Web Archiving and Digital Libraries (WADL) workshop that is happening at the end of JCDL2016 this year in Newark, NJ.

If there are questions you have about the EOT2008 or EOT2012 archives please get in contact with me and we can see if we can answer them.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Fields: Language used in descriptions.

This is the last post in a series of posts related to the Description field found in the Digital Public Library of America.  I’ve been working with a collection of 11,654,800 metadata records for which I’ve created a dataset of 17,884,946 description fields.

This past Christmas I received a copy of Thing Explainer by Randall Munroe. If you aren’t familiar with this book, Randall uses only the most used ten hundred words (thousand isn’t one of them) to describe very complicated concepts and technologies.

After seeing this book I started to wonder how much of the metadata we create for our digital objects uses just the 1,000 most frequent words. Frequently used words, as well as less complex words (words with fewer syllables), are often used in the calculation of the reading level of various texts, so that also got me thinking about the reading level required to understand some of our metadata records.

Along that train of thought, one of the things that we hear from aggregations of cultural heritage materials is that K-12 users are a target audience and that many of the resources we digitize are digitized with them in mind. With that being said, how often do we take them into account when we create our descriptive metadata?

When I was indexing the description fields I calculated three metrics related to this.

  1. What percentage of the tokens are in the 1,000 most frequently used English words.
  2. What percentage of the tokens are in the 5,000 most frequently used English words.
  3. What percentage of the tokens are words in a standard English dictionary.
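
All three metrics follow the same pattern: count how many tokens appear in a reference word list and divide by the number of tokens. Here is a minimal sketch of that idea; the word list is a tiny made-up stand-in for a real frequency list, and the function name is hypothetical.

```python
def percent_in_wordlist(tokens, wordlist):
    """Return the fraction of tokens (case-insensitive) found in wordlist."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in wordlist)
    return hits / len(tokens)


# Tiny made-up stand-in for a "most frequent words" list.
common_words = {"a", "view", "of", "the", "city"}
tokens = ["A", "view", "of", "the", "Santa", "Monica", "City", "Hall"]
print(percent_in_wordlist(tokens, common_words))
```

Swapping in a 1,000 word list, a 5,000 word list, or a dictionary word list gives the three values; keeping the word list as a set makes each lookup cheap even across millions of descriptions.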

From there I was curious about how the different providers compared to each other.

Average for 1,000, 5,000 and English Dictionary

1,000 most Frequent English Words

The first thing we will look at is the average amount of a description composed of words from the list of the 1,000 most frequently used English words.

Average percentage of description consisting of 1000 most frequent English words.

The provider/hub that stands out to me is of course bhl, which has very little usage of the 1,000 word vocabulary. It is followed by smithsonian, gpo, hathitrust and uiuc. On the other end of the scale is virginia, which has an average of 70%.

5,000 most Frequent English Words

Next up is the average percentage of the descriptions that consist of words from the 5,000 most frequently used English words.

Average percentage of description consisting of 5000 most frequent English words.

This graph ends up looking very much like the 1,000 words graph, just a bit higher percentage-wise. This is of course because the 5,000 word list includes the 1,000 word list. You do see a few changes in the ordering though: for example, gpo switches places with hathitrust in this graph compared to the 1,000 words graph above.

English Dictionary Words

Next is the average percentage of descriptions that consist of words from a standard English dictionary.  Again this includes the 1,000 and 5,000 words in that dictionary so it will be even higher.

Average percentage of description consisting of English dictionary words.

You see that the virginia hub has almost 100% of their descriptions consisting of English dictionary words. The hubs that are the lowest in their use of English words for descriptions are bhl, smithsonian, and nypl.

The graph below has 1,000, 5,000, and English Dictionary words grouped together for each provider/hub so you can see at a glance how they stack up.

1,000, 5,000 most frequent English words and English dictionary words by Provider

Stacked Percent 1,000, 5,000, English Dictionary

Next we will look at the percentages per provider/hub if we group the percentage utilization into 25% buckets.  This gives a more granular view of the data than just the averages presented above.
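
As a rough illustration of the bucketing, the sketch below assigns each description's value to one of four 25% buckets and reports what share of a provider's descriptions falls in each. The exact bucket boundaries and labels are assumptions for this example, and the values are made up.

```python
from collections import Counter


def bucket_label(fraction):
    """Map a fraction between 0 and 1 to one of four 25% buckets."""
    percent = fraction * 100
    if percent < 25:
        return "0-24%"
    if percent < 50:
        return "25-49%"
    if percent < 75:
        return "50-74%"
    return "75-100%"


# Made-up per-description values for a single provider.
values = [0.10, 0.30, 0.45, 0.50, 0.80, 1.00]
counts = Counter(bucket_label(v) for v in values)
total = len(values)
for label in ["0-24%", "25-49%", "50-74%", "75-100%"]:
    print(label, f"{counts[label] / total:.0%}")
```

Stacking those four shares per provider is what produces the stacked percent graphs.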

Percentage of descriptions by provider that use 1,000 most frequent English words.

Percentage of descriptions by provider that use 5,000 most frequent English words.

Percentage of descriptions by provider that use English dictionary words.

Closing

I don’t think it is that much of a stretch to draw parallels between the language used in our descriptions and the intended audience of our metadata records. How often are we writing metadata records for ourselves instead of our users? A great example that comes to mind is “recto” and “verso”, which we often use for the “front” and “back” of items. In the dataset I’ve been using there are 56,640 descriptions with the term “verso” and 5,938 with the term “recto”.

I think we should be taking into account our various audiences when we are creating metadata records.  I know this sounds like a very obvious suggestion but I don’t think we really do that when we are creating our descriptive metadata records.  Is there a target reading level for metadata records? Should there be?

Looking at the description fields in the DPLA dataset has been interesting. The kind of analysis that I’ve done so far can be seen as a kind of distant reading of these fields: big round numbers that are pretty squishy and only show the general shape of the field. A close reading of the metadata records is probably needed to better understand what is going on in these records.

Based on experience of mapping descriptive metadata into the Dublin Core metadata fields, I have a feeling that the description field is generally a dumping ground for information that many of us might not consider “description”. I sometimes wonder if we would do our users a greater service by adding a true “note” field to our metadata models so that we have a proper location to dump “notes and other stuff” instead of muddying up a field that should have an obvious purpose.

That’s about it for this work with descriptions,  or at least it is until I find some interest in really diving deeper into the data.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Fields: More statistics (so many graphs)

In the past few posts we looked at the length of the description fields in the DPLA dataset as a whole and at the provider/hub level.

The length of the description field isn’t the only value that was indexed for this work. In fact I indexed a variety of different values for each of the descriptions in the dataset.

Below are the fields I currently am working with.

Field Indexed Value Example
dpla_id 11fb82a0f458b69cf2e7658d8269f179
id 11fb82a0f458b69cf2e7658d8269f179_01
provider_s usc
desc_order_i 1
description_t A corner view of the Santa Monica City Hall.; Streetscape. Horizontal photography.
desc_length_i 82
tokens_ss “A”, “corner”, “view”, “of”, “the”, “Santa”, “Monica”, “City”, “Hall”, “Streetscape”, “Horizontal”, “photography”
token_count_i 12
average_token_length_f 5.5833335
percent_int_f 0
percent_punct_f 0.048780486
percent_letters_f 0.81707317
percent_printable_f 1
percent_special_char_f 0
token_capitalized_f 0.5833333
token_lowercased_f 0.41666666
percent_1000_f 0.5
non_1000_words_ss “santa”, “monica”, “hall”, “streetscape”, “horizontal”, “photography”
percent_5000_f 0.6666667
non_5000_words_ss “santa”, “monica”, “streetscape”, “horizontal”
percent_en_dict_f 0.8333333
non_english_words_ss “monica”, “streetscape”
percent_stopwords_f 0.25
has_url_b FALSE
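
Several of the values in the table above can be approximated with a short script. This is a minimal sketch that assumes tokens are made by splitting on whitespace and stripping punctuation from the ends of each token, which may not match the tokenizer used for the actual index; the function names and dictionary keys are hypothetical.

```python
import string


def tokenize(description):
    """Split on whitespace and strip punctuation from both ends of each token."""
    tokens = (part.strip(string.punctuation) for part in description.split())
    return [token for token in tokens if token]


def token_metrics(description):
    """Compute a few of the per-description values shown in the table above."""
    tokens = tokenize(description)
    count = len(tokens)
    if count == 0:
        return {"desc_length": len(description), "token_count": 0}
    return {
        "desc_length": len(description),
        "token_count": count,
        "average_token_length": sum(len(t) for t in tokens) / count,
        "token_capitalized": sum(t[0].isupper() for t in tokens) / count,
        "token_lowercased": sum(t[0].islower() for t in tokens) / count,
    }


# The example description from the table above.
example = ("A corner view of the Santa Monica City Hall.; "
           "Streetscape. Horizontal photography.")
print(token_metrics(example))
```

Under these assumptions the sketch gives a length of 82, 12 tokens and an average token length of about 5.58 for the example record, in line with the indexed values in the table.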

This post will try to pull together some of the data from the different fields listed above and present it in a way that we can hopefully derive some meaning from.

More Description Length Discussion

In the previous posts I’ve primarily focused on the length of the description fields. There are two other fields that I’ve indexed that are related to the length of the description fields: the number of tokens in a description and the average token length.

I’ve included those values below. I’ve included two mean values, one for all of the descriptions in the dataset (17,884,946 descriptions) and the other for the descriptions that are 1 character in length or more (13,771,105 descriptions).

Field Mean – Total Mean – 1+ length
desc_length_i 83.321 108.211
token_count_i 13.346 17.333
average_token_length_f 3.866 5.020

The graphs below are based on the numbers for just the descriptions that are 1 character in length or more.

This first graph is being reused from a previous post that shows the average length of description by Provider/Hub.  David Rumsey and the Getty are the two that average over 250 characters per description.

Average Description Length by Hub

It shouldn’t surprise you that David Rumsey and the Getty are two of the Providers/Hubs that have the highest average token counts, since longer descriptions generally create more tokens. There are a few differences that don’t match this though: USC, which has an average of just over 50 characters for the average description length, comes in as the third highest in average token count at over 40 tokens per description. There are a few other providers/hubs that look a bit different than their average description length would suggest.

Average Token Count by Provider

Below is a graph of the average token lengths by providers.  The lower the number is the lower average length of a token.  The mean for the entire DPLA dataset for descriptions of length 1+ is just over 5 characters.

Average Token Length by Provider

That’s all I have to say about the various statistics related to length for this post. I swear! Next we move on to some of the other metrics that I calculated when indexing things.

Other Metrics for the Description Field

Throughout this analysis I had a question of when to take into account that there were millions of records in the dataset that had no description present.  I couldn’t just throw away that fact in the analysis but I didn’t know exactly what to do with them.  So below I present statistics for the average of many of the fields I indexed as both the mean of all of the descriptions and then the mean of just the descriptions that are one or more characters in length.  The graphs that follow the table below are all based on the subset of descriptions that are greater than or equal to one character in length.

Field                    Mean (all descriptions)   Mean (descriptions of length 1+)
percent_int_f            12.368%                   16.063%
percent_punct_f          4.420%                    5.741%
percent_letters_f        50.730%                   65.885%
percent_printable_f      76.869%                   99.832%
percent_special_char_f   0.129%                    0.168%
token_capitalized_f      26.603%                   34.550%
token_lowercased_f       32.112%                   41.705%
percent_1000_f           19.516%                   25.345%
percent_5000_f           31.591%                   41.028%
percent_en_dict_f        49.539%                   64.338%
percent_stopwords_f      12.749%                   16.557%

Stopwords

Stopwords are words that occur very commonly in natural language.  I used a list of 127 stopwords for this work to help understand what percentage of a description (based on tokens) is made up of stopwords.  While stopwords generally carry little meaning for natural language, they are a good indicator of natural language,  so providers/hubs that have a higher percentage of stopwords would probably have more descriptions that resemble natural language.
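
The token-based percentage described above can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the analysis: the seven-word `STOPWORDS` set is a made-up stand-in for the 127-word list, tokens are assumed to be whitespace-separated, and matching is assumed to be case-insensitive.

```python
# Illustrative subset only; the real list had 127 entries.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def percent_stopwords(description, stopwords=STOPWORDS):
    """Percentage of whitespace-separated tokens that are stopwords."""
    tokens = description.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stopwords)
    return 100.0 * hits / len(tokens)


# Made-up description: 4 of its 7 tokens are in the stopword set.
sample = "A photograph of the courthouse in Denton"
print(round(percent_stopwords(sample), 2))
```

A token with attached punctuation (for example "the,") would not match in this sketch, so a real implementation would probably need to strip punctuation first.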

Percent Stopwords by Provider

Punctuation

I was curious about how much punctuation was present in a description on average.  I used the following characters as my set of “punctuation characters”:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I counted the characters in a description that belong to this set and then divided that count by the total description length to get the percentage of the description that is punctuation.
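
As a sketch of that calculation in Python: the character set listed above is the same as the standard library's `string.punctuation` constant, so the example uses it directly. The function name `percent_punctuation` and the sample string are hypothetical, and whether the original analysis used this constant is an assumption.

```python
import string

# string.punctuation contains the same 32 characters listed above.
PUNCTUATION = set(string.punctuation)


def percent_punctuation(description):
    """Percentage of characters in the description that are punctuation."""
    if not description:
        return 0.0
    hits = sum(1 for ch in description if ch in PUNCTUATION)
    return 100.0 * hits / len(description)


# Made-up description: 4 punctuation characters out of 21.
sample = "Denton, Texas (1936)."
print(round(percent_punctuation(sample), 2))
```

Empty descriptions are given 0% here to avoid dividing by zero; how they should be treated in the averages is a separate decision.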

Percent Punctuation by Provider

Punctuation is common in natural language but it occurs relatively infrequently. For example, that last sentence was eighty characters long and only one of them was punctuation (the period at the end of the sentence). That comes to a percent_punctuation of only 1.25%.  In the graph above you will see that the bhl provider/hub has over 50% of its descriptions with 25-49% punctuation.  That’s very high compared to the other hubs and to the overall average of about 5% for the DPLA dataset. Digital Commonwealth has a percentage of descriptions that are 50-74% punctuation, which is pretty interesting as well.

Integers

Next up in our list of things to look at is the percentage of the description field that consists of integers.  For review,  integers are digits,  like the following.

0123456789

I used the same process for the percent integer as I did for the percent punctuation mentioned above.

Percent Integer by Provider

You can see that there are several providers/hubs that have quite a high percentage integer for their descriptions.  These providers/hubs are the bhl and the smithsonian.  The smithsonian has over 70% of its descriptions with percent integers of over 70%.

Letters

Once we’ve looked at punctuation and integers, that really just leaves letters of the alphabet to make up the rest of a description field.

That’s exactly what we will look at next. For this I used the following characters to define letters.

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

I didn’t perform any normalization or folding of characters, so letters with diacritics aren’t counted as letters in this analysis, but we will look at those a little bit later.
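
To make the effect on accented letters concrete, here is a small Python sketch. The letter set above matches `string.ascii_letters`; the function name `percent_letters` is hypothetical, and the sample string is simply a name containing one accented letter.

```python
import string

# Same 52 characters as the list above: a-z followed by A-Z.
LETTERS = set(string.ascii_letters)


def percent_letters(description):
    """Percentage of characters that are unaccented ASCII letters."""
    if not description:
        return 0.0
    hits = sum(1 for ch in description if ch in LETTERS)
    return 100.0 * hits / len(description)


# \u00f3 is "o" with an acute accent; it is not in LETTERS.
sample = "Archivo Hist\u00f3rico Provincial de Granada"
ascii_count = sum(1 for ch in sample if ch in LETTERS)
unicode_count = sum(1 for ch in sample if ch.isalpha())
print(ascii_count, unicode_count, round(percent_letters(sample), 2))
```

The string has 35 letters by Unicode's definition but only 34 by the ASCII-only definition, so descriptions in languages with many accented characters will score lower on this metric.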

Percent Letter by Provider

For percent letters you would expect most descriptions to consist of a high percentage of letters.  Generally this appears to be true, but there are some odd providers/hubs again, mainly bhl and the smithsonian, though nypl, kdl and gpo also seem to have a different distribution of letters than others in the dataset.

Special Characters

The next thing to look at was the percentage of “special characters” used in a description.  For this I used the following definition of “special character”: if a character is not present in the following list of characters (which also includes whitespace characters), then it is considered to be a “special character”.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~  

Percent Special Character by Provider

A note on reading the graph above: keep in mind that the y-axis only covers 95-100%, so while USC looks different here, that difference represents only 3% of its descriptions, those in which 50-100% of the characters are special characters.  These are most likely descriptions with metadata created in a non-English language.
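
The definition of “special character” given above can be sketched in Python as follows. The allowed set is built from digits, ASCII letters, punctuation and whitespace, which is the same set as the standard library's `string.printable`. Exactly which whitespace characters were included in the original analysis is an assumption, as are the function name and the sample string.

```python
import string

# Assumption: "whitespace" means the characters in string.whitespace.
ALLOWED = set(
    string.digits + string.ascii_letters + string.punctuation + string.whitespace
)


def percent_special(description):
    """Percentage of characters that fall outside the allowed set."""
    if not description:
        return 0.0
    special = sum(1 for ch in description if ch not in ALLOWED)
    return 100.0 * special / len(description)


# Made-up description; \u00e9 is "e" with an acute accent.
sample = "Caf\u00e9 1900"
print(round(percent_special(sample), 2))
```

Under this definition the printable and special character percentages for a description always add up to 100%, which is consistent with the percent_printable_f and percent_special_char_f values in the table earlier in this post.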

URLs

The final graph I want to look at in this post is the percentage of descriptions for a provider/hub that has a URL present in its description.  I used the presence of either http:// or https:// in the description to define if it does or doesn’t have a URL present.
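
A minimal sketch of that check in Python. The function names and the sample descriptions (including the example.org addresses) are hypothetical; only the two prefixes come from the description above.

```python
def has_url(description):
    """True if the description contains an http:// or https:// prefix."""
    return "http://" in description or "https://" in description


def percent_with_url(descriptions):
    """Percentage of descriptions in which a URL prefix is present."""
    if not descriptions:
        return 0.0
    return 100.0 * sum(has_url(d) for d in descriptions) / len(descriptions)


# Made-up descriptions: two of the four contain a URL prefix.
samples = [
    "Finding aid available at http://example.org/findingaid",
    "Photograph of a courthouse.",
    "See https://example.org/item/1 for more information.",
    "",
]
print(percent_with_url(samples))
```

This is a substring test, so it will miss URLs written without a scheme (for example "www.example.org") and it is case-sensitive.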

Percent URL by Provider

The majority of providers/hubs don’t have URLs in their descriptions, with a few obvious exceptions.  The providers/hubs washington, mwdl, harvard, gpo and david_rumsey do have a reasonable number of descriptions with URLs, with washington leading at almost 20% of its descriptions having a URL present.

Again, this analysis is just looking at what high-level information about the descriptions can tell us.  The only metric we’ve looked at that actually goes into the content of the description field to pull out a little bit of meaning is the percent stopwords.  I have one more post in this series before we wrap things up and then we will leave descriptions in the DPLA alone for a bit.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Descriptive Metadata Lengths: By Provider/Hub

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs that are represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum description length, the maximum length, the number of descriptions (remember that the field can be multi-valued, so there are more descriptions than records for a provider/hub), the sum (all of the lengths added together), the mean of the length and then finally the standard deviation.

provider min max count sum mean stddev
artstor 0 6,868 128,922 9,413,898 73.02 178.31
bhl 0 100 123,472 775,600 6.28 8.48
cdl 0 6,714 563,964 65,221,428 115.65 211.47
david_rumsey 0 5,269 166,313 74,401,401 447.36 861.92
digital-commonwealth 0 23,455 455,387 40,724,507 89.43 214.09
digitalnc 1 9,785 241,275 45,759,118 189.66 262.89
esdn 0 9,136 197,396 23,620,299 119.66 170.67
georgia 0 12,546 875,158 135,691,768 155.05 210.85
getty 0 2,699 264,268 80,243,547 303.64 273.36
gpo 0 1,969 690,353 33,007,265 47.81 58.20
harvard 0 2,277 23,646 2,424,583 102.54 194.02
hathitrust 0 7,276 4,080,049 174,039,559 42.66 88.03
indiana 0 4,477 73,385 6,893,350 93.93 189.30
internet_archive 0 7,685 523,530 41,713,913 79.68 174.94
kdl 0 974 144,202 390,829 2.71 24.95
mdl 0 40,598 483,086 105,858,580 219.13 345.47
missouri-hub 0 130,592 169,378 35,593,253 210.14 2325.08
mwdl 0 126,427 1,195,928 174,126,243 145.60 905.51
nara 0 2,000 700,948 1,425,165 2.03 28.13
nypl 0 2,633 1,170,357 48,750,103 41.65 161.88
scdl 0 3,362 159,681 18,422,935 115.37 164.74
smithsonian 0 6,076 2,808,334 139,062,761 49.52 137.37
the_portal_to_texas_history 0 5,066 1,271,503 132,235,329 104.00 95.95
tn 0 46,312 151,334 30,513,013 201.63 248.79
uiuc 0 4,942 63,412 3,782,743 59.65 172.44
undefined_provider 0 469 11,436 2,373 0.21 6.09
usc 0 29,861 1,076,031 60,538,490 56.26 193.20
virginia 0 268 30,174 301,042 9.98 17.91
washington 0 1,000 42,024 5,258,527 125.13 177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.

Average Description Length

The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

What stands out to me is that several bars are very small on this graph, specifically those for the providers bhl, kdl, nara, undefined_provider, and virginia.  I also noticed that david_rumsey has the highest average description length at about 450 characters.  Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn, which are at about 200 characters for the average length.

One thing to keep in mind from the previous post is that the average description length for the whole DPLA was 83.32 characters, so many of the hubs were over that and some significantly over that number.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to take a look at the standard deviation in addition to just the average. That way you are able to get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

There are a few providers/hubs that I think stand out from the others by looking at the chart. First david_rumsey has a stddev just short of double its average length.  The mwdl and the missouri-hub have a very high stddev compared to their average. For this dataset, it appears that these partners have a huge range in their lengths of descriptions compared to others.

There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.
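
One way to put these comparisons on a common scale is to divide each stddev by its mean. The short Python sketch below does that for the five providers mentioned here, using the mean and stddev values copied from the table at the top of this post; the variable names are arbitrary.

```python
# (mean, stddev) pairs copied from the table above.
stats = {
    "david_rumsey": (447.36, 861.92),
    "getty": (303.64, 273.36),
    "missouri-hub": (210.14, 2325.08),
    "mwdl": (145.60, 905.51),
    "the_portal_to_texas_history": (104.00, 95.95),
}

# Ratio of stddev to mean; values above 1 mean the spread exceeds the mean.
ratios = {name: stddev / mean for name, (mean, stddev) in stats.items()}

# Providers whose stddev is lower than their mean.
below_mean = sorted(name for name, ratio in ratios.items() if ratio < 1)

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.2f}")
```

The ratio for david_rumsey comes out just under 2, while missouri-hub is over 11 and mwdl is over 6, which lines up with the observations above.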

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partners/hubs that had some seriously long descriptions.

Longest Description by Hub

Remember the chart before this one that showed the average and the stddev next to each other for each provider/hub? There we saw a pretty large stddev for missouri-hub and mwdl. You may see why that is with the chart above.  Both of these hubs have descriptions of over 120,000 characters.

There are six providers/hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc.  I could be wrong but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full-text of the resource making its way into the metadata record.  We should remember,  “metadata is data about data”… not the actual data.

Total Description Length of Descriptions by Provider/Hub

Total Description Length of All Descriptions by Hub

Just for fun I was curious about how the total lengths of the description fields per provider/hub would look on a graph, since those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content. In fact the most is contributed by mwdl.  The sourcing of these records explains why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have. The provider/hub mwdl is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try and determine on a provider/hub basis what might be a notion of “too long” for a given provider based on some methods of outlier detection. I’ve done the work for this but don’t know enough about the mathy parts to know if it is relevant to this dataset or not.

I have about a dozen more metrics that I want to look at for these records so I’m going to have to figure out a way to move through them a bit quicker otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post,  please let me know via Twitter.