
How do metadata records change over time?

Since September 2009 the UNT Libraries has been versioning the metadata edits that happen in the digital library system that powers The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.  In those eight years the collection has grown from a modest 66,000 digital objects to the 1,814,000 digital objects that we manage today.  We've always tried to think of the metadata in our digital library as a constantly changing dataset, but just how much it changes is something we don't always pay attention to.

In 2014 a group of us worked on a few papers about metadata change at a fairly high level in the repository. How Descriptive Metadata Changes in the UNT Libraries Collections: A Case Study reported on the analysis of almost 700,000 records that were in the repository at that time.  Another study, Exploration of Metadata Change in a Digital Repository, was presented the following year in 2015 by colleagues in the UNT College of Information; it used a smaller sample of records to answer a few more questions about what changes in descriptive metadata at the UNT Libraries.

It has been a few years since these studies so it is time again to take a look at our metadata and do a little analysis to see if anything pops out.

Metadata Edit Dataset

The dataset we are using for this analysis was generated on May 4th, 2017 by creating a copy of all of the metadata records and their versions to a local filesystem for further analysis. The complete dataset is for 1,811,640 metadata records.

Of those 1,811,640 metadata records, 683,933 had been edited at least once since they were loaded into the repository.  That means 62% of the records have just one instance (no changes) in the system, while the other 38% have at least one edit.

Records Edited in Dataset

We store all of our metadata on the filesystem as XML files using a local metadata format we call UNTL.  When a record is edited, the old version of the record is renamed with a version number and the new version of the record takes its place as the current version of a record. This has worked pretty well over the years for us and allows us to view previous versions of metadata records through a metadata history screen in our metadata system.

UNT Metadata History Interface

This metadata history view is helpful for tracking down strange things that happen in metadata systems from time to time.  Because some records are edited multiple times (like in the example screenshot above) we end up with a large number of metadata edits that we can look at over time.

After staging all of the metadata records on a local machine I wrote a script that would compare two different records and output which elements in the record changed. While this sounds like a pretty straightforward thing to do, there are some fiddly bits that you need to watch out for that I will probably cover in a separate blog post. Most of these have to do with XML as a serialization format and some questions about how you interpret different things.  As a quick example, think about these three notations.

<title></title>
<title />
<title qualifier='officialtitle'></title>

When comparing fields, should those three examples all mean the same thing as far as a metadata record is concerned?  Like I said, that's something to get into in a later post.
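To make the comparison step a bit more concrete, here is a minimal sketch of element-level diffing using only the Python standard library.  This is illustrative rather than the actual script, and it sidesteps most of the fiddly bits mentioned above.

import xml.etree.ElementTree as ET
from collections import defaultdict

def element_values(xml_string):
    # Collect (qualifier, text) pairs for each top-level element so that
    # <title/> and <title qualifier='officialtitle'/> can be told apart.
    values = defaultdict(set)
    for child in ET.fromstring(xml_string):
        text = (child.text or "").strip()
        values[child.tag].add((child.get("qualifier", ""), text))
    return values

def changed_elements(old_xml, new_xml):
    # Return the names of elements whose values differ between versions.
    old, new = element_values(old_xml), element_values(new_xml)
    return {tag for tag in set(old) | set(new) if old[tag] != new[tag]}

old = "<metadata><title>Cats</title><subject>Pets</subject></metadata>"
new = "<metadata><title>Cats!</title><subject>Pets</subject></metadata>"
print(changed_elements(old, new))  # {'title'}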

Once I had my script to compare two records, the next step was to create pairs of records to compare and then iterate over all of those record pairs.  This resulted in 1,332,936 edit events that I could look at.  I created a JSON document for each of these edit events and then loaded those documents into Solr for later analysis.  Here is what one of these records looks like.

{
  "change_citation": 0,
  "change_collection": 0,
  "change_contributor": 1,
  "change_coverage": 0,
  "change_creator": 0,
  "change_date": 0,
  "change_degree": 0,
  "change_description": 0,
  "change_format": 0,
  "change_identifier": 1,
  "change_institution": 0,
  "change_meta": 1,
  "change_note": 0,
  "change_primarySource": 0,
  "change_publisher": 0,
  "change_relation": 0,
  "change_resourceType": 0,
  "change_rights": 0,
  "change_source": 0,
  "change_subject": 0,
  "change_title": 0,
  "collections": [
    "NACA",
    "TRAIL"
  ],
  "completeness_change": 0,
  "content_length_change": 12,
  "creation_to_edit_seconds": 123564535,
  "edit_number": 1,
  "elements_changed": 3,
  "id": "metadc58589_2015-10-16T11:02:09Z",
  "institution": [
    "UNTGD"
  ],
  "metadata_creation_date": "2011-11-16T07:33:14Z",
  "metadata_edit_date": "2015-10-16T11:02:09Z",
  "metadata_editor": "htarver",
  "r1_ark": "ark:/67531/metadc58589",
  "r1_completeness": 0.9830508474576272,
  "r1_content_length": 2108,
  "r1_record_length": 2351,
  "r2_ark": "ark:/67531/metadc58589",
  "r2_completeness": 0.9830508474576272,
  "r2_content_length": 2120,
  "r2_record_length": 2543,
  "record_length_change": 192,
  "systems": "DC"
}

Some of the fields don't mean much right now, but the main fields we want to look at are the change_* fields.  These represent the 21 metadata elements that we use for the UNTL metadata format.  Here they are in a more compact view.

  • title
  • creator
  • contributor
  • publisher
  • date
  • description
  • subject
  • primarySource
  • coverage
  • source
  • citation
  • relation
  • collection
  • institution
  • rights
  • resourceType
  • format
  • identifier
  • degree
  • note
  • meta

You may notice that these elements include the 15 Dublin Core elements plus six other fields that we’ve found useful to have in our element set.

The first thing I wanted to answer was which of these 21 fields was edited the most in the 1.3 million records edits that we have.

Metadata Element Changes

You can see that the meta field changes in almost 100% of the edits.  That is because whenever you edit a record, the values for the most recent metadata editor and the edit time are updated, so this element should change with every edit.

I have to admit that I was surprised that the description field was the most edited field.  The description field changed in some way in 403,713 (30%) of the edits. This is followed by title at 304,396 (23%) and subject at 272,703 (20%).
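Counts like these are easy to pull back out of Solr once the edit-event documents are indexed.  Here is a sketch using the pysolr library; the core name "metadata_edits" is hypothetical, and this assumes the JSON documents shown above were indexed as-is.

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/metadata_edits")

# rows=0 because we only want the hit counts, not the documents.
total = solr.search("*:*", rows=0).hits
for element in ("description", "title", "subject", "meta"):
    hits = solr.search("change_%s:1" % element, rows=0).hits
    print("%-12s %9d (%.0f%%)" % (element, hits, 100.0 * hits / total))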

There are a number of other things that I will be doing with this dataset as I move forward. In addition to which fields changed, I should be able to look at how many fields change on average in a record edit.  I then want to see if there are any noticeable differences when you look at different subsets like specific editors or collections.

So if you are interested in metadata change stay tuned.

If you have questions or comments about this post,  please let me know via Twitter.

Compressibility of the DPLA Creator Field by Hub

This is the second post in a series of posts exploring the metadata from the Digital Public Library of America.

In the first post I introduced the idea of using compressibility of a field as a measure of quality.

This post I wanted to look specifically at the dc.creator field in the DPLA metadata dataset.

DC.Creator Overview

The first thing to do is to give you an overview of the creator field in the DPLA metadata dataset.

As I mentioned in the last post there are a total of 15,816,573 records in the dataset I'm working with.  These records are contributed by a wide range of institutions from across the US through Hubs.  There are 32 hubs present in the dataset, plus 102 records that for one reason or another aren't associated with a hub and have "None" for the hub name.

In the graph below you can see how the records are distributed across the different hubs.

Total Records by Hub

These are similar numbers to what you see in the more up-to-date numbers on the DPLA Partners page.

The next chart shows how the number of records per hub and the number of records with creator values compare.

Total Records and Records with Creators by Hub

You should expect that the red columns in the chart above will most often be shorter than the blue columns.

Below is a slightly different way of looking at that same data.  This time it is the percentage of records that contain a creator.

Records with Creator to Total Records

You see that a few of the hubs have almost 100% of their records with a creator, while others have a very low percentage of records with creators.

Looking at the number of records that have a creator value and then at the total number of names, you can see that some hubs, like hathitrust, have pretty much a 1:1 name-to-record ratio, while others, like nara, have multiple names per record.

Total Creators and Name Instances

To get an even better sense of this you can look at the average number of creators/names per record. In this chart you see that david_rumsey has 2.49 creators per record, followed by nara at 2.03, bhl at 1.78, and internet_archive at 1.70. There are quite a few hubs (14) that average very close to one name per record.

Average Names Per Record

The next thing to look at is the number of unique names per hub.  The hathitrust hub sticks out again with the most unique names for a hub in the DPLA.

Unique Creators by Hub

Looking at the ratio between the number of creator instances and the number of unique names, you can see there is something interesting happening with the nara hub.  I put the chart below on a logarithmic scale so you can see things a little better.  Notice that nara has a 1,387:1 ratio of creator instances to unique creators.

Creator to Unique Ratio

One way to interpret this is to say that hubs with a higher ratio have more records that share the same creator name.

Compressibility

Now that we have an overview of the creator field as a whole we want to turn our attention to the compressibility of each of the fields.

I decided to compare the results of four different algorithms: lowercase hash, normalize hash, fingerprint hash, and aggressive fingerprint hash. Below is a table that shows the number of unique values for the field after each of the hashes has been applied.  You will notice that as you read from left to right the numbers generally go down.  This relates to the aggressiveness of the hashing algorithm being used.

Hub Unique Names Lowercase Hash Normalize Hash Fingerprint Hash Aggressive Fingerprint Hash
artstor 7,552 7,547 7,550 7,394 7,304
bhl 44,936 44,927 44,916 44,441 42,960
cdl 47,241 46,983 47,209 45,681 44,676
david_rumsey 8,861 8,843 8,859 8,488 8,375
digital-commonwealth 32,028 32,006 32,007 31,783 31,568
digitalnc 31,016 30,997 31,006 30,039 29,730
esdn 22,401 22,370 22,399 21,940 21,818
georgia 21,821 21,792 21,821 21,521 21,237
getty 2,788 2,787 2,787 2,731 2,724
gpo 29,900 29,898 29,898 29,695 29,587
harvard 4,865 4,864 4,855 4,845 4,829
hathitrust 876,773 872,702 856,703 838,848 780,433
il 16,014 15,971 15,983 15,569 15,409
indiana 6,834 6,825 6,832 6,692 6,650
internet_archive 105,381 105,302 104,820 102,390 99,729
kdl 3,098 3,096 3,098 3,083 3,066
mdl 69,617 69,562 69,609 69,013 68,756
michigan 2,725 2,715 2,723 2,676 2,675
missouri-hub 5,160 5,154 5,160 5,070 5,039
mwdl 49,836 49,724 49,795 48,056 47,342
nara 1,300 1,300 1,300 1,300 1,249
None 21 21 21 21 21
nypl 24,406 24,406 24,388 23,462 23,130
pennsylvania 10,350 10,318 10,349 10,056 9,914
scdl 11,976 11,823 11,973 11,577 11,368
smithsonian 67,941 67,934 67,826 67,242 65,705
the_portal_to_texas_history 28,686 28,653 28,662 28,154 28,066
tn 2,561 2,556 2,561 2,487 2,464
uiuc 3,524 3,514 3,522 3,470 3,453
usc 10,085 10,061 10,071 9,872 9,785
virginia 3,732 3,732 3,732 3,731 3,681
washington 12,674 12,642 12,669 12,184 11,659
wisconsin 19,973 19,954 19,960 19,359 19,127

Next I will work through each of the hashing algorithms and look at the compressibility of each field after the given algorithm has been applied.

Lowercase Hash: This hashing algorithm converts all uppercase characters to lowercase and leaves all lowercase characters unchanged.  The result is generally a very low amount of compressibility for each of the hubs.  You can see this in the chart below.

Lowercase Hash Compressibility

Normalize Hash: This hash just converts characters down to their ASCII equivalents.  For example, it converts gödel to godel.  The compressibility results of this hashing function are quite a bit different from the lowercase hash above.  You see that hathitrust has 2.3% compressibility of its creator names.

Normalize Hash Compressibility

Fingerprint Hash: This uses the algorithm that OpenRefine describes in depth here.  The algorithm incorporates a lowercase function as well as a normalize function in the overall process.  You can see that there is a bit more consistency between the different compressibility values.

Fingerprint Hash Compressibility

Aggressive Fingerprint Hash: This algorithm takes the basic fingerprint algorithm described above and adds one more step.  That step is to remove pieces of the name that are only numbers, such as dates.  This hashing function will most likely have more false positives than any of the previous algorithms, but it is interesting to look at the results.

Aggressive Fingerprint Hash Compressibility
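To make that extra step concrete, here is a sketch of an aggressive fingerprint in Python.  It folds in the basic fingerprint steps (lowercase, normalize, strip punctuation, sort and de-duplicate tokens) and then drops purely numeric tokens; the details may differ from the exact implementation behind the table above.

import re
import unicodedata

def aggressive_fingerprint(value):
    value = unicodedata.normalize("NFKD", value.strip().lower())
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s]", "", value)  # strip punctuation
    tokens = sorted(set(value.split()))
    # The extra step: drop tokens that are purely numeric, such as dates.
    tokens = [t for t in tokens if not t.isdigit()]
    return " ".join(tokens)

print(aggressive_fingerprint("Roeding, George Christian, 1868-1928"))
# christian george roeding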

This final chart puts together the four previous charts so they can be compared a bit easier.

All Compressibility

Conclusion

So now we’ve looked at the compressibility of the the creator fields for each of the 32 hubs that make up the DPLA.

I’m not sure that I have any good takeaways so far in this analysis. I think there are a few other metrics that we should look at before we start saying if this information is or isn’t useful as a metric of metadata quality.

I do know that I was surprised by the compressibility of the hathitrust creators. This is especially interesting when you consider that the source for most of those records is MARC-based catalog records that in theory should be backed by some sort of authority record. Other hubs, especially the service hubs, tend not to have records that are based as much on authority records.  Not really groundbreaking, but interesting to see in the data.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Metadata Fun: Compression as a measure of data quality

This past week I had the opportunity to participate in an IMLS-funded workshop about managing local authority records, hosted by Cornell University at the Library of Congress.  It was two days of discussions about issues related to managing local and aggregated name authority records. This meeting got me thinking more about names in our digital library metadata, both locally (at UNT) and in aggregations (DPLA).

It has been a while since I worked on a project with the DPLA metadata dataset that they provide for bulk download so I figured it was about time to grab a copy and poke around a bit.

This time around I'm interested in looking at some indicators of metadata quality.  Loosely, it is a measure of how well a set of metadata conforms to itself.  Specifically I want to look at how name values from dc.creator, dc.contributor, and dc.publisher compare with each other.

I’ll give a bit of an overview to get us started.

Say we had these four values for the dc.creator of an awesome movie.

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we sort these values, make them unique, and then count the instances, we will get the following.

1  Alexander Johan Hjalmar Skarsgard
1  Alexander Johan Hjalmar Skarsgård
1  Skarsgard, Alexander Johan Hjalmar
1  Skarsgård, Alexander Johan Hjalmar

So we have 4 unique name strings in our dataset.

If we applied a normalization algorithm that turned the letter å into an a and then tried to make our data unique we would end up with the following.

2  Alexander Johan Hjalmar Skarsgard
2  Skarsgard, Alexander Johan Hjalmar

Now we have only two name strings in the dataset, each with an instance count of two.

We can measure the compression rate by taking the original number of instances and dividing it by this new number.  4/2 = 2 or a 2:1 compression rate.

Another way to do it is to get the amount of space saved with this compression.  This is just a different equation: 1 - 2/4 = 0.5, or a 50% space savings.
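Expressed as code, using the numbers from the example above:

# The two compression measures from the example above, as code.
original_count = 4  # unique name strings before normalization
reduced_count = 2   # unique name strings after normalization

compression_rate = original_count / reduced_count   # 4/2 = 2.0, i.e. 2:1
space_savings = 1 - reduced_count / original_count  # 1 - 2/4 = 0.5, i.e. 50%

print(compression_rate, space_savings)  # 2.0 0.5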

If we apply an algorithm similar to the one that OpenRefine uses and calls a “fingerprint” we can get the following from our first four values.

4 alexander hjalmar johan skarsgard

Now we’ve gone from four values down to one for a 4:1 compression rate or we’ve created a 75% space savings.

Relation to Quality

When we go back to our first four examples, we can come to the opinion pretty quickly that these are most likely supposed to be the same name.

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we saw this in our databases we would want to clean these up.  They would most likely lead to poor faceting in our discovery interface.  If a user wanted to find other items that had a dc.creator of Skarsgård, Alexander Johan Hjalmar, it is possible that they wouldn’t find any of the other three items when they clicked on a link to show more.

If we can agree that reducing the number of “near matches” in the dataset is an improvement, we might be able to use these data compression measures as a way of identifying which parts of a digital library might have consistency problems.

That’s exactly what I’m proposing to do here.  I want to find out if we can use a number of different algorithms on the values of dc.creator, dc.contributor, and dc.publisher in the DPLA metadata set and see how much these values compress the data.

Preparing the Data

I’m going to start with the all.json.gz file from the DPLA’s bulk metadata download page.

This file is a very large json file containing 15,816,573 records from the April 2017 DPLA metadata dump.

The first thing that I want to do is reduce this dataset, which is 6.1GB compressed, to something a little more manageable. I will start with the dc.creator information, using a set of commands for the wonderful tool jq that gets me what I'm after.

jq -nc --stream '. | fromstream(1|truncate_stream(inputs)) | {"provider": (._source.provider["@id"]), "id": (._source.id), "creator": ._source.sourceResource.creator?}'

The command I used above will transform each of the records in the DPLA dataset into something that looks like this:

{"provider":"http://dp.la/api/contributor/uiuc","id":"705e1e5f19331a6c8a554ce707059288","creator":null}
{"provider":"http://dp.la/api/contributor/uiuc","id":"bcae15d47f2544caf0407b1e17bf97cd","creator":["Harlow, G","Rogers, J"]}
{"provider":"http://dp.la/api/contributor/uiuc","id":"96cab3354d942e7ea2030f1452f5beb8","creator":["Drummond, S","Ridley, W"]}
{"provider":"http://dp.la/api/contributor/uiuc","id":"e3ce5090d0a8b3c247c84d6f0d5ff16e","creator":["Barber, J.T","Cardon, A"]}

This is now a large file with one small snippet of json on each line.  I can write straightforward Python scripts to process these lines and do some of the heavy lifting for analysis.
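As a sketch of that kind of processing, here is a small script that reads the jq output line by line and tallies creator instances and unique creator strings; the filename is hypothetical.

import json

creators = []
with open("dpla_creators.jsonl") as handle:  # the jq output from above
    for line in handle:
        record = json.loads(line)
        values = record.get("creator")
        if not values:
            continue  # skip records with "creator": null
        if isinstance(values, str):
            values = [values]  # a lone creator may appear as a bare string
        creators.extend(values)

print(len(creators))       # total creator instances
print(len(set(creators)))  # unique creator strings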

For this first pass I’m interested in all of the dc.creators in the whole DPLA dataset to measure the overall compression.

Here is a short set of these values.

Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Committee on Merchant Marine and Fisheries
Herdman, W. A. Sir, (William Abbott), 1858-1924
United States. Committee on Merchant Marine and Fisheries
Henderson, Joseph C
Fancher Creek Nurseries
Roeding, George Christian, 1868-1928
Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Animal and Plant Health Inspection Service
United States. Bureau of Entomology and Plant Quarantine
United States. Plant Pest Control Branch
United States. Plant Pest Control Division

The full list is 10,413,292 lines long when I ignore record instances that don’t have any value for creator.

The next thing to do is sort that list and make it unique which leaves me 1,445,688 unique creators in the DPLA metadata dataset.

Compressing the Data

For the first pass through the data I am going to use the “fingerprint algorithm” that OpenRefine describes in depth here.

The basics are as follows (from OpenRefine’s documentation)

  • remove leading and trailing whitespace
  • change all characters to their lowercase representation
  • remove all punctuation and control characters
  • split the string into whitespace-separated tokens
  • sort the tokens and remove duplicates
  • join the tokens back together
  • normalize extended western characters to their ASCII representation (for example “gödel” → “godel”)

If you’re curious, the code that performs this is in OpenRefine is here.

The next steps are to run this fingerprinting algorithm on each of the 1,445,688 creators, sort the created hash values, make them unique, and count the resulting lines.  This gives you the new unique creators based on the fingerprint algorithm.

I end up with 1,365,922 unique creator values based on the fingerprint.

That comes to a reduction of 5.52% of the unique values.

To give you an idea of what this looks like in practice, there are eleven different creator instances that have the fingerprint of "akademiia imperatorskaia nauk russia".

  • Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskai͡a akademii͡a nauk (Russia)
  • Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)
  • Imperatorskaia akademiia nauk (Russia)
  • Imperatorskai͡a akademïi͡a nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)

These 11 different versions of this name are distributed among five different DPLA Hubs.

Below is a table showing how the different versions are distributed across hubs.

Name Records bhl hathitrust internet_archive nypl smithsonian
Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia) 1 0 1 0 0 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) 13 0 11 2 0 0
Imperatorskai͡a akademīi͡a nauk (Russia) 7 0 7 0 0 0
Imperatorskai͡a akademii͡a nauk (Russia) 3 0 3 0 0 0
Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia) 1 0 1 0 0 0
Imperatorskaia akademīia nauk (Russia) 13 0 0 0 0 13
Imperatorskaia akademiia nauk (Russia) 4 0 0 0 4 0
Imperatorskai͡a akademïi͡a nauk (Russia) 1 0 1 0 0 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) 11 0 11 0 0 0
Imperatorskai͡a akademīi͡a nauk (Russia) 13 0 13 0 0 0
Imperatorskaia akademīia nauk (Russia) 211 211 0 0 0 0

When you look at the table you will see that bhl, internet_archive, nypl, and smithsonian each have their preferred way of representing this name.  Hathitrust, however, has eight different ways that it represents this single creator name in its dataset.

Next Steps

This post hopefully introduced the idea of using “field compressions” for name fields like dc.creator, dc.contributor, and dc.publisher as a way of looking at metadata quality in a dataset.

We calculated the amount of compression using OpenRefine’s fingerprint algorithm for the DPLA creator fields.  This ends up being 5.52% compression.

In the next few posts I will compare the different DPLA Hubs to see how they compare with each other.  I will probably play with a few different algorithms for creating the hash values I use.  Finally I will calculate a few metrics in addition to just the unique values (cardinality) of the field.

If you have questions or comments about this post,  please let me know via Twitter.

UNT Libraries’ Digital Collections 2016 in Review: Items

This post is just an overview of the 2016 year for the UNT Libraries’ Digital Collections.  I have wanted to do one of these for a number of years now but never really got around to it.  So here we go.

I plan to look at a few areas of activity for the digital collections: content added, usage, and some info on metadata curation activities.  This first post will focus on items added.

Items added

From January 1, 2016 until December 31, 2016 we added a total of 295,077 new items to the UNT Libraries’ Digital Collections.  The UNT Libraries’ Digital Collections encompasses The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.  The graphic below shows the number of records added to each of the systems throughout the year.

Items Added by System

The Portal to Texas History (PTH in the chart) had the most items added at 145,268 new items.  This was followed by the UNT Digital Library (DC in the chart) with 124,402 items and finally the Gateway to Oklahoma History (OK in the chart) with 25,809 new items.

If you look at files (often ‘pages’) instead of items the graph will change a bit.

New Pages by System

While we added the most items to The Portal to Texas History, we added the most pages of content to the UNT Digital Library.  In total we added 5,704,046 files to the Digital Collections in 2016.

Added by Date

The number of items added per month is a good way of getting an overview of activity across the year.  The graphic below presents that data.

New Items By Month

The average number of items added per month is 24,590, which is a very respectable number. When you look at the number of items added on a given day during the year, the graph is a bit harder to read, but you can see some days that had quite a bit of data loading going on.

New Items Added Per Day

As you can see it is a bit harder to tell what is going on.  Some days of note include May 19th, which had 19,858 items processed and uploaded, March 19th with 16,649, and January 13th with 13,338 new items added.  There are at least six other days with over 10,000 items processed and added to the digital collections.

If you take the number of items and spread them across the entire year you get an average of 808 items loaded into the system per day.  Not bad at all. There were actually 165 days during 2016 when no items were added to the Digital Collections, which leaves an impressive 200 days when new content was being processed and loaded. When you remove weekends you are left with content being added almost four days a week.

Another fun number to think about: at an average of 808 items added per day during 2016, that's 33.7 items added per hour, or just about one new item every two minutes, around the clock.

Items by Type

Next up is to take a look at what kind of items were added throughout the year.  I’m going to base these numbers off of the resource type field for each of the records.  If for some reason the item doesn’t have a resource type set then it will have a value of None.

Resource Type Item Count % of Total
text_newspaper 124,662 42.25%
text_report 56,279 19.07%
image_photo 42,203 14.30%
text_article 31,129 10.55%
video 12,238 4.15%
text_script 7,230 2.45%
sound 4,956 1.68%
image_drawing 4,097 1.39%
text_etd 2,763 0.94%
text 2,365 0.80%
text_leg 1,433 0.49%
image_postcard 1,193 0.40%
text_journal 886 0.30%
text_book 858 0.29%
text_pamphlet 778 0.26%
text_letter 541 0.18%
None 523 0.18%
text_clipping 174 0.06%
physical-object 144 0.05%
image_presentation 125 0.04%
text_legal 111 0.04%
text_review 107 0.04%
image_poster 89 0.03%
text_yearbook 47 0.02%
text_paper 37 0.01%
dataset 29 0.01%
image_map 22 0.01%
website 11 0.00%
image 11 0.00%
image_score 11 0.00%
image_artwork 8 0.00%
text_chapter 7 0.00%
collection 5 0.00%
text_poem 3 0.00%
interactive-resource 2 0.00%

I’ve taken the ten most commonly added item types, which account for over 97% of items added to the system and made a little pie chart out of them below.

Item by Type

As you can see, the Digital Collections added a large number of newspapers over the past year.  Newspapers accounted for 124,662, or 42%, of new items added to the system.  There were a large number of reports, photographs, and articles added as well.  Coming in as the fifth most added type are videos, of which we added 12,238 new video items.

Items by Partner

Because we work with a number of partners here at UNT, across Texas, and into Oklahoma, each item we upload into the system is associated with a partner. Throughout the year we added items to 154 different partner collections in the UNT Libraries' Digital Collections.  Below are the ten partners that contributed the most content to the collections in 2016.

Partner Partner Code Item Count Item Percentage
UNT Libraries Government Documents Department UNTGD 90,393 30.63%
UNT Libraries’ Special Collections UNTA 32,263 10.93%
Oklahoma Historical Society OKHS 25,786 8.74%
Texas Historical Commission THC 25,222 8.55%
UNT Libraries UNT 15,319 5.19%
Cuero Public Library CUERPU 5,901 2.00%
Nellie Pederson Civic Library CLIFNE 5,881 1.99%
Coleman Public Library CLMNPL 5,729 1.94%
Gladys Johnson Ritchie Library GJRL 4,850 1.64%
Abilene Christian University Library ACUL 4,359 1.48%

You can see that we had a strong year for the UNT Libraries’ Government Documents Department that added over 90,000 items to the system.  We have been ramping up the digitization activities for the UNT Libraries’ Special Collections and you can see the results with over 32,000 new items being added to the UNT Digital Library.

Closing

I think that’s just about it for the year overview of new content added to the UNT Libraries’ Digital Collections.  Next up I’m going to dig into some usage data that was collected from 2016 and see what that can tell us about last year.

I’m quite impressed with the amount of content that we added in 2016.  Adding 295,077 to the Digital Collections brought us to 1,751,015 items and 26,326,187 files (pages) of content in the systems.  I’m looking forward to 2017 and what it has in store for us.  At the rate we added content in 2016 I have a strong feeling that we will be passing the 2 million item mark.

If you have questions or comments about this post,  please let me know via Twitter.

LC Name Authority File Analysis: Where are the Commas?

This is the second in a series of blog posts on some analysis of the Name Authority File dataset from the Library of Congress. If you are interested in the setup of this work and a bit more background, take a look at the previous post.

The goal of this work is to better understand how personal and corporate names are formatted so that I can hopefully train a classifier to automatically sort a new name into either category.

In the last post we saw that commas seem to be important in differentiating between corporate and personal names.  Here is a graphic from the previous post.

Distribution of Commas in Name Strings

You can see that the majority of personal names (99%) have commas, with a much smaller set of corporate names (14%) having a comma present.

The next thing that I was curious about is whether the placement of the comma in the name string reveals anything about the kind of name it is.

How Many?

The first thing to look at is just counting the number of commas per name string.  My initial thought is that there are going to be more commas in the Corporate Names than in the Personal Names.  Let’s take a look.

Name Type Total Name Strings Names With Comma min 25% 50% 75% max mean std
Personal 6,362,262 6,280,219 1 1 1 2 8 1.309 0.471
Corporate 1,499,459 213,580 1 1 1 1 11 1.123 0.389

The overall statistics for the number of commas in the name strings indicate that there are actually more commas in the Personal Names than in the Corporate Names.  The Corporate Name with the most commas, in this case eleven, is International Monetary Fund. Office of the Executive Director for Antigua and Barbuda, the Bahamas, Barbados, Belize, Canada, Dominica, Granada, Ireland, Jamaica, St. Kitts and Nevis, St. Lucia, and St. Vincent and the Grenadines; you can view the name record here.

The Personal Name with the most commas had eight of them: Seu constante leitor, hum homem nem alto, nem baixo, nem gordo, nem magro, nem corcunda, nem ultra-liberal, que assistio no Beco do Proposito, e mora hoje no Cosme-Velho. You can view the name record here.

I can figure out the Corporate Name but needed a little help with the Personal Name, so Google Translate to the rescue. From what I can tell it translates to His constant reader, a man neither tall, nor short, nor fat, nor thin, nor hunchback nor ultra-liberal, who attended in the Alley of the Purpose, and lives today in Cosme-Velho, which I think is a pretty cool sounding Personal Name.

I was surprised when I made a histogram of the values and saw that it was actually pretty common for Personal Names to have more than one comma.   Very common actually.

Number of Commas in Personal Names

And while there are instances of more overall commas in Corporate Names, you generally are only going to see one comma per string.

Number of Commas in Corporate Names

Which Half?

The next thing that I wanted to look at is the placement of the first comma in the name string.

The numbers below represent the stats for just the name strings that contain a comma. The values give the position of the first comma as a percentage of the overall length of the name string.

Name Type Names With Comma min 25% 50% 75% max mean std
Personal 6,280,219 1.9% 26.7% 36.4% 46.7% 95.7% 37.3% 13.8%
Corporate 213,580 2.2% 60.5% 76.9% 83.3% 99.0% 69.6% 19.3%

If we look at these as graphics we can see some trends a bit better.  Here is a histogram of the placement of the first comma in the Personal Name strings.

Comma Percentage Placement for Personal Name

It shows the bulk of the names with a comma have that comma occurring in the first half (50%) of the string.

This looks a bit different with the Corporate Names as you can see below.

Comma Percentage Placement for Corporate Name

You will see that the placement of that first comma trends very strongly to the right side of the graph, definitely over 50%.

Let’s be Absolute

Next up I wanted to take a look at the absolute distance from the first comma to the first space character in the name string.

My thought is that a Personal Name is going to have an overall lower absolute distance than the Corporate Names.  Two examples will hopefully help you see why.

For a Personal Name string like “Phillips, Mark Edward” the absolute distance from the first comma to the first space is going to be one.

For a Corporate Name string like “Worldwide Documentaries, Inc.” the absolute distances from the first comma to the first space is fourteen.
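Here is a sketch of these comma features computed for a single name string.  The token count anticipates the next section, and the space distance assumes the string contains at least one space.

def comma_features(name):
    comma = name.find(",")
    if comma == -1:
        return None  # no comma in this name string
    first_space = name.find(" ")  # assumes a space exists in the name
    return {
        "comma_count": name.count(","),
        "first_comma_pct": 100.0 * comma / len(name),
        "comma_to_space_distance": abs(comma - first_space),
        "tokens_before_comma": len(name[:comma].split()),
    }

print(comma_features("Phillips, Mark Edward"))
# comma_count 1, first_comma_pct ~38.1, comma_to_space_distance 1,
# tokens_before_comma 1
print(comma_features("Worldwide Documentaries, Inc."))
# comma_to_space_distance is 14 here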

I’ll jump right to the graphs here.  First is the histogram of the Personal Name strings.

Personal Name: Absolute Distance Between First Space and First Comma

You can see that the vast majority of the name strings have an absolute distance from the first comma to the first space of 1 (that’s the value for the really tall bar).

If you compare this to the Corporate Name strings in graph below you will see some differences.

Corporate Name: Absolute Distance Between First Space and First Comma

Compared to the Personal Names, the Corporate Name graph has quite a bit more variety in the values.  Most of the values are higher than one.

If you are interested in the data tables they can provide some additional information.

Name Type Names With Comma min 25% 50% 75% max mean std
Personal 6,280,219 1 1 1 1 131 1.4 1.8
Corporate 213,580 1 18 27 37 270 28.9 17.4

Absolute Tokens

This next section is very similar to the previous one, but this time I am interested in the placement of the first comma in relation to the first token in the string.  I have a feeling that it will look similar to the absolute first-space distance we saw above, but it should normalize the data a bit because we are dealing with tokens instead of characters.

Name Type Names With Comma min 25% 50% 75% max mean std
Personal 6,280,219 1 1 1 1 17 1.1 0.3
Corporate 213,580 1 3 4 6 35 4.8 2.4

And now to round things out with graphs of both of the datasets for the absolute distance from first comma to first token.

Personal Name: Absolute Distance Between First Token and First Comma

Just as we saw in the section above the Personal Name strings will have commas that are placed right next to the first token in the string.

Corporate Name: Absolute Distance Between First Token and First Comma

The Corporate Names are a bit more distributed away from the first token.

Conclusion

Some observations that I have now that I’ve spent a little more time with the LC Name Authority File while working on this post and the previous one.

First, it appears that the presence of a comma in a name string is a very good indicator that it is going to be a Personal Name.  Also, if the first comma occurs in the first half of the name string it is most likely a Personal Name, and if it occurs in the second half of the string it is most likely a Corporate Name. Finally, the absolute distance from the first comma to either the first space or the first token is a good indicator of whether the string is a Personal Name or a Corporate Name.
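As a throwaway illustration, those observations translate directly into a rule-of-thumb classifier.  The threshold is illustrative and untuned; this is not the trained classifier this series is working toward.

def guess_name_type(name):
    comma = name.find(",")
    if comma == -1:
        return "CorporateName"  # most names without a comma are corporate
    if comma / len(name) < 0.5:
        return "PersonalName"   # first comma in the first half of the string
    return "CorporateName"

print(guess_name_type("Phillips, Mark Edward"))          # PersonalName
print(guess_name_type("Worldwide Documentaries, Inc."))  # CorporateName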

If you have questions or comments about this post,  please let me know via Twitter.

First step analysis of Library of Congress Name Authority File

For a class this last semester I spent a bit of time working with the Library of Congress Name Authority File (LC-NAF) that is available here in a number of downloadable formats.

After downloading the file and extracting only the parts I was interested in, I was left with 7,861,721 names to play around with.

The resulting dataset has three columns: the unique identifier for a name, the category of either PersonalName or CorporateName, and finally the authoritative string for the given name.

Here is an example set of entries in the dataset.

<http://id.loc.gov/authorities/names/no2015159973> PersonalName Thomas, Mike, 1944-
<http://id.loc.gov/authorities/names/n00004656> PersonalName Gutman, Sharon A.
<http://id.loc.gov/authorities/names/no99024929> PersonalName Hornby, Lester G. (Lester George), 1882-1956
<http://id.loc.gov/authorities/names/n86050616> PersonalName Borisi\uFE20u\uFE21k, G. N. (Galina Nikolaevna)
<http://id.loc.gov/authorities/names/no2011132525> PersonalName Cope, Samantha
<http://id.loc.gov/authorities/names/nr92002092> PersonalName Okuda, Jun
<http://id.loc.gov/authorities/names/n2008028760> PersonalName Brandon, Wendy
<http://id.loc.gov/authorities/names/no2008088468> PersonalName Gminder, Andreas
<http://id.loc.gov/authorities/names/nb2013005548> CorporateName Archivo Hist\u00F3rico Provincial de Granada
<http://id.loc.gov/authorities/names/n84081250> PersonalName Mermier, Pierre-Marie, 1790-1862
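Lines like these are straightforward to tally with a short script.  Here is a sketch that counts the categories and comma presence, assuming the three columns are whitespace-separated (the name string is the remainder of the line) and a hypothetical filename.

from collections import Counter

totals = Counter()
with_comma = Counter()
with open("lc_naf_names.txt") as handle:
    for line in handle:
        # maxsplit=2 keeps the full name string, commas and all
        identifier, category, name = line.rstrip("\n").split(None, 2)
        totals[category] += 1
        if "," in name:
            with_comma[category] += 1

for category, count in totals.items():
    pct = 100.0 * with_comma[category] / count
    print(category, count, "with commas: %.0f%%" % pct)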

I was interested in how Personal and Corporate names differ across the whole LC-NAF file and wanted to see if there were any patterns that I could tease out. The final goal is to see if I can train a classifier to automatically classify a name string into either the PersonalName or CorporateName class.

But more on that later.

Personal or Corporate Name

The first thing to take a look at in the dataset is the split between PersonalName and CorporateName strings.

LC-NAF Personal / Corporate Name Distribution

As you can see the majority of names in the LC-NAF are personal names with 6,361,899 (81%) and just 1,499,822 (19%) being corporate names.

Commas

One of the common formatting rules in library land is to invert names so that they are in the format of Last, First.  This is useful when sorting names as it will group names together by family name instead of ordering them by the first name.  Because of this common rule I expected that the majority of the personal names will have a comma.  I wasn’t sure what number of the corporate names would have a comma in them.

Distribution of Commas in Name Strings

In looking at the graph above you can see that it is true that the majority of personal names, 6,280,219 (99%), have commas, with a much smaller set of corporate names, 213,580 (14%), having a comma present.

Periods

I next took a look at periods in the name string.  I wasn’t sure exactly what I would find in doing this so my only prediction was that there would be fewer name strings that have periods present.

Distribution of Periods in Name Strings

This time we see a bit different graph.  Personal names have 1,587,999 (25%) instances with periods while corporate names have 675,166 (45%) instances with periods.

Hyphens

Next up to look at are hyphens that occur in name strings.

Distribution of Hyphens in Name Strings

There are 138,524 (9%) of corporate names with hyphens and 2,070,261 (33%) of personal names with hyphens present in the name string.

I know that there are many name strings in the LC-NAF that have dates in the format of yyyy-yyyy, yyyy-, or -yyyy. Let’s see how many name strings have a hyphen when we remove those.

Date and Non-Date Hyphens

This time we look at just the instances that have hyphens and divide them into two categories: "Date Hyphens" and "Non-Date Hyphens".  You can see that most of the corporate name strings have hyphens that are not found in relation to dates.  The personal names, on the other hand, have the majority of their hyphens occurring in date strings.

Parentheses

The final punctuation characters we will look at are parentheses.

Distribution of Parenthesis in Name Strings

We see that most names overall don't have parentheses in them.  There are 472,254 (31%) corporate name strings in the dataset with parentheses and 541,087 (9%) personal name strings with parentheses.

This post is the first in a short series that takes a look at the LC Name Authority File to get a better understanding of how names in library metadata have been constructed over the years.

If you have questions or comments about this post,  please let me know via Twitter.

Removing leading or trailing white rows from images

At the library we are working on a project to digitize television news scripts from KXAS, the NBC affiliate in Fort Worth.  These scripts were read on the air during broadcasts and are a great entry point into the vast collection of film and tape that is housed at the UNT Libraries.

To date we’ve digitized and made available over 13,000 of these scripts.

In looking at workflows we noticed that sometimes the scanners and scanning software would leave several rows of white pixels at the leading or trailing end of the image.

It is kind of hard to see that because this page has a white background so I’ll include a closeup for you.  I put a black border around the image to help the white stand out a bit.

Detail of leading white edge

One problem with these white rows is that they happen some of the time but not all of the time.  Another problem is that the number of white lines isn't uniform; it varies from image to image when it occurs. The final problem is that they are not consistently at the top or at the bottom of the image. They could be at the top, the bottom, or both.

Probably the best solution to this problem is going to be getting different control software for the scanners that we are using.  But that won't help the tens of thousands of these images that we have already scanned and that we need to process.

Trimming white line

Manual

There are a number of ways that we can approach this task.  First we can do what we are currently doing, which is to have our imaging students open each image and manually crop it if needed.  This is very time consuming.

Photoshop

There is a tool in Photoshop that can sometimes be useful for this kind of work.  It is the "Trim" tool.  Here is the dialog box you get when you select this tool.

Photoshop Trim Dialog Box

This allows you to select whether you want to remove from the top or bottom (or left or right).  The tool wants you to select a place on the image to grab a color sample, and then it will try to trim off rows of the image that match that color.

Unfortunately this wasn’t an ideal solution because you still had to know if you needed to crop from the top or bottom.

Imagemagick

The Imagemagick tools have an option called "trim" that does a very similar thing to the Photoshop Trim tool.  It is well described on this page.

By default the trim option will remove edges around the whole image that match a pixel value.  You are able to adjust the specificity of the pixel color match by adding a little blur, but it isn't an ideal solution either.

A little Python

The next thing I looked at was using a bit of Python to identify the number of rows in an image that are white.

With this script you feed it an image filename and it will return the number of rows from the top of the image that are at least 90% white.

The script will convert the incoming image into a grayscale image, and then line by line count the number of pixels that have a pixel value greater than 225 (so a little white all the way to white white).  It will then count a line as “white” if more than 90% of the pixels on that line have a value of greater than 225.

Once the script reaches a row that isn’t white, it ends and returns the number of white lines it has found.  If the first row of the image is not a white row it will immediately return with a value of 0.
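The script itself isn't included here, but a minimal version of what it does might look like the following, using Pillow and NumPy.  This is a sketch, not the exact production script.

import sys

import numpy as np
from PIL import Image

def count_white_rows(filename, threshold=225, ratio=0.90, from_bottom=False):
    # Convert to grayscale and walk the rows, counting "white" rows.
    pixels = np.asarray(Image.open(filename).convert("L"))
    if from_bottom:
        pixels = pixels[::-1]  # walk the rows bottom-up instead
    count = 0
    for row in pixels:
        # A row counts as white when more than 90% of its pixels are > 225.
        if (row > threshold).mean() > ratio:
            count += 1
        else:
            break  # stop at the first non-white row
    return count

if __name__ == "__main__":
    print(count_white_rows(sys.argv[1]))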

The next thing is to go back to Imagemagick, but this time use the -chop flag to remove the number of rows from the image that the previous script reported.

mogrify -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

We tell mogrify to chop off the first fifteen rows of the image with the 0x15 value.  This geometry means zero columns and fifteen rows of pixels are removed.

Here is what the final image looks like without the leading white edge.

Corrected image

In order to count the rows from the bottom you have to adjust the script in one place.  Basically you reverse the order of the rows in the image so  you work from the bottom first.  This allows you to apply the same logic to finding white rows as we did before.

You have to adjust the Imagemagick command as well so that you are chopping the rows from the bottom of the image and not the top.  You do this by specifying -gravity in the command.

mogrify -gravity bottom -chop 0x15 UNTA_AR0787-010-1959-06-14-07_01.tif

With a little bit of bash scripting these scripts can be used to process a whole folder full of images and instructions can be given to only process images that have rows that need to be removed.

This combination of a small Python script to gather image information and then passing that info on to Imagemagick has been very useful for this project and there are a number of other ways that this same pattern can be used for processing images in a digital library workflow.

If you have questions or comments about this post,  please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – Curator Intent

This is another post in a series that I’ve been doing to compare the End of Term Web Archives from 2008 and 2012.  If you look back a few posts in this blog you will see some other analysis that I’ve done with the datasets so far.

One thing that I am interested in understanding is how well the group that conducted the EOT crawls did in relation to what I'm calling "curator intent".  For both of the EOT archives, suggested seeds were collected using instances of the URL Nomination Tool hosted by the UNT Libraries. Bulk lists of seed URLs collected by various institutions were combined with individual nominations made by users of the nomination tool, and the resulting lists were used as seed lists for the crawlers that harvested the EOT archives.

In 2008 there were four institutions that crawled content:  the Internet Archive (IA), Library of Congress (LOC), California Digital Library (CDL), and the UNT Libraries (UNT).  In 2012 CDL was not able to do any crawling, so just IA, LOC, and UNT crawled.  UNT and LOC had limited scope in what they were interested in crawling, while CDL and IA took the entire seed list and used it to feed their crawlers.  The crawls were scoped very wide so that they would capture as much content as possible; the nominated seeds were used as starting places, and we allowed the crawlers to go to all subdomains and paths on those sites as well as to areas that the sites linked to on other domains.

During the capture period there wasn't consistent quality control performed on the crawls; we accepted what we could get and went on with our business.

Looking back at the crawling that we did, I was curious about two things.

  1. How many of the domain names from the nomination tool were not present in the EOT archive.
  2. How many domains from .gov and .mil were captured but not explicitly nominated.

EOT2008 Nominated vs Captured Domains.

In the 2008 nominated URL list from the URL Nomination Tool there were a total of 1,252 domains, with 1,194 being either .gov or .mil.  In the EOT2008 archive there were a total of 87,889 domains, and 1,647 of those were either .gov or .mil.

There are 943 domains that are present in both the 2008 nomination list and the EOT2008 archive.  There are 251 .gov or .mil domains from the nomination list that were not present in the EOT2008 archive. There are 704 .gov or .mil domains that are present in the EOT2008 archive but that aren’t present in the 2008 nomination list.

Below is a chart showing the nominated vs. captured domains for .gov and .mil.

2008 .gov and .mil Nominated and Archived

Of those 704 domains that were captured but never nominated, here are the thirty most prolific.

Domain URLs
womenshealth.gov 168,559
dccourts.gov 161,289
acquisition.gov 102,568
america.gov 89,610
cfo.gov 83,846
kingcounty.gov 61,069
pa.gov 42,955
dc.gov 28,839
inl.gov 23,881
nationalservice.gov 22,096
defenseimagery.mil 21,922
recovery.gov 17,601
wa.gov 14,259
louisiana.gov 12,942
mo.gov 12,570
ky.gov 11,668
delaware.gov 10,124
michigan.gov 9,322
invasivespeciesinfo.gov 8,566
virginia.gov 8,520
alabama.gov 6,709
ct.gov 6,498
idaho.gov 6,046
ri.gov 5,810
kansas.gov 5,672
vermont.gov 5,504
arkansas.gov 5,424
wi.gov 4,938
illinois.gov 4,322
maine.gov 3,956

I see quite a few state and local governments that have a .gov domain, which were out of scope for the EOT project, but there are also a number of legitimate domains in the list that were never nominated.

EOT2012 Nominated vs Captured Domains.

In the 2012 nominated URL list from the URL Nomination Tool there were a total of 1,674 domains, with 1,551 of those being .gov or .mil domains.  In the EOT2012 archive there were a total of 186,214 domains, and 1,944 of those were either .gov or .mil.

There are 1,343 domains that are present in both the 2012 nomination list and the EOT2012 archive.  There are 208 .gov or .mil domains from the nomination list that were not present in the EOT2012 archive. There are 601 .gov or .mil domains that are present in the EOT2012 archive but that aren't present in the 2012 nomination list.

Below is a chart showing the nominated vs. captured domains for .gov and .mil.

2012 .gov and .mil Domains Nominated and Archived

Of those 601 domains that were captured but never nominated, here are the thirty most prolific.

Domain URLs
gao.gov 952,654
vaccines.mil 856,188
esgr.mil 212,741
fdlp.gov 156,499
copyright.gov 70,281
congress.gov 40,338
openworld.gov 31,929
americaslibrary.gov 18,415
digitalpreservation.gov 17,327
majorityleader.gov 15,931
sanjoseca.gov 10,830
utah.gov 9,387
dc.gov 9,063
nyc.gov 8,707
ng.mil 8,199
ny.gov 8,185
wa.gov 8,126
in.gov 8,011
vermont.gov 7,683
maryland.gov 7,612
medicalmuseum.mil 7,135
usbg.gov 6,724
virginia.gov 6,437
wv.gov 6,188
compliance.gov 6,181
mo.gov 6,030
idaho.gov 5,880
nv.gov 5,709
ct.gov 5,628
ne.gov 5,414

Again there are a number of state and local government domains present in the list but up at the top we see quite a few URLs harvested from domains that are federal in nature and would fit into the collection scope for the EOT project.

How did we do?

The way that seed lists were collected for the EOT2008 and EOT2012 nomination lists introduced a bit of dirty data.  We would need to look a little deeper to see what the issues were. Some things that come to mind: we got seeds for domains that existed prior to 2008 or 2012 but that no longer existed when we were harvesting, and there could have been typos in the URLs that were nominated, so we never grabbed the suggested content.  We might want to introduce a validation process for the nomination tool that lets us know what the status of a URL in a project is at a given point so that we at least have some sort of record.


Comparing Web Archives: EOT2008 and EOT2012 – What disappeared

This is the fourth post in a series that looks at the End of Term Web Archives captured in 2008 and 2012.  In previous posts I’ve looked at the when, what, and where of these archives.  In doing so I pulled together the domain names from each of the archives to compare them.

My thought was that I could look at which domains had content in the EOT2008 or EOT2012 and compare these domains to get some very high level idea of what content was around in 2008 but was completely gone in 2012.  Likewise I could look at new content domains that appeared since 2008.  For this post I’m limiting my view to just the domains that end in .gov or .mil because they are generally the focus of these web archiving projects.

Comparing EOT2008 and EOT2012

There are 1,647 unique domain names in the EOT2008 archive and 1,944 unique domain names in the EOT2012 archive, which is an increase of 18%. Between the two archives there are 1,236 domain names in common.  There are 411 domains that exist in EOT2008 that are not present in EOT2012, and 708 new domains in EOT2012 that didn't exist in EOT2008.

Domains in EOT2008 and EOT2012

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  When you look at the URLs in the 411 domains that are present in EOT2008 but missing from EOT2012, you get 3,784,308, which is just 2% of the total number of URLs.  When you look at the domains only present in EOT2012, you see 5,562,840 URLs (3%) that were harvested from domains that only existed in the EOT2012 archive.
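The comparison itself is simple set arithmetic.  A sketch, assuming one text file of unique domain names per archive (filenames hypothetical):

def load_domains(filename):
    with open(filename) as handle:
        return {line.strip().lower() for line in handle if line.strip()}

eot2008 = load_domains("eot2008_domains.txt")
eot2012 = load_domains("eot2012_domains.txt")

common = eot2008 & eot2012  # domains present in both archives
gone = eot2008 - eot2012    # present in 2008, missing in 2012
new = eot2012 - eot2008     # new in 2012

print(len(common), len(gone), len(new))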

The thirty domains with the most URLs captured for them that were present in the EOT2008 collection that weren’t present in EOT2012 are listed in the table below.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
arpa.gov 15,093
okgeosurvey1.gov 14,595
omhrc.gov 14,594
usafreedomcorp.gov 14,298
uscva.gov 13,627
odci.gov 12,920

The thirty domains with the most URLs from EOT2012 that weren't present in EOT2008 are listed in the table below.

Domain Count
militaryonesource.mil 859,843
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
saferproducts.gov 65,387
challenge.gov 63,808
healthdata.gov 63,105
marinecadastre.gov 62,882
fatherhood.gov 62,132
edpubs.gov 58,356
transportationresearch.gov 58,235
cbca.gov 56,043
usbonds.gov 55,102
usbond.gov 54,847
phe.gov 53,626
ussavingsbond.gov 53,563
scienceeducation.gov 53,468
mda.gov 53,010

Shared domains that changed

There were a number of domains (1,236) that are present in both the EOT2008 and EOT2012 archives.  I thought it would be interesting to compare those domains and see which ones changed the most.  Below are the fifty shared domains that changed the most between EOT2008 and EOT2012.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
congress.gov 1,554,199 40,338 -1,513,861 1,513,861 -97%
noaa.gov 5,317,872 3,985,633 -1,332,239 1,332,239 -25%
epa.gov 1,628,517 327,810 -1,300,707 1,300,707 -80%
uscourts.gov 1,484,240 184,507 -1,299,733 1,299,733 -88%
dol.gov 1,387,724 88,557 -1,299,167 1,299,167 -94%
census.gov 1,604,505 328,014 -1,276,491 1,276,491 -80%
dot.gov 1,703,935 554,325 -1,149,610 1,149,610 -67%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
doe.gov 1,164,955 268,694 -896,261 896,261 -77%
vaccines.mil 5,665 856,188 850,523 850,523 15,014%
fdlp.gov 991,747 156,499 -835,248 835,248 -84%
uspto.gov 980,215 155,428 -824,787 824,787 -84%
bts.gov 921,756 130,730 -791,026 791,026 -86%
cdc.gov 1,014,213 264,500 -749,713 749,713 -74%
lbl.gov 743,472 4,080 -739,392 739,392 -99%
faa.gov 945,446 206,500 -738,946 738,946 -78%
treas.gov 838,243 99,411 -738,832 738,832 -88%
fema.gov 903,393 172,055 -731,338 731,338 -81%
clinicaltrials.gov 919,490 196,642 -722,848 722,848 -79%
army.mil 2,228,691 2,936,308 707,617 707,617 32%
nsf.gov 760,976 65,880 -695,096 695,096 -91%
prc.gov 740,176 75,682 -664,494 664,494 -90%
doc.gov 823,825 173,538 -650,287 650,287 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
nbii.gov 577,708 391 -577,317 577,317 -100%
defense.gov 687 575,776 575,089 575,089 83,710%
usajobs.gov 3,487 551,217 547,730 547,730 15,708%
sandia.gov 736,032 210,429 -525,603 525,603 -71%
nps.gov 706,323 191,102 -515,221 515,221 -73%
defenselink.mil 502,023 1,868 -500,155 500,155 -100%
fws.gov 625,180 132,402 -492,778 492,778 -79%
ssa.gov 609,784 125,781 -484,003 484,003 -79%
archives.gov 654,689 175,585 -479,104 479,104 -73%
fnal.gov 575,167 1,051,926 476,759 476,759 83%
change.gov 486,798 24,820 -461,978 461,978 -95%
buyusa.gov 490,179 37,053 -453,126 453,126 -92%

Only 11 of the 50 (22%) resulted in more content harvested in EOT2012 than in EOT2008.

Of the eleven domains that had more content harvested for them in EOT2012, five (navy.mil, osd.mil, vaccines.mil, defense.gov, and usajobs.gov) increased by over 1,000% in the amount of content.  I don't know if this is necessarily the result of an increase in attention to these sites, more content on the sites, or a different organization of the sites that made them easier to harvest.  I suspect it is some combination of all three of those things.
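
For anyone curious, the change columns in the table above are straightforward to produce once you have a dict of URL counts per domain for each archive.  Here is a rough sketch (counts_2008 and counts_2012 are placeholder names for those dicts):

def change_table(counts_2008, counts_2012):
    # build (domain, 2008 count, 2012 count, change, absolute change,
    # percent change) rows for domains present in both archives,
    # sorted with the largest absolute change first
    rows = []
    for domain in counts_2008.keys() & counts_2012.keys():
        old, new = counts_2008[domain], counts_2012[domain]
        change = new - old
        pct_change = round(100.0 * change / old)
        rows.append((domain, old, new, change, abs(change), pct_change))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows

Sorting on the absolute change keeps the domains with the largest swings at the top regardless of direction.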

Summary

It should be expected that domains will come into and go out of existence on a regular basis in a web space as large as the federal government.  One of the things that I think is rather challenging is identifying the list of domains that were present at one given time within an organization.  For example, "what domains did the federal government have in 1998?"  It seems like one way to come up with that answer is to use web archives. Based on the analysis in this post, there are 411 domains that were present in 2008 that we weren't able to capture in 2012.  Take a look at the list of the top thirty: did you recognize any of those? How many other initiatives, committees, agencies, task forces, and citizen education portals existed at one point that are now gone?

If you have questions or comments about this post, please let me know via Twitter.

Comparing Web Archives: EOT2008 and EOT2012 – Where

This post carries on the analysis of the End of Term web archives for 2008 and 2012. Previous posts in this series discuss when content was harvested and what kind of content was harvested and included in the archives.

In this post we will look at where content came from, specifically how the data breaks down by top level domain, domain name, and subdomain name.

Top Level Domains

The first thing to look at is the top level domains for all of the URLs in the CDX files.
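
Pulling the TLD out of each CDX entry is a small job once you account for the SURT format.  CDX lines typically begin with a SURT-formatted key like gov,loc)/index.html, with the host labels reversed, so the first comma-separated label is the TLD.  A minimal sketch of the tally (the header-line check is an assumption about how a given CDX file is formatted):

from collections import Counter

def tld_counts(cdx_path):
    # tally URLs per top level domain from one CDX file
    counts = Counter()
    with open(cdx_path) as cdx:
        for line in cdx:
            if line.startswith(" CDX"):  # skip the format header line
                continue
            surt_key = line.split(" ", 1)[0]    # e.g. gov,loc)/index.html
            host = surt_key.split(")", 1)[0]    # e.g. gov,loc
            counts[host.split(",", 1)[0]] += 1  # first label is the TLD
    return counts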

In the EOT2008 archive there are a total of 241 unique TLDs.  In the EOT2012 archive there are a total of 251 unique TLDs.  This is a modest increase of 4.15% from EOT2008 to EOT2012.

The EOT2008 and EOT2012 archives share 225 TLDs.  There are 16 TLDs that are unique to the EOT2008 archive and 26 TLDs that are unique to the EOT2012 archive.

TLDs unique to EOT2008

Unique to 2008 URLs from TLD
null 18,772
www 583
yu 357
labs 20
webteam 16
cg 10
security 8
ssl 8
b 8
css 7
web 6
dev 4
education 4
misc 2
secure 2
campaigns 2

TLDs unique to EOT2012

Unique to 2012 URLs from TLD
whois 17,500
io 7,935
pn 987
sy 541
lr 478
so 418
nr 363
tf 291
xxx 258
re 186
xn--p1ai 171
bi 153
dm 120
tel 78
ck 65
ax 64
sx 54
tg 50
ki 48
gg 25
kn 25
gp 24
pm 20
fk 18
cf 7
wf 3

I believe that the "null" TLD from EOT2008 is an artifact of the crawling process and possibly represents rows in the CDX files that correspond to metadata records in the WARC/ARC files from 2008.  I will have to do some digging to confirm.

Change in TLD

Next up we take a look at the 225 TLDs that are shared between the archives. First are the fifteen most changed, based on the increase or decrease in the number of URLs from that TLD.

TLD EOT2008 EOT2012 Change Absolute Change % Change
com 7,809,711 45,594,482 37,784,771 37,784,771 483.8%
gov 137,829,050 109,141,353 -28,687,697 28,687,697 -20.8%
mil 3,555,425 16,223,861 12,668,436 12,668,436 356.3%
net 653,187 9,269,406 8,616,219 8,616,219 1319.1%
edu 3,552,509 2,442,626 -1,109,883 1,109,883 -31.2%
int 135,939 685,168 549,229 549,229 404.0%
uk 70,262 594,020 523,758 523,758 745.4%
ly 95 503,457 503,362 503,362 529854.7%
org 5,108,645 5,588,750 480,105 480,105 9.4%
us 840,516 474,156 -366,360 366,360 -43.6%
co 2,839 211,131 208,292 208,292 7336.8%
be 4,019 203,178 199,159 199,159 4955.4%
jp 23,896 220,602 196,706 196,706 823.2%
me 35 182,963 182,928 182,928 522651.4%
tv 10,373 191,736 181,363 181,363 1748.4%

The change in the first two is interesting.  There was an increase of over 37 million URLs (484%) for the com TLD between EOT2008 and EOT2012.  There was also a decrease (-21%) of over 28 million URLs for the gov TLD.  The mil TLD also increased by 356% between the EOT2008 and EOT2012 harvests, with an increase of over 12 million URLs.

You can see that .ly and .me increased by some serious percentages, 529,855% and 522,651% respectively.

Taking a look at just the percent of change, here are the five most changed based on that percentage.

TLD EOT2008 EOT2012 Change Absolute Change % Change
ly 95 503,457 503,362 503,362 529854.7%
me 35 182,963 182,928 182,928 522651.4%
gl 129 49,733 49,604 49,604 38452.7%
gd 9 3,273 3,264 3,264 36266.7%
cat 43 11,703 11,660 11,660 27116.3%

I have a feeling that the majority of the .ly, .me, .gl, and .gd TLD content came in as redirect URLs from link-shortening services.

Domain Names

There are 87,889 unique domain names in the EOT2008 archive; this increases dramatically to 186,214 in the EOT2012 archive, an increase of 118% in the number of domain names.

There are 30,066 domain names that are shared between the two archives.  There are 57,823 domain names unique to the EOT2008 archive and 156,148 domain names unique to the EOT2012 archive.

Here is a table showing thirty of the domains that were only present in the EOT2008 archive ordered by the number of URLs from that domain.

Domain Count
geodata.gov 812,524
nifl.gov 504,910
stat-usa.gov 398,961
tradestatsexpress.gov 243,729
arnet.gov 174,057
acqnet.gov 171,493
dccourts.gov 161,289
meish.org 147,261
web-services.gov 137,202
metrokc.gov 132,210
sdi.gov 91,887
davie-fl.gov 88,123
belmont.gov 87,332
aftac.gov 84,507
careervoyages.gov 57,192
women-21.gov 56,255
egrpra.gov 54,775
4women.gov 45,684
4woman.gov 42,192
nypa.gov 36,099
secure-banking.com 33,059
nhmfl.gov 27,569
darpa.gov 21,454
usafreedomcorps.gov 18,001
peacecore.gov 17,744
californiadesert.gov 15,172
federaljudgesassoc.org 15,126
arpa.gov 15,093
transportationfortomorrow.org 14,926
okgeosurvey1.gov 14,595

Here is the same kind of table but this time for the EOT2012 dataset.

Domain Count
militaryonesource.mil 859,843
yfrog.com 682,664
staticflickr.com 640,606
akamaihd.net 384,769
4sqi.net 350,707
foursquare.com 340,492
adf.ly 334,767
pinterest.com 244,293
consumerfinance.gov 237,361
nrd.gov 194,215
wh.gov 179,233
t.co 175,033
youtu.be 172,301
sndcdn.com 161,039
pnnl.gov 132,994
eia.gov 112,034
transparency.gov 109,039
nationalguard.mil 108,854
acus.gov 93,810
nrsc.org 85,925
mzstatic.com 84,202
404.gov 82,409
savingsbondwizard.gov 76,867
treasuryhunt.gov 76,394
mynextmove.org 75,927
fedshirevets.gov 75,529
onrr.gov 75,484
veterans.gov 75,350
broadbandmap.gov 72,889
ntm-a.com 71,126

Those are pretty long tables but I think they start to point at some interesting things from this analysis, such as the domains that were present and harvested in 2008 but weren't harvested in 2012.  In looking at the list, some of them (metrokc.gov, davie-fl.gov, okgeosurvey1.gov) were most likely out of scope for the "Federal Web" but got captured because of the gov TLD.

In the EOT2012 list you start to see artifacts of the increased attention to social media site capture for the EOT2012 project.  Sites like yfrog.com, staticflickr.com, adf.ly, t.co, youtu.be, foursquare.com, and pinterest.com probably came from that increased attention.

Here is a list of the twenty most changed domains from EOT2008 to EOT2012.  This number is based on the absolute change in the number of URLs captured for each of the archives.

Domain EOT2008 EOT2012 Change Absolute Change % Change
house.gov 13,694,187 35,894,356 22,200,169 22,200,169 162%
facebook.com 11,895 7,503,640 7,491,745 7,491,745 62,982%
dvidshub.net 1,097 5,612,410 5,611,313 5,611,313 511,514%
senate.gov 5,043,974 9,924,917 4,880,943 4,880,943 97%
gpo.gov 8,705,511 3,888,645 -4,816,866 4,816,866 -55%
nih.gov 5,276,262 1,267,764 -4,008,498 4,008,498 -76%
nasa.gov 6,693,542 3,063,382 -3,630,160 3,630,160 -54%
navy.mil 94,081 3,611,722 3,517,641 3,517,641 3,739%
usgs.gov 4,896,493 1,690,295 -3,206,198 3,206,198 -65%
loc.gov 5,059,848 7,587,179 2,527,331 2,527,331 50%
flickr.com 157,155 2,286,890 2,129,735 2,129,735 1,355%
youtube.com 346,272 2,369,108 2,022,836 2,022,836 584%
hhs.gov 2,361,866 366,024 -1,995,842 1,995,842 -85%
osd.mil 180,046 2,111,791 1,931,745 1,931,745 1,073%
af.mil 230,920 2,067,812 1,836,892 1,836,892 795%
ed.gov 2,334,548 510,413 -1,824,135 1,824,135 -78%
granicus.com 782 1,785,724 1,784,942 1,784,942 228,253%
lanl.gov 2,081,275 309,007 -1,772,268 1,772,268 -85%
usda.gov 2,892,923 1,324,049 -1,568,874 1,568,874 -54%
googleusercontent.com 2 1,560,457 1,560,455 1,560,455 78,022,750%

You see big increases in content from EOT2008 to EOT2012 for facebook.com (+62,982%), flickr.com (+1,355%), youtube.com (+584%), and googleusercontent.com (+78,022,750%).

Other notable increases include dvidshub.net, the domain for a site called the Defense Video & Imagery Distribution System, which increased by 511,514%, as well as navy.mil (+3,739%), osd.mil (+1,073%), and af.mil (+795%).  I like to think this speaks to a desired increase in attention to .mil content in the EOT2012 project.

Another domain that stands out to me is granicus.com, which I was unaware of but which, after a little looking, turns out to be one of the big cloud service providers for the federal government (or at least it was according to the EOT2012 dataset).

.gov and .mil subdomains

The last piece I wanted to look at related to domain names is what sort of changes there were in the gov and mil portions of the EOT2008 and EOT2012 crawls.  This time I wanted to look at the subdomains.

I filtered my dataset a bit so that I was only looking at the .mil and .gov content.
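
Because the SURT keys in the CDX files put the TLD first, this filtering is a small extension of the TLD tally sketched earlier; rebuilding the host name is just a matter of reversing the labels.  Again, a minimal sketch under the same CDX formatting assumptions:

from collections import Counter

def subdomain_counts(cdx_path, tlds=("gov", "mil")):
    # tally URLs per full host name, keeping only the given TLDs
    counts = Counter()
    with open(cdx_path) as cdx:
        for line in cdx:
            surt_key = line.split(" ", 1)[0]      # gov,house,boucher)/...
            host_part = surt_key.split(")", 1)[0]
            labels = host_part.split(":", 1)[0].split(",")  # drop any port
            if labels[0] in tlds:
                counts[".".join(reversed(labels))] += 1  # boucher.house.gov
    return counts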

In the EOT2008 archive there were a total of 16,072 unique subdomains and in EOT2012 there were 22,477 subdomains.  This is an increase of 40% between the two archive projects.

The EOT2008 archive has 5,371 subdomains unique to its holdings and EOT2012 has 11,776 unique subdomains.

Subdomains that had the most content (based on URLs downloaded) and are only present in EOT2008 are presented below, limited to the top thirty.

EOT2008 Subdomain Count
gos2.geodata.gov 809,442
boucher.house.gov 772,759
kendrickmeek.house.gov 685,368
citizensbriefingbook.change.gov 446,632
stat-usa.gov 305,936
nifl.gov 285,833
scidac-new.ca.sandia.gov 247,451
tradestatsexpress.gov 243,729
hpcf.nersc.gov 221,626
gopher.info.usaid.gov 219,051
novel.nifl.gov 218,962
dli2.nsf.gov 206,932
contractorsupport.acf.hhs.gov 188,841
pnwin.nbii.gov 188,591
faq.acf.hhs.gov 184,212
ccdf.acf.hhs.gov 182,606
arnet.gov 174,018
regulations.acf.hhs.gov 171,762
acqnet.gov 171,493
dccourts.gov 161,289
employers.acf.hhs.gov 139,141
search.info.usaid.gov 137,816
web-services.gov 137,202
earth2.epa.gov 136,441
cjtf7.army.mil 134,507
ncweb-north.wr.usgs.gov 134,486
opre.acf.hhs.gov 133,689
childsupportenforcement.acf.hhs.gov 132,023
modis-250m.nascom.nasa.gov 128,810
casd.uscourts.gov 124,146

Here is the same sort of data for the EOT2012 dataset.

EOT2012 Subdomain Count
militaryonesource.mil 698,035
uscodebeta.house.gov 387,080
democrats.foreignaffairs.house.gov 312,270
gulflink.fhpr.osd.mil 262,246
coons.senate.gov 257,721
democrats.energycommerce.house.gov 243,341
consumerfinance.gov 225,815
dcmo.defense.gov 217,255
nrd.gov 187,267
wh.gov 179,103
usaxs.xray.aps.anl.gov 178,298
democrats.budget.house.gov 175,109
democrats.edworkforce.house.gov 162,077
apps.militaryonesource.mil 157,144
naturalresources.house.gov 155,918
purl.fdlp.gov 154,718
media.dma.mil 137,581
algreen.house.gov 131,388
democrats.transportation.house.gov 129,345
democrats.naturalresources.house.gov 124,808
hanabusa.house.gov 123,794
pitts.house.gov 122,402
visclosky.house.gov 122,223
garamendi.house.gov 114,221
vault.fbi.gov 113,873
green.house.gov 113,040
sewell.house.gov 112,973
levin.house.gov 111,971
eia.gov 111,889
hahn.house.gov 111,024

This last table is a little long, but I found the data pretty interesting to look at.  The table below shows the biggest change for domains and subdomains that were shared between the EOT2008 and EOT2012 archives. I've included the top forty entries for that list.

Subdomain/Domain EOT2008 EOT2012 Change Absolute Change % Change
listserv.access.gpo.gov 2,217,565 7,487 -2,210,078 2,210,078 -100%
carter.house.gov 1,898,462 29,680 -1,868,782 1,868,782 -98%
catalog.gpo.gov 1,868,504 34,040 -1,834,464 1,834,464 -98%
loc.gov 63,534 1,875,264 1,811,730 1,811,730 2,852%
gpo.gov 52,427 1,796,925 1,744,498 1,744,498 3,327%
bensguide.gpo.gov 90,280 1,790,017 1,699,737 1,699,737 1,883%
edocket.access.gpo.gov 1,644,578 7,822 -1,636,756 1,636,756 -100%
nws.noaa.gov 103,367 1,676,264 1,572,897 1,572,897 1,522%
navair.navy.mil 220 1,556,320 1,556,100 1,556,100 707,318%
congress.gov 1,525,467 356 -1,525,111 1,525,111 -100%
cha.house.gov 1,366,520 109,192 -1,257,328 1,257,328 -92%
usbg.gov 1,026,360 6,724 -1,019,636 1,019,636 -99%
dol.gov 1,052,335 41,909 -1,010,426 1,010,426 -96%
resourcescommittee.house.gov 1,008,655 335 -1,008,320 1,008,320 -100%
calvert.house.gov 20,530 1,014,416 993,886 993,886 4,841%
fdlp.gov 989,415 1,554 -987,861 987,861 -100%
lcweb2.loc.gov 466,623 1,451,708 985,085 985,085 211%
cramer.house.gov 1,011,872 60,879 -950,993 950,993 -94%
ed.gov 1,141,069 241,165 -899,904 899,904 -79%
vaccines.mil 5,638 856,113 850,475 850,475 15,085%
clinicaltrials.gov 919,362 193,158 -726,204 726,204 -79%
army.mil 4,831 725,934 721,103 721,103 14,927%
boehner.house.gov 7,472 695,625 688,153 688,153 9,210%
nces.ed.gov 702,644 31,922 -670,722 670,722 -95%
prc.gov 739,849 75,682 -664,167 664,167 -90%
navy.mil 1,481 654,254 652,773 652,773 44,077%
house.gov 818,095 172,066 -646,029 646,029 -79%
fueleconomy.gov 675,522 79,943 -595,579 595,579 -88%
fema.gov 636,005 53,321 -582,684 582,684 -92%
frwebgate.access.gpo.gov 621,361 55,097 -566,264 566,264 -91%
siadapp.dmdc.osd.mil 43 559,076 559,033 559,033 1,300,077%
fdsys.gpo.gov 548,618 28 -548,590 548,590 -100%
tiger.census.gov 549,046 750 -548,296 548,296 -100%
rs6.loc.gov 550,489 6,695 -543,794 543,794 -99%
bennelson.senate.gov 16,203 553,698 537,495 537,495 3,317%
crapo.senate.gov 28,569 540,928 512,359 512,359 1,793%
eia.doe.gov 508,675 1,629 -507,046 507,046 -100%
epa.gov 623,457 117,794 -505,663 505,663 -81%
defenselink.mil 502,006 1,866 -500,140 500,140 -100%
access.gpo.gov 472,373 3,110 -469,263 469,263 -99%

I find this table interesting for a number of reasons.  First, you see quite a bit more decline than I have seen in my other tables like this.  In fact 26 of the 40 subdomains/domains (65%) on this list decreased from EOT2008 to EOT2012.

In looking at the list I can also see the transition of some of the sites within GPO, for example access.gpo.gov going down 99% in captured content, fdsys.gpo.gov going down by 100%, and bensguide.gpo.gov increasing by 1,883%.

Wrapping Up

I like to think that it helps justify some of the work that the partners of the End of Term project are committing to the project when you see that there are large numbers of domains and subdomains that existed in 2008 but weren't crawled again in 2012 (we can only assume they were no longer around in 2012).

There are a few more things I want to look at in this work so stay tuned.

If you have questions or comments about this post, please let me know via Twitter.