
How many of the EOT2008 PDF files were harvested in EOT2012

In my last post I started looking at some of the data from the End of Term 2012 Web Archive snapshot that we have at the UNT Libraries.  For more information about EOT2012 take a look at that previous post.

EOT2008 PDFs

From the EOT2008 Web archive I had extracted the 4,489,675 unique (by hash) PDF files that were present and carried out a bit of analysis on them as a whole to see if there was anything interesting I could tease out. I presented the results of that investigation at an IS&T Archiving conference a few years back; the text from the proceedings for that submission is here and the slides presented are here.

Moving forward several years,  I was curious to see how many of those nearly 4.5 million PDFs were still around in 2012 when we crawled the federal Web again as part of the EOT2012 project.

I used the same hash dataset from the previous post for this work, which made things very easy. I first pulled the hash values for the 4,489,675 PDF files from EOT2008. Next I loaded all of the hash values from the EOT2012 crawls. The final step was to iterate through each of the PDF file hashes and do a lookup to see if that content hash is present in the EOT2012 hash dataset. Pretty straightforward.
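In code this is just set membership. Below is a minimal sketch of the lookup, assuming each hash dataset is a plain text file with one content hash per line (the file names are made up for illustration).

def load_hashes(path):
    # Load a file of content hashes into a set, one hash per line.
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

eot2008_pdf_hashes = load_hashes("eot2008_pdf_hashes.txt")
eot2012_hashes = load_hashes("eot2012_hashes.txt")

found = sum(1 for h in eot2008_pdf_hashes if h in eot2012_hashes)
missing = len(eot2008_pdf_hashes) - found

print("Found:   %d (%.0f%%)" % (found, 100.0 * found / len(eot2008_pdf_hashes)))
print("Missing: %d (%.0f%%)" % (missing, 100.0 * missing / len(eot2008_pdf_hashes)))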

Findings

After the numbers finished running,  it looks like we have the following.

          PDFs        Percentage
Found     774,375     17%
Missing   3,715,300   83%
Total     4,489,675   100%

Put into a pie chart where red equals bad.

EOT2008 PDFs in EOT2012 Archive

So 83% of the PDF files that were present in 2008 are not present in the EOT2012 Archive.

With a little work it wouldn’t be hard to see how many of these PDFs are still present on the web today at the same URL as in 2008.  I would imagine it is a much smaller number than the 17%.

One thing to note is that because I am matching on content hashes and not URLs, an EOT2008 PDF could have been harvested again in 2012 at a different URL entirely. So the original URL might no longer be available, but the content could still be present at another location.

If you have questions or comments about this post,  please let me know via Twitter.

Poking at the End of Term 2012 Presidential Web Archive

In preparation for some upcoming work with the End of Term 2016 crawl and a few conference talks I should be prepared for, I thought it might be a good thing to start doing a bit of long-overdue analysis of the End of Term 2012 (EOT2012) dataset.

A little bit of background for those who aren't familiar with the End of Term program. Back in 2008 a group of institutions got together to collaboratively collect a snapshot of the federal government's Web presence, with the hope of preserving the transition from the Bush administration to what became the Obama administration. In 2012 this group added a few additional partners and set out to take another snapshot of the federal Web presence.

The EOT2008 dataset was studied as part of a research project funded by IMLS, but the EOT2012 dataset really hasn't been looked at much since it was collected.

As part of the EOT process, several institutions crawl content directly relevant to their collection missions and then share what they collect with the group as a whole, so that any partner interested in the entire collected EOT archive can acquire a copy. In 2012 the Internet Archive, the Library of Congress, and the UNT Libraries were the institutions that committed resources to crawling. UNT was also interested in acquiring the archive for its collection, which is why I have a copy locally.

For the analysis that I am interested in doing for this blog post, I took a copy of the combined CDX files for each of the crawling institutions as the basis of my dataset.  There was one combined CDX for each of IA, LOC, and UNT.

If you count the total lines in the three CDX files, you get the number of entries (URL captures) in the collection pretty easily. This ends up being the following:

Collecting Org    Total CDX Entries    % of EOT2012 Archive
IA                80,083,182           41.0%
LOC               79,108,852           40.5%
UNT               36,085,870           18.5%
Total             195,277,904          100%

Here is how that looks as a pie chart.

EOT2012 Collection Distribution

If you pull out all of the content hash values you get the number of "unique files by content hash" in each CDX file. By doing this you are ignoring repeat captures of the same content on different dates, as well as the same content occurring at different URL locations on the same or different hosts.
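As a rough sketch, pulling the digests out of a CDX file looks something like this, assuming the common 11-column CDX layout where the content digest is the sixth field (the file name here is made up; adjust the column index if your CDX header differs).

def unique_digests(cdx_path):
    # Count total CDX entries and collect the distinct content digests.
    digests = set()
    total = 0
    with open(cdx_path) as f:
        for line in f:
            if line.startswith(" CDX") or line.startswith("CDX"):
                continue  # skip the CDX header line
            fields = line.split()
            if len(fields) < 6:
                continue
            total += 1
            digests.add(fields[5])  # digest column in the standard 11-field layout
    return total, digests

total_entries, unt_hashes = unique_digests("UNT.eot2012.cdx")
print(total_entries, len(unt_hashes))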

Collecting Org    Unique CDX Hashes    % of EOT2012 Archive
IA                45,487,147           38.70%
LOC               50,835,632           43.20%
UNT               25,179,867           21.40%
Total             117,616,637          100.00%

Again, as a pie chart:

Unique hash values

It looks like there was a little bit of change in the percentages of unique content, with UNT and LOC going up a few percentage points and IA going down. I would guess this has to do with the fact that, for the EOT projects, IA conducted many broad crawls at multiple points during the project, which resulted in more overlap.

Here is a table that can give you a sense of how much duplication (based on just the hash values) there is in each of the collections and then overall.

Collecting Org    Total CDX Entries    Unique CDX Hashes    Duplication
IA                80,083,182           45,487,147           43.20%
LOC               79,108,852           50,835,632           35.70%
UNT               36,085,870           25,179,867           30.20%
Total             195,277,904          117,616,637          39.80%
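For clarity, the duplication column is just one minus the unique hashes over the total entries. Using the IA numbers from the table:

total_entries = 80083182   # IA total CDX entries
unique_hashes = 45487147   # IA unique CDX hashes

duplication = 1 - unique_hashes / float(total_entries)
print("%.1f%%" % (duplication * 100))   # 43.2%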

You will see that UNT has the least duplication (possibly more focused crawls with less repeat crawling) while IA has the most (broader crawls that revisited the same data?).

Questions to answer

There were three questions that I wanted to answer for this look at the EOT data.

  1. How many hashes are common across all three CDX files?
  2. How many hashes are unique to only one CDX file?
  3. How many hashes are shared by two CDX files but not by the third?

Common Across all CDX files

The first was pretty easy to answer and just required taking all three sets of hashes and identifying which hashes appear in all three (the intersection).
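Continuing the sketch from earlier, with a digest set built for each organization's combined CDX file (the file names are again made up), the intersection is a one-liner:

ia_total, ia_hashes = unique_digests("IA.eot2012.cdx")
loc_total, loc_hashes = unique_digests("LOC.eot2012.cdx")
unt_total, unt_hashes = unique_digests("UNT.eot2012.cdx")

shared_by_all = ia_hashes & loc_hashes & unt_hashes
print(len(shared_by_all), "hashes present in all three CDX files")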

There are only 237,171 (0.2%) hashes shared by IA, LOC and UNT.

Content crawled by all three

You can see that there is a very small amount of content that is present in all three of the CDX files.

Unique Hashes to one CDX file

Next up was the number of hashes that were unique to a single collecting organization's CDX file. This took two steps: first I took the difference of two hash sets, then I took the difference of that resulting set from the third set.
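With the same three sets as above, that looks roughly like:

# Hashes seen only in one organization's CDX file.
ia_only = (ia_hashes - loc_hashes) - unt_hashes
loc_only = (loc_hashes - ia_hashes) - unt_hashes
unt_only = (unt_hashes - ia_hashes) - loc_hashes
print(len(ia_only), len(loc_only), len(unt_only))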

Collecting Org    Unique Hashes    Unique to Collecting Org    Percentage Unique
IA                45,487,147       42,187,799                  92.70%
LOC               50,835,632       48,510,991                  95.40%
UNT               25,179,867       23,269,009                  92.40%
Unique to a collecting org

It appears that there is quite a bit of unique content in each of the CDX files, with 92% or more of the content being unique to the collecting organization.

Common between two but not three CDX files

The final question to answer was how much of the content is shared between two collecting organizations but not present in the third’s contribution.
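That is just the intersection of a pair of sets minus the third, roughly:

ia_loc_not_unt = (ia_hashes & loc_hashes) - unt_hashes
ia_unt_not_loc = (ia_hashes & unt_hashes) - loc_hashes
loc_unt_not_ia = (loc_hashes & unt_hashes) - ia_hashes
print(len(ia_loc_not_unt), len(ia_unt_not_loc), len(loc_unt_not_ia))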

Shared by                           Unique Hashes
Shared by IA and LOC but not UNT    1,737,980
Shared by IA and UNT but not LOC    1,324,197
Shared by UNT and LOC but not IA    349,490

Closing

Unique and shared hashes

Based on this brief look at how content hashes are distributed across the three CDX files that make up the EOT2012 archive, I think a takeaway is that there is very little overlap between the crawling that these three organizations carried out during the EOT harvests.  Essentially 97% of content hashes are present in just one repository.

I don't think this tells the whole story though. There are quite a few caveats to keep in mind. First of all, this only takes into account the content hashes that are included in the CDX files. If you crawl a dynamic webpage that prints out the current time on every visit, you will get a different content hash each time. So "unique" here is only unique in the eyes of the hash function that is used.

There are quite a few other bits of analysis that can be done with this data, hopefully I’ll get around to doing a little more in the next few weeks.

If you have questions or comments about this post,  please let me know via Twitter.

Identify outliers: Building a user interface feature.

Background:

At work we are deep in the process of redesigning the user interface of The Portal to Texas History. We have a great team in our User Interfaces Unit that I get to work with on this project; they do the majority of the work and I have been a data gatherer, identifying problems that come up in our data.

As we get closer to our beta release, we have a new feature we want to add to the collection and partner detail pages. Below is the current mockup of this detail page.

Collection Detail Mockup

Quite long, isn't it? We are trying something out (more on that later).

The feature we want more data for is the "At a Glance" feature. It displays the number of unique values (cardinality) of a specific field for a collection or partner.

At A Glance Detail

So in the example above we show that there are 132 items, 1 type, 3 titles, 1 contributing partner, 3 decades and so on.

All of this is pretty straightforward so far.

The next thing we want to do is highlight a box in a different color if its value is noticeably different from the norm. For example, if the average collection has three different languages present, we might want to highlight the language box for a collection that has ten languages represented.

There are several ways we could do this. First off, we just made some guesses and coded in values that we felt would be good thresholds. I wanted to see if we could figure out a way to identify these thresholds based on the data itself. That's what this blog post is going to try to do.

Getting the data:

First of all I need to pull out my “I couldn’t even play an extra who stands around befuddled on a show about statistics, let alone play a stats person on TV” card (wow I really tried with that one) so if you notice horribly incorrect assumptions or processes here, 1. you are probably right, and 2. please contact me so I can figure out what I’m doing wrong.

That being said, here we go.

We currently have 453 unique collections in The Portal to Texas History. For each of these collections we are interested in calculating the cardinality of the following fields:

  • Number of items
  • Number of languages
  • Number of series titles
  • Number of resource types
  • Number of countries
  • Number of counties
  • Number of states
  • Number of decades
  • Number of partner institutions
  • Number of item uses

To calculate these numbers I pulled data from our trusty Solr index, making use of the stats component and the stats.calcdistinct=true option. Using this I am able to get the number of unique values for each of the fields listed above.
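The request involved looks roughly like the sketch below; the Solr URL, core, query, and field name are made up for illustration, but the stats component with stats.calcdistinct=true does report a countDistinct value for the field.

import requests

SOLR_URL = "http://localhost:8983/solr/portal/select"  # hypothetical Solr core

def cardinality(collection_id, field):
    # Ask the Solr stats component for the number of distinct values of a field.
    params = {
        "q": "collection:%s" % collection_id,
        "rows": 0,
        "stats": "true",
        "stats.field": field,
        "stats.calcdistinct": "true",
        "wt": "json",
    }
    response = requests.get(SOLR_URL, params=params).json()
    return response["stats"]["stats_fields"][field]["countDistinct"]

print(cardinality("ABC0001", "language"))  # hypothetical collection id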

Now that I have the numbers from Solr I can format them into lists of the unique values and start figuring out how I want to define a threshold.

Defining a threshold:

For this first attempt I decided to try to define the threshold using the Tukey Method, which uses the Interquartile Range (IQR). If you never took any statistics courses (I was a music major, so not much math for me), I found the post Highlighting Outliers in your Data with the Tukey Method extremely helpful.

First off I used the handy st program to get an overview of the data that I was going to be working with.

Field     N    min  q1     median  q3      max         sum         mean      stddev     stderr
items     453  1    98     303     1,873   315,227     1,229,840   2,714.87  16,270.90  764.47
language  453  1    1      1       2       17          802         1.77      1.77       0.08
titles    453  0    1      1       3       955         5,082       11.22     65.12      3.06
type      453  1    1      1       2       22          1,152       2.54      3.77       0.18
country   453  0    1      1       1       73          1,047       2.31      5.59       0.26
county    453  0    1      1       7       445         8,901       19.65     53.98      2.54
states    453  0    1      1       2       50          1,902       4.20      8.43       0.40
decade    453  0    2      5       9       49          2,759       6.09      5.20       0.24
partner   453  1    1      1       1       103         1,007       2.22      7.22       0.34
uses      453  5    3,960  17,539  61,575  10,899,567  50,751,800  112,035   556,190    26,132.1

With the q1 and q3 values we can calculate the IQR for each field. Then, multiplying the IQR by the standard 1.5 multiplier (or the extreme multiplier of 3) and adding the result back to the q3 value gives us our upper threshold.

So for the county field:

IQR = q3 - q1 = 7 - 1 = 6
IQR * 1.5 = 6 * 1.5 = 9
threshold = q3 + 9 = 7 + 9 = 16
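Here is a small sketch of the same calculation in code, where the value lists would come from the cardinality numbers gathered above:

import numpy as np

def tukey_upper_thresholds(values):
    # Upper outlier thresholds: q3 + 1.5 * IQR and q3 + 3 * IQR.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr

# Sanity check with the county quartiles from the st output (q1 = 1, q3 = 7):
q1, q3 = 1, 7
iqr = q3 - q1
print(q3 + 1.5 * iqr, q3 + 3.0 * iqr)  # 16.0 25.0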

This gives us the threshold values in the table below.

Field     Threshold (1.5 x IQR)    Threshold (3 x IQR)
items     4,536                    7,198
language  4                        5
titles    6                        9
type      4                        5
country   1                        1
county    16                       25
states    4                        5
decade    20                       30
partner   1                        1
uses      147,997                  234,420

Moving forward we can use these thresholds as a way of saying "this field in this collection stands out from other collections" and make the corresponding box in the "At a Glance" feature a different color.
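As a hypothetical example of how these thresholds could be applied when rendering the boxes (the dictionary below just mirrors the 1.5 multiplier column from the table above):

THRESHOLDS = {"language": 4, "titles": 6, "type": 4, "country": 1,
              "county": 16, "states": 4, "decade": 20, "partner": 1}

def highlight(field, cardinality):
    # True when a collection's value for this field stands out from the norm.
    return cardinality > THRESHOLDS.get(field, float("inf"))

print(highlight("language", 10))  # True, well above the threshold of 4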

If you have questions or comments about this post,  please let me know via Twitter.