
Compressibility of the DPLA Creator Field by Hub

This is the second post in a series of posts exploring the metadata from the Digital Public Library of America.

In the first post I introduced the idea of using compressibility of a field as a measure of quality.

In this post I want to look specifically at the dc.creator field in the DPLA metadata dataset.

DC.Creator Overview

The first thing to do is to give you an overview of the creator field in the DPLA metadata dataset.

As I mentioned in the last post there are a total of 15,816,573 records in the dataset I’m working with.  These records are contributed from a wide range of institutions across the US through Hubs.  There are 32 hubs present in the dataset, along with 102 records that for one reason or another aren’t associated with a hub and have “None” for the hub name.

In the graph below you can see how the records are distributed across the different hubs.

Total Records by Hub

These are similar numbers to what you see in the more up-to-date numbers on the DPLA Partners page.

The next chart shows how the number of records per hub and the number of records with creator values compare.

Total Records and Records with Creators by Hub

You should expect that the red columns in the chart above will most often be shorter than the blue columns.

Below is a slightly different way of looking at the same data.  This time it is the percentage of records that contain a creator.

Records with Creator to Total Records

You see that a few of the hubs have almost 100% of their records with a creator, while others have a very low percentage of records with creators.

Looking at the number of records that have a creator value and then the total number of names you can see that some hubs like hathitrust have pretty much a 1 to 1 name to record ratio while others like nara have multiple names per record.

Total Creators and Name Instances

To get an even better sense of this you can look at the average number of creators/names per record. In this chart you see that david_rumsey has 2.49 creators per record, followed by nara at 2.03, bhl at 1.78, and internet_archive at 1.70. There are quite a few hubs (14) that have very close to 1 name per record on average.

Average Names Per Record

The next thing to look at is the number of unique names per hub.  The hathitrust hub sticks out again with the most unique names for a hub in the DPLA.

Unique Creators by Hub

Looking at the ratio between the number of unique names and number of creator instances you can see there is something interesting happening with the nara hub.  I put the chart below on a logarithmic scale so you can see things a little better.  Notice that nara has a 1,387:1 ratio between the number of unique creators and the creator instances.

Creator to Unique Ratio

One way to interpret this is that the hubs with a higher ratio have more records that share the same name/creator value.

Compressibility

Now that we have an overview of the creator field as a whole, we can turn our attention to the compressibility of the field for each hub.

I decided to compare the results of four different algorithms, lowercase hash, normalize hash, fingerprint hash, and aggressive fingerprint hash. Below is a table that shows the number of unique values for that field after each of the values has been hashed.  You will notice that as you read from left to right the number will go down.  This relates to the aggressiveness of the hashing algorithms being used.

Hub | Unique Names | Lowercase Hash | Normalize Hash | Fingerprint Hash | Aggressive Fingerprint Hash
artstor | 7,552 | 7,547 | 7,550 | 7,394 | 7,304
bhl | 44,936 | 44,927 | 44,916 | 44,441 | 42,960
cdl | 47,241 | 46,983 | 47,209 | 45,681 | 44,676
david_rumsey | 8,861 | 8,843 | 8,859 | 8,488 | 8,375
digital-commonwealth | 32,028 | 32,006 | 32,007 | 31,783 | 31,568
digitalnc | 31,016 | 30,997 | 31,006 | 30,039 | 29,730
esdn | 22,401 | 22,370 | 22,399 | 21,940 | 21,818
georgia | 21,821 | 21,792 | 21,821 | 21,521 | 21,237
getty | 2,788 | 2,787 | 2,787 | 2,731 | 2,724
gpo | 29,900 | 29,898 | 29,898 | 29,695 | 29,587
harvard | 4,865 | 4,864 | 4,855 | 4,845 | 4,829
hathitrust | 876,773 | 872,702 | 856,703 | 838,848 | 780,433
il | 16,014 | 15,971 | 15,983 | 15,569 | 15,409
indiana | 6,834 | 6,825 | 6,832 | 6,692 | 6,650
internet_archive | 105,381 | 105,302 | 104,820 | 102,390 | 99,729
kdl | 3,098 | 3,096 | 3,098 | 3,083 | 3,066
mdl | 69,617 | 69,562 | 69,609 | 69,013 | 68,756
michigan | 2,725 | 2,715 | 2,723 | 2,676 | 2,675
missouri-hub | 5,160 | 5,154 | 5,160 | 5,070 | 5,039
mwdl | 49,836 | 49,724 | 49,795 | 48,056 | 47,342
nara | 1,300 | 1,300 | 1,300 | 1,300 | 1,249
None | 21 | 21 | 21 | 21 | 21
nypl | 24,406 | 24,406 | 24,388 | 23,462 | 23,130
pennsylvania | 10,350 | 10,318 | 10,349 | 10,056 | 9,914
scdl | 11,976 | 11,823 | 11,973 | 11,577 | 11,368
smithsonian | 67,941 | 67,934 | 67,826 | 67,242 | 65,705
the_portal_to_texas_history | 28,686 | 28,653 | 28,662 | 28,154 | 28,066
tn | 2,561 | 2,556 | 2,561 | 2,487 | 2,464
uiuc | 3,524 | 3,514 | 3,522 | 3,470 | 3,453
usc | 10,085 | 10,061 | 10,071 | 9,872 | 9,785
virginia | 3,732 | 3,732 | 3,732 | 3,731 | 3,681
washington | 12,674 | 12,642 | 12,669 | 12,184 | 11,659
wisconsin | 19,973 | 19,954 | 19,960 | 19,359 | 19,127

Next I will work through each of the hashing algorithms and look at the compressibility of each field after the given algorithm has been applied.
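For reference, “compressibility” here is just the percent reduction in unique values after a hash is applied. A minimal sketch of that calculation (the helper name is mine, not code from the original analysis):

```python
def compressibility(unique_before, unique_after):
    """Percent reduction in unique values after applying a hash function."""
    return 100.0 * (1 - unique_after / unique_before)

# Using the hathitrust row of the table above: 876,773 unique names
# reduce to 856,703 under the normalize hash.
print(round(compressibility(876773, 856703), 1))  # -> 2.3
```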

Lowercase Hash: This hashing algorithm will convert all uppercase characters to lowercase and leave all lowercase characters unchanged.  The result of this is generally very low amounts of compressibility for each of the hubs.  You can see this in the chart below.

Lowercase Hash Compressibility

Normalize Hash: This hash just converts characters down to their ASCII equivalents.  For example it converts gödel to godel.  The compressibility results of this hashing function are quite a bit different from those of the lowercase hash above.  You see that hathitrust has 2.3% compressibility of its creator names.
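A rough Python sketch of such a normalize hash, using NFKD decomposition and dropping the non-ASCII bytes (the actual function used in the analysis may differ):

```python
import unicodedata

def normalize_hash(value):
    """Fold extended western characters down to their ASCII equivalents."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(normalize_hash("gödel"))  # -> godel
```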

Normalize Hash Compressibility

Fingerprint Hash: This uses the algorithm that OpenRefine describes in depth here.  The algorithm incorporates both a lowercase step and a normalize step in its overall process.  You can see that there is a bit more consistency between the different compressibility values.

Fingerprint Hash Compressibility

Aggressive Fingerprint Hash: This algorithm takes the basic fingerprint algorithm described above and adds one more step.  That step is to remove pieces of the name that are only numbers, such as dates.  This hashing function will most likely have more false positives than any of the previous algorithms, but it is interesting to look at the results.
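To illustrate the number-removal step, here is a sketch that layers it on a fingerprint-style hash; this is my approximation of the idea described above, not the exact code used in the analysis:

```python
import re
import unicodedata

def aggressive_fingerprint(value):
    """Fingerprint-style hash that also drops digit-only tokens such as dates."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]|_", " ", value)       # strip punctuation
    value = unicodedata.normalize("NFKD", value)   # decompose accented characters
    value = value.encode("ascii", "ignore").decode("ascii")
    tokens = sorted(set(t for t in value.split() if not t.isdigit()))
    return " ".join(tokens)

print(aggressive_fingerprint("Roeding, George Christian, 1868-1928"))
# -> christian george roeding
```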

Aggressive Fingerprint Hash Compressibility

This final chart puts together the four previous charts so they can be compared a bit easier.

All Compressibility

Conclusion

So now we’ve looked at the compressibility of the creator fields for each of the 32 hubs that make up the DPLA.

I’m not sure that I have any good takeaways so far in this analysis. I think there are a few other metrics that we should look at before we start saying if this information is or isn’t useful as a metric of metadata quality.

I do know that I was surprised by the compressibility of the hathitrust creators. This is especially interesting when you consider that the source for most of those records is MARC-based catalog records that in theory should be backed up by some sort of authority records. Other hubs, especially the service hubs, tend not to have records that are based as much on authority records.  Not really groundbreaking, but interesting to see in the data.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Metadata Fun: Compression as a measure of data quality

This past week I had the opportunity to participate in an IMLS-funded workshop about managing local authority records, hosted by Cornell University at the Library of Congress.  It was two days of discussions about issues related to managing local and aggregated name authority records. This meeting got me thinking more about names in our digital library metadata, both locally (at UNT) and in aggregations (DPLA).

It has been a while since I worked on a project with the DPLA metadata dataset that they provide for bulk download so I figured it was about time to grab a copy and poke around a bit.

This time around I’m interested in looking at some indicators of metadata quality.  Loosely, that means how well a set of metadata conforms to itself.  Specifically I want to look at how name values from the dc.creator, dc.contributor, and dc.publisher fields compare with each other.

I’ll give a bit of an overview to get us started.

Say we had these four values in a set of metadata for the dc.creator of an awesome movie:

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we sort these values, make them unique, and then count the instances, we will get the following.

1  Alexander Johan Hjalmar Skarsgard
1  Alexander Johan Hjalmar Skarsgård
1  Skarsgard, Alexander Johan Hjalmar
1  Skarsgård, Alexander Johan Hjalmar

So we have 4 unique name strings in our dataset.

If we applied a normalization algorithm that turned the letter å into an a and then tried to make our data unique we would end up with the following.

2  Alexander Johan Hjalmar Skarsgard
2  Skarsgard, Alexander Johan Hjalmar

Now we have only two name strings in the dataset, each with an instance count of two.

We can measure the compression rate by taking the original number of instances and dividing it by this new number.  4/2 = 2 or a 2:1 compression rate.

Another way to do it is to get the amount of space saved with this compression.  This is just a different equation.  1 – 2/4 = 0.5 or a 50% space savings.
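Those two measures can be written as tiny helpers (the function names are mine):

```python
def compression_ratio(before, after):
    """Compression rate: original unique count over hashed unique count."""
    return before / after

def space_savings(before, after):
    """Fraction of values eliminated, e.g. 4 values reduced to 2 is 0.5."""
    return 1 - after / before

print(compression_ratio(4, 2), space_savings(4, 2))  # -> 2.0 0.5
```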

If we apply an algorithm similar to the one that OpenRefine uses and calls a “fingerprint” we can get the following from our first four values.

4 alexander hjalmar johan skarsgard

Now we’ve gone from four values down to one for a 4:1 compression rate or we’ve created a 75% space savings.

Relation to Quality

When we go back to our first four examples, we can come to the opinion pretty quickly that these are most likely supposed to be the same name.

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we saw this in our databases we would want to clean these up.  They would most likely lead to poor faceting in our discovery interface.  If a user wanted to find other items that had a dc.creator of Skarsgård, Alexander Johan Hjalmar, it is possible that they wouldn’t find any of the other three items when they clicked on a link to show more.

If we can agree that reducing the number of “near matches” in the dataset is an improvement, we might be able to use these data compression measures as a way of identifying which parts of a digital library might have consistency problems.

That’s exactly what I’m proposing to do here.  I want to find out what happens when we run a number of different algorithms over the values of dc.creator, dc.contributor, and dc.publisher in the DPLA metadata set, and see how much each algorithm compresses the data.

Preparing the Data

I’m going to start with the all.json.gz file from the DPLA’s bulk metadata download page.

This file is a very large json file containing 15,816,573 records from the April 2017 DPLA metadata dump.

The first thing that I want to do is to reduce this dataset, which is 6.1GB compressed, to something a little more manageable.  I will start with the dc.creator information.  I will use a command for the wonderful tool jq that gets me what I want.

jq -n --stream --compact-output '. | fromstream(1|truncate_stream(inputs)) | {provider: (._source.provider["@id"]), id: (._source.id), creator: ._source.sourceResource.creator?}'

The command I used above will transform each of the records in the DPLA dataset into something that looks like this:

{"provider":"http://dp.la/api/contributor/uiuc","id":"705e1e5f19331a6c8a554ce707059288","creator":null}
{"provider":"http://dp.la/api/contributor/uiuc","id":"bcae15d47f2544caf0407b1e17bf97cd","creator":["Harlow, G","Rogers, J"]}
{"provider":"http://dp.la/api/contributor/uiuc","id":"96cab3354d942e7ea2030f1452f5beb8","creator":["Drummond, S","Ridley, W"]}
{"provider":"http://dp.la/api/contributor/uiuc","id":"e3ce5090d0a8b3c247c84d6f0d5ff16e","creator":["Barber, J.T","Cardon, A"]}

This is now a large file with one small snippet of json on each line.  I can write straightforward Python scripts to process these lines and do some of the heavy lifting for analysis.
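As a sketch of that heavy lifting, a generator like the following (the names are mine) is enough to pull the creator strings back out of the line-delimited file:

```python
import json

def iter_creators(path):
    """Yield every creator string from a file with one JSON object per
    line, shaped like the jq output above.  Records whose creator is
    null are skipped."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            for name in record.get("creator") or []:
                yield name
```

Collecting the yielded names into a sorted list (or a set, for the unique values) gives the kinds of counts used below.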

For this first pass I’m interested in all of the dc.creators in the whole DPLA dataset to measure the overall compression.

Here is a short set of these values.

Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Committee on Merchant Marine and Fisheries
Herdman, W. A. Sir, (William Abbott), 1858-1924
United States. Committee on Merchant Marine and Fisheries
Henderson, Joseph C
Fancher Creek Nurseries
Roeding, George Christian, 1868-1928
Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Animal and Plant Health Inspection Service
United States. Bureau of Entomology and Plant Quarantine
United States. Plant Pest Control Branch
United States. Plant Pest Control Division

The full list is 10,413,292 lines long when I ignore record instances that don’t have any value for creator.

The next thing to do is sort that list and make it unique, which leaves me with 1,445,688 unique creators in the DPLA metadata dataset.

Compressing the Data

For the first pass through the data I am going to use the “fingerprint algorithm” that OpenRefine describes in depth here.

The basics are as follows (from OpenRefine’s documentation)

  • remove leading and trailing whitespace
  • change all characters to their lowercase representation
  • remove all punctuation and control characters
  • split the string into whitespace-separated tokens
  • sort the tokens and remove duplicates
  • join the tokens back together
  • normalize extended western characters to their ASCII representation (for example “gödel” → “godel”)

If you’re curious, the code that performs this in OpenRefine is here.
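The steps above translate into a short Python function; this is a sketch of the documented algorithm, not OpenRefine's actual Java implementation:

```python
import re
import unicodedata

def fingerprint(value):
    """Fingerprint keying: lowercase, strip punctuation, ASCII-fold,
    then sort and de-duplicate the whitespace-separated tokens."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]|_", " ", value)       # punctuation / control chars
    value = unicodedata.normalize("NFKD", value)   # decompose accented characters
    value = value.encode("ascii", "ignore").decode("ascii")
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

print(fingerprint("Skarsgård, Alexander Johan Hjalmar"))
# -> alexander hjalmar johan skarsgard
```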

The next steps are to run this fingerprinting algorithm on each of the 1,445,688 creators, sort the resulting hash values, make them unique, and count the remaining lines.  This gives you the number of unique creators based on the fingerprint algorithm.

I end up with 1,365,922 unique creator values based on the fingerprint.

That comes to a reduction of 5.52% of the unique values.

To give you an idea of what this looks like in practice, there are eleven different creator instances that have the fingerprint of “akademiia imperatorskaia nauk russia”.

  • Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskai͡a akademii͡a nauk (Russia)
  • Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)
  • Imperatorskaia akademiia nauk (Russia)
  • Imperatorskai͡a akademïi͡a nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)

These 11 different versions of this name are distributed among five different DPLA Hubs.

Below is a table showing how the different versions are distributed across hubs.

Name | Records | bhl | hathitrust | internet_archive | nypl | smithsonian
Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia) | 1 | 0 | 1 | 0 | 0 | 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) | 13 | 0 | 11 | 2 | 0 | 0
Imperatorskai͡a akademīi͡a nauk (Russia) | 7 | 0 | 7 | 0 | 0 | 0
Imperatorskai͡a akademii͡a nauk (Russia) | 3 | 0 | 3 | 0 | 0 | 0
Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia) | 1 | 0 | 1 | 0 | 0 | 0
Imperatorskaia akademīia nauk (Russia) | 13 | 0 | 0 | 0 | 0 | 13
Imperatorskaia akademiia nauk (Russia) | 4 | 0 | 0 | 0 | 4 | 0
Imperatorskai͡a akademïi͡a nauk (Russia) | 1 | 0 | 1 | 0 | 0 | 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) | 11 | 0 | 11 | 0 | 0 | 0
Imperatorskai͡a akademīi͡a nauk (Russia) | 13 | 0 | 13 | 0 | 0 | 0
Imperatorskaia akademīia nauk (Russia) | 211 | 211 | 0 | 0 | 0 | 0

When you look at the table you will see that bhl, internet_archive, nypl, and smithsonian each have their preferred way of representing this name.  Hathitrust, however, has eight different ways that it represents this single creator name in its dataset.

Next Steps

This post hopefully introduced the idea of using “field compressions” for name fields like dc.creator, dc.contributor, and dc.publisher as a way of looking at metadata quality in a dataset.

We calculated the amount of compression using OpenRefine’s fingerprint algorithm for the DPLA creator fields.  This ends up being 5.52% compression.

In the next few posts I will compare the different DPLA Hubs to see how they compare with each other.  I will probably play with a few different algorithms for creating the hash values I use.  Finally I will calculate a few metrics in addition to just the unique values (cardinality) of the field.

If you have questions or comments about this post,  please let me know via Twitter.