User Session Analysis: Investigating Sessions

In the previous post in this series I laid out the work that we were going to do with session data from the UNT Libraries’ Digital Collections.  Take a quick look at that post for the background that this one builds on.

In this post we are going to look at the data for the 10,427,111 user sessions that we generated from the 2017 Apache access logs from the UNT Libraries Digital Collections.

Items Per Session

The first thing that we will take a look at in the dataset is information about how many different digital objects or items are viewed during a session.

Items Accessed Sessions Percentage of All Sessions
1 8,979,144 86.11%
2 809,892 7.77%
3 246,089 2.36%
4 114,748 1.10%
5 65,510 0.63%
6 41,693 0.40%
7 29,145 0.28%
8 22,123 0.21%
9 16,574 0.16%
10 15,024 0.14%
11 10,726 0.10%
12 9,087 0.09%
13 7,688 0.07%
14 6,266 0.06%
15 5,569 0.05%
16 4,618 0.04%
17 4,159 0.04%
18 3,540 0.03%
19 3,145 0.03%
20-29 17,917 0.17%
30-39 5,813 0.06%
40-49 2,736 0.03%
50-59 1,302 0.01%
60-69 634 0.01%
70-79 425 0.00%
80-89 380 0.00%
90-99 419 0.00%
100-199 2,026 0.02%
200-299 411 0.00%
300-399 105 0.00%
400-499 63 0.00%
500-599 24 0.00%
600-699 43 0.00%
700-799 28 0.00%
800-899 20 0.00%
900-999 6 0.00%
1000+ 19 0.00%

I grouped the item uses per session in order to make the table a little easier to read.  With 86% of sessions consisting of a single item access, that leaves 14% of sessions with more than one item access.  That is still 1,447,967 sessions that we can look at in the dataset, so not bad.
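For readers who want to reproduce the grouping, here is a sketch (assuming the per-session item counts have already been extracted into a list) of the bucketing used in the table above:

```python
from collections import Counter

def bucket_label(n):
    """Group an items-per-session count into the buckets used in the table."""
    if n < 20:
        return str(n)
    if n < 100:                      # 20-29, 30-39, ... 90-99
        low = (n // 10) * 10
        return f"{low}-{low + 9}"
    if n < 1000:                     # 100-199, ... 900-999
        low = (n // 100) * 100
        return f"{low}-{low + 99}"
    return "1000+"

def bucket_counts(items_per_session):
    """Count sessions per bucket and report each bucket's share of all sessions."""
    total = len(items_per_session)
    counts = Counter(bucket_label(n) for n in items_per_session)
    return {label: (count, count / total) for label, count in counts.items()}
```

Feeding the real 10,427,111 counts through `bucket_counts` would produce the sessions and percentage columns shown above.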

You can also see that there are a few sessions that have a very large number of items associated with them.  For example, there are 19 sessions that used over 1,000 items.  I would guess that these are some sort of script or harvester masquerading as a browser.

Here are some descriptive statistics for the items per session data.

N Min Median Max Mean Stdev
10,427,111 1 1 1,828 1.53 4.735

For further analysis we will probably restrict our sessions to those that used fewer than 20 items in a single session.  While this might remove some legitimate sessions that used a large number of items, it will give us numbers that we can feel a bit more confident about.  That leaves 1,415,596 sessions, or 98% of the sessions with more than one item used, in the dataset for further analysis.

Duration of Sessions

The next thing we will look at is the duration of sessions in the dataset.  We limited a single session to all interactions by an IP address in a thirty-minute window, so sessions can be up to 1,800 seconds long.

Minutes Sessions Percentage of Sessions
0 8,539,553 81.9%
1 417,601 4.0%
2 220,343 2.1%
3 146,100 1.4%
4 107,981 1.0%
5 87,037 0.8%
6 71,666 0.7%
7 60,965 0.6%
8 53,245 0.5%
9 47,090 0.5%
10 42,428 0.4%
11 38,363 0.4%
12 35,622 0.3%
13 33,110 0.3%
14 31,304 0.3%
15 29,564 0.3%
16 27,731 0.3%
17 26,901 0.3%
18 25,756 0.2%
19 24,961 0.2%
20 32,789 0.3%
21 24,904 0.2%
22 24,220 0.2%
23 23,925 0.2%
24 24,088 0.2%
25 24,996 0.2%
26 26,855 0.3%
27 30,177 0.3%
28 39,114 0.4%
29 108,722 1.0%

The table above groups the sessions into one-minute buckets.  The biggest bucket by number of sessions is the 0-minute bucket.  This bucket holds sessions that are up to 59 seconds in length and accounts for 8,539,553, or 82%, of the sessions in the dataset.

Duration Sessions Percent of Sessions Under 1 Min
0 sec 5,892,556 69%
1-9 sec 1,476,112 17%
10-19 sec 478,262 6%
20-29 sec 257,916 3%
30-39 sec 181,326 2%
40-49 sec 140,492 2%
50-59 sec 112,889 1%

You might be wondering about those sessions that lasted zero seconds.  There are 5,892,556 of them, which is 69% of the sessions that were under one minute.  These are almost always sessions that used an item through an embedded link, a PDF viewed directly from another site (Google, Twitter, a webpage), or a similar kind of view.

Next Steps

This post helped us get a better look at the data that we are working with.  There is a bit of strangeness here and there in the data but this is pretty normal for situations where you work with access logs.  The Web is a strange place full of people, spiders, bots, and scripts.

Next up we will actually dig into some of the research questions we had in the first post.  We know how we are going to limit our data a bit to get rid of some of the outliers in the number of items used and we’ve given a bit of information about the large number of very short duration sessions.  So more to come.

If you have questions or comments about this post, please let me know via Twitter.

User Session Analysis: Connections Between Collections, Type, Institutions

I’ve been putting off some analysis that a few of us at the UNT Libraries have wanted to do with the log files of the UNT Libraries Digital Collections.  This post (and probably a short series to follow) is an effort to get back on track.

There are three systems that we use to provide access to content and those include: The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

In our digital collections there are a few things that we’ve said over time that we feel very strongly about but which we’ve never really measured.  First off we have said that there is value in co-locating all of our content in the same fairly uniform system instead of building visually and functionally distinct systems for different collections of items.  So instead of each new project or collection going into a new system, we’ve said there is not only cost savings, but real value in putting them all together in a single system.  We’ve said “there is an opportunity for users to not only find content from your collection, but they could find useful connections to other items in the overall digital library”.

Another thing we’ve said is that there is value in putting all different types of digital objects together into our digital systems.  We put the newspapers, photographs, maps, audio, video, and datasets together and we think there is value in that.  We’ve said that users will be able to find newspaper issues, photographs, and maps that might meet their need.  If we had a separate newspaper system and a separate video or audio system, some of this cross-type discovery would never take place.

Finally we’ve said that there is great value in locating collections from many institutions together in a system like The Portal to Texas History.  We thought (and still think) that users would be able to do a search and it will pull resources together from across institutions in Texas that have matching resources. Because of the geography of the state, you might be finding things that are physically located 10 or 12 hours away from each other at different institutions. In the Portal, these could be displayed together, something that would be challenging if they weren’t co-located in a system.

In our minds these aren’t completely crazy concepts but we do run into other institutions and practitioners who don’t always feel as strongly about this as we do.  The one thing that we’ve never done locally is look at the usage data of the systems and find out:

  • Do users discover and use items from different collections?
  • Do users discover and use items that are different types?
  • Do users discover and use items that are from different contributing partners?

This blog post is going to be the first in a short series that takes a look at the usage data in the UNT Libraries Digital Collections in an attempt to answer some of these questions.

Hopefully that is enough background, now let’s get started:

How to Answer the Questions

In order to get started we had to think a little bit about how we wanted to pull together data on this.  We have been generating item-based usage for the digital library collections for a while.  These get aggregated into collection and partner statistics that we make available in the different systems.  The problem with this data is that it just shows what items were used and how many times in a day they were used.  It doesn’t show what was used together.

We decided that we needed to go back to the log files from the digital collections and re-create user sessions to group item usage together.  After we have information about what items were used together we can sprinkle in some metadata about those items and start answering our questions.

With that as a plan we can move to the next step.

Preparing the Data

We decided to use all of the log files for 2017 from our digital collections servers.  This ends up being 1,379,439,042 lines of Apache access logs (geez, over 1.3 billion, or 3.7 million server requests a day).  The data came from two different servers that collectively host all of the application traffic for the three systems that make up the UNT Libraries’ Digital Collections.

We decided that we would define a session as all of the interactions that a single IP address has with the system in a 30 minute window.  If a user uses the system for more than 30 minutes, say 45 minutes, that would count as one thirty minute session and one fifteen minute session.

We started by writing a script that would do three things.  First, it would ignore lines in the log file that were from robots and crawlers.  We have a pretty decent list of these bots so those were easy to remove.  Next, we further reduced the data by only looking at digital object accesses, specifically lines that looked something like `/ark:/67531/metapth1000000/`.  This pattern in our system denotes an item access, and these are what we were interested in.  Finally, we were only concerned with accesses that returned content, so we only looked at lines that returned a 200 status code.
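As a rough sketch of that filtering step, here is what one line-level filter might look like.  The bot list and the regular expressions are simplified stand-ins for the real ones, and I am assuming the standard Apache combined log format:

```python
import re

# Hypothetical, heavily abbreviated bot list; the real one is much longer.
BOT_PATTERN = re.compile(r"googlebot|bingbot|crawler|spider", re.IGNORECASE)

# An item access: a GET for an ark path that returned a 200 status code.
ARK_PATTERN = re.compile(r'"GET (/ark:/67531/[^" ]+) HTTP/[0-9.]+" 200 ')

def filter_line(line):
    """Return the ark path for a successful item access, or None.

    Skips lines whose user-agent matches the (abbreviated) bot list,
    lines that are not item accesses, and non-200 responses.
    """
    if BOT_PATTERN.search(line):
        return None
    match = ARK_PATTERN.search(line)
    return match.group(1) if match else None
```

Running every log line through a filter like this is what shrinks 1.3 billion lines down to the 144 million item accesses.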

We filtered the log files down to three columns of data.  The first column was the timestamp for when the HTTP access was made, the second column was the hash of the IP address used to make the request, and the final column was the digital item path requested.  This resulted in a much smaller dataset to work with, from 1,379,439,042 down to 144,405,009 individual lines of data.

Here is what a snippet of the data looks like:

1500192934      dce4e45d9a90e4a031201b876a70ec0e  /ark:/67531/metadc11591/m2/1/high_res_d/Bulletin6869.pdf
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/43/high_res/
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/41/high_res/
1500192944      b63927e2b8817600aadb18d3c9ab1557  /ark:/67531/metadc33192/m2/1/high_res_d/dissertation.pdf
1500192945      accb4887d609f8ef307d81679369bfb0  /ark:/67531/metacrs10285/m1/1/high_res_d/RS20643_2006May24.pdf
1500192948      decabc91fc670162bad9b41042814080  /ark:/67531/metadc504184/m1/2/small_res/
1500192949      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/
1500192951      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/small_res/
1500192950      c8a320f38b3477a931fabd208f25c219  /ark:/67531/metadc1729/m1/9/med_res_d/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/med_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/3/small_res/
1500192953      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/2/small_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/4/small_res/
1500192955      67ef5c0798dd16cb688b94137b175f0b  /ark:/67531/metadc848614/m1/2/small_res/
1500192963      a19ce3e92cd3221e81b6c3084df2d4a6  /ark:/67531/metadc5270/m1/254/med_res/
1500192961      ea9ba7d064412a6d09ff708c6e95e201  /ark:/67531/metadc85867/m1/4/high_res/

You can see the three columns in the data there.

The next step was to sort all of this data by the timestamp in the first column.  You might notice that not all of the lines are in chronological order in the sample above.  Sorting on the timestamp puts everything into chronological order.

The next step was to further reduce this data down into sessions.  We created a short script that we could feed the data into; it keeps track of the IP hashes it comes across, notes the objects each IP hash used, and after a thirty-minute period of time (based on the timestamps) starts the aggregation again.
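The aggregation step can be sketched in a few lines of Python.  This assumes the window is measured from the first request of a session, which matches the 45-minutes-becomes-30-plus-15 behavior described earlier, and the function and variable names are mine, not the actual script's:

```python
import re

WINDOW = 30 * 60  # thirty minutes, in seconds
ARK_RE = re.compile(r"/ark:/67531/([^/]+)/")

def sessionize(rows):
    """Yield one session dict per (ip_hash, thirty-minute window).

    `rows` must be sorted by timestamp; each row is (timestamp, ip_hash, path).
    """
    open_sessions = {}  # ip_hash -> session currently being built
    for timestamp, ip_hash, path in rows:
        session = open_sessions.get(ip_hash)
        # If this hit falls outside the current window, close that session.
        if session and timestamp - session["timestamp_start"] >= WINDOW:
            yield session
            session = None
        if session is None:
            session = {"ip_hash": ip_hash, "arks": [],
                       "timestamp_start": timestamp, "timestamp_end": timestamp}
            open_sessions[ip_hash] = session
        session["timestamp_end"] = timestamp
        match = ARK_RE.search(path)
        if match and match.group(1) not in session["arks"]:
            session["arks"].append(match.group(1))  # dedupe arks per session
    # Flush whatever is still open when the input runs out.
    yield from open_sessions.values()
```

Each yielded dict can then be serialized with `json.dumps` to give the line-oriented output described below.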

The result was a short JSON structure that looked like this.

{
  "arks": ["metapth643331", "metapth656112"],
  "ip_hash": "85ebfe3f0b71c9b41e03ead92906e390",
  "timestamp_end": 1483254738,
  "timestamp_start": 1483252967
}

This JSON has the IP hash, the starting and ending timestamps for the session, and finally the items that were used.  Each of these JSON structures was written out as a single line, giving us a line-oriented JSON file that gets used in the following steps.

This new line-oriented JSON file is 10,427,111 lines long, with each line representing a single user session for the UNT Libraries’ Digital Collections.  I think that’s pretty cool.
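Reading a file like this back out for analysis is straightforward; here is a small sketch, assuming one JSON object per line with the keys shown above:

```python
import json

def iter_sessions(path):
    """Yield one session dict per line of a line-oriented JSON file."""
    with open(path) as handle:
        for line in handle:
            yield json.loads(line)

def session_stats(session):
    """Items used and duration in seconds for a single session."""
    duration = session["timestamp_end"] - session["timestamp_start"]
    return len(session["arks"]), duration
```

Iterating like this, one line at a time, keeps memory use flat even with ten million sessions in the file, and `session_stats` is all that is needed to build the items-per-session and duration tables in the analysis post.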

I’m going to wrap up this post here, but in the next post I will take a look at what these user sessions look like with a little bit of sorting, grouping, plotting, and graphing.

If you have questions or comments about this post, please let me know via Twitter.

Metadata Quality Interfaces: Cluster Dashboard (OpenRefine Clustering Baked Right In)

This is the last of the updates from our summer’s activities in creating new metadata interfaces for the UNT Libraries Digital Collections.  If you are interested in the others in this series you can view the past few posts on this blog where I talk about our facet, count, search, and item interfaces.

This time I am going to talk a bit about our Cluster Dashboard.  This interface took a little bit longer than the others to complete.  Because of this, we are just rolling it out this week, but it is before Autumn so I’m calling it a Summer interface.

I warn you that there are going to be a bunch of screenshots here, so if you don’t like those, you probably won’t like this post.

Cluster Dashboard

For a number of years I have been using OpenRefine for working with spreadsheets of data before we load them into our digital repository.  This tool has a number of great features that help you get an overview of the data you are working with, as well as identify some problem areas that you should think about cleaning up.  The feature that I have always felt was the most interesting is the data clustering interface.  The idea of this interface is that you choose a facet (dimension, column) of your data and then group like values together.  There are a number of ways of doing this grouping, and for an in-depth discussion of those algorithms I will point you to the wonderful OpenRefine Clustering documentation.

OpenRefine is a wonderful tool for working with spreadsheets (and a whole bunch of other types of data) but there are a few challenges that you run into when you are working with data from our digital library collections.  First of all, our data generally isn’t rectangular; it doesn’t easily fit into a spreadsheet.  We have some records with one creator and some records with dozens of creators.  There are ways to work with these multiple values but things get complicated.  The bigger challenge is that while many systems can export a spreadsheet of their data, very few of them (our system included) have a way of importing those changes back in a spreadsheet format.  This means that while you could pull data from the system and clean it up in OpenRefine, when you were ready to put it back you would run into the problem that there wasn’t a way to get that nice clean data into the system.  One way you could still use OpenRefine was to identify records to change and then go back into the system and change each record there, but that is far from ideal.

So how did we overcome this? We wanted to use the OpenRefine clustering but couldn’t get data easily back into our system.  Our solution?  Bake the OpenRefine clustering right into the system.  That’s what this post is about.

The first thing you see when you load up the Cluster Dashboard is a quick bit of information about how many records, collections, and partners you are going to be working with values from.  This is helpful to let you know the scope of what you are clustering, both to understand why it might take a while to generate clusters and because it is generally better to run these clustering tools over the largest sets of data that you can, since that pulls in variations from many different records.  Other than that, you are presented with a pretty standard dashboard interface from the UNT Libraries’ Edit System.  You can limit to subsets of records with the facets on the left side and the number of items you cluster over will change accordingly.

Cluster Dashboard

The next thing that you will see is a little help box below the clustering stats.  This box explains how to use the Cluster Dashboard and gives a little more information about how the different algorithms work.  Metadata folks generally like to know the fine details about how the algorithms work, or at least be able to find that information if they want it later.

Cluster Dashboard Help

The first thing you do is select a field/element/facet that you are interested in clustering. In the example below I’m going to select the Contributor field.

Choosing an Element to Cluster

Once you make a selection you can further limit it to a qualifier; in this case you could limit it to just the Contributors that are organizations, or Contributors that are Composers.  As I said above, using more data generally works better, so we will just run the algorithms over all of the values.  You next have the option of choosing an algorithm for your clustering.  We recommend that people start with the default Fingerprint algorithm because it is a great starting point.  I will discuss the other algorithms later in this post.

Choosing an Algorithm

After you select your algorithm, you hit submit and things start working.  You are given a screen that will have a spinner that tells you the clusters are generating.

Generating Clusters

Depending on your dataset size and the number of unique values of the selected element, you could get your results back in a second or in dozens of seconds.  The general flow after you hit submit is to query the Solr backend for all of the facet values and their counts.  These values are then processed with the chosen algorithm, which creates a “key” for each value.  Another way to think about it is that the values are placed into buckets that group similar values together.  There are some calculations that are performed on the clusters and then they are cached for about ten minutes by the system.  After you wait for the clusters to generate the first time, they come back much more quickly for the next ten minutes.
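The grouping step can be sketched like this.  The field names and statistics here are illustrative; the actual calculations in the system may differ:

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_clusters(facet_counts, key_func):
    """Group (value, record_count) facet pairs into clusters by key_func.

    Only keys that collect more than one distinct value form a useful cluster.
    """
    buckets = defaultdict(list)
    for value, count in facet_counts:
        buckets[key_func(value)].append((value, count))

    clusters = []
    for key, members in buckets.items():
        if len(members) < 2:
            continue  # nothing to reconcile
        lengths = [len(value) for value, _ in members]
        clusters.append({
            "key": key,
            "members": len(members),                      # distinct variants
            "records": sum(c for _, c in members),        # records involved
            "avg_length": mean(lengths),                  # for length sorts
            "length_variation": pstdev(lengths),
            "values": members,
        })
    return clusters
```

Sorting this list of dicts by `members`, `records`, `avg_length`, or `length_variation` gives exactly the sort options described below.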

In the screen below you can see the results of this first clustering.  I will go into detail about the values and options you have to work with the clusters.

Contributor Clusters with Fingerprint Key Collision Hashing

The first thing that you might want to do is sort the clusters in a different way.  By default they are sorted by the value of the cluster key.  Sometimes this ordering makes sense; sometimes it isn’t obvious why something appears in a given order.  We thought about displaying the key but found that it was distracting in the interface.

Different ways of sorting clusters

One of the ways that I like to sort the clusters is by the number of cluster Members.  The image below shows the clusters with this sort applied.

Contributor Field sorted by Members

Here is a more detailed view of a few clusters.  You can see that the name of the Russian composer Shostakovich has been grouped into a cluster of 14 members.  This represents 125 different records in the system with a Contributor element for this composer.  Next to each Member Value you will see a number in parentheses; this is the number of records that use that variation of the value.

Contributor Cluster Detail

You can also sort based on the number of records that a cluster contains.  This brings up the most frequently used values.  Generally a large number of records share one value and a few records have a competing value, which is usually pretty easy to fix.

Contributor Element sorted by Records

Sorting by the Average Length Variation can help find values that are strange duplications of themselves.  Repeated phrases, a double copy and paste, strange things like that come to the surface.

Contributor Element sorted by Average Length Variation

Finally sorting by Average Length is helpful if you want to work with the longest or shortest values that are similar.

Contributor Element sorted by Average Length

Different Algorithms

I’m going to go through the different algorithms that we currently have in production.  Our hope is that as time moves forward we will introduce new algorithms or slight variations of algorithms to really get at some of the oddities of the data in the system.  First up is the Fingerprint algorithm.  This is a direct clone of the default fingerprint algorithm used by OpenRefine.

Contributor Element Clustered using Fingerprint Key Collision
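For the curious, the Fingerprint keying can be sketched in a few lines of Python, following the OpenRefine description: trim, lowercase, replace punctuation with whitespace, fold accented characters to ASCII, then sort and dedupe the tokens.

```python
import re
import unicodedata

PUNCT = re.compile(r"[^\w\s]")

def fingerprint(value):
    """OpenRefine-style fingerprint key for a string value."""
    value = value.strip().lower()
    value = PUNCT.sub(" ", value)                    # punctuation -> whitespace
    value = unicodedata.normalize("NFKD", value)     # split off accents...
    value = value.encode("ascii", "ignore").decode("ascii")  # ...and drop them
    tokens = sorted(set(value.split()))              # sort and dedupe tokens
    return " ".join(tokens)
```

Two values land in the same cluster exactly when their fingerprints are equal, so word order, case, punctuation, and diacritics all stop mattering.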

A small variation we introduced: instead of replacing punctuation with a whitespace character, Fingerprint-NS (No Space) just removes the punctuation without adding whitespace.  This would group F.B.I. with FBI, which the default Fingerprint algorithm wouldn’t do.  This small variation surfaces different clusters.  We had to keep reminding ourselves when we created these algorithms that there isn’t such a thing as “best” or “better”; they are just “different”.

Contributor Element Clustered using Fingerprint (No Space) Key Collision

One thing that is really common for names in bibliographic metadata is that they have many dates: birth, death, flourished, and so on.  We have a variation of the Fingerprint algorithm that removes all numbers in addition to punctuation.  We call this one Fingerprint-ND (No Dates).  This is helpful for grouping names that are missing dates with versions of the name that have dates.  In the second cluster below I pointed out an instance of Mozart’s name that wouldn’t have been grouped by the default Fingerprint algorithm.  Remember: different, not better or best.

Contributor Element Clustered using Fingerprint (No Dates) Key Collision

From there we branch out into a few simpler algorithms.  The Caseless algorithm just lowercases all of the values, so you see clusters whose members differ only in upper or lower case.

Contributor Element Clustered using Caseless (lowercase) Key Collision

Next up is the ASCII algorithm, which tries to group together values that differ only in diacritics.  So for instance the names Jose and José would be grouped together.

Contributor Element Clustered using ASCII Key Collision

The final algorithm is a simple whitespace normalization called Normalize Whitespace; it collapses consecutive whitespace characters to group values.

Contributor Element Clustered using Normalized Whitespace Key Collision
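Here is a sketch of how the keys for these variant algorithms might be computed.  The shared helper and its options are my own framing, not the production code:

```python
import re
import unicodedata

def _fingerprint_base(value, strip_digits=False, punct_to_space=True):
    """Shared helper for the Fingerprint family of keys (a sketch)."""
    value = value.strip().lower()
    pattern = r"[^\w\s]|\d+" if strip_digits else r"[^\w\s]"
    value = re.sub(pattern, " " if punct_to_space else "", value)
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    return " ".join(sorted(set(value.split())))

def fingerprint_ns(value):
    """Fingerprint-NS: punctuation removed without adding whitespace,
    so F.B.I and FBI share a key."""
    return _fingerprint_base(value, punct_to_space=False)

def fingerprint_nd(value):
    """Fingerprint-ND: numbers (birth/death dates) removed as well."""
    return _fingerprint_base(value, strip_digits=True)

def caseless(value):
    """Caseless: group values that differ only in case."""
    return value.lower()

def ascii_key(value):
    """ASCII: group values that differ only in diacritics."""
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()

def normalize_whitespace(value):
    """Normalize Whitespace: collapse runs of whitespace."""
    return " ".join(value.split())
```

Because the simpler keys (Caseless, ASCII, Normalize Whitespace) change values less, they collide less often, which is why they produce far fewer clusters than the Fingerprint family.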

You may have noticed that the number of clusters went down dramatically from the Fingerprint algorithms to Caseless, ASCII, or Normalize Whitespace.  This is why we generally want people to start with the Fingerprint algorithms: they will be useful most of the time.

Other Example Elements

Here are a few more examples from other fields.  I’ve gone ahead and sorted them by Members (High to Low) because I think that’s the best way to see the value of this interface.  First up is the Creator field.

Creator Element clustered with Fingerprint algorithm and sorted by Members

Next up is the Subject field.  We have so, so many ways of saying “OU Football”.

Subject Element clustered with Fingerprint algorithm and sorted by Members

The real power of this interface shows when you start fixing things.  In the example below I want to focus on the value “Football (O U )”, which I do by clicking the link for that Member Value.

Subject Element Cluster Detail

You are taken directly to a result set that has the records for that selected value.  In this case there are two records with “Football (O U )”.

Selected Records

All you have to do at this point is open up a record, make the edit, and publish that record back.  Many of you will say, “Yeah, but wouldn’t some sort of batch editing be faster here?”  And I will answer, “Absolutely, we are going to look into how we would do that!” (but it is a non-trivial activity due to how we manage and store metadata, so sadface 🙁).

Subject Value in the Record

There you have it, the Cluster Dashboard and how it works.  The hope is to empower our metadata creators and metadata managers to better understand and, if needed, clean up the values in our metadata records.  By doing so we are improving the ability for people to connect different records based on common values between the records.

As we move forward we will introduce a number of other algorithms that we can use to cluster values.  There are also some other metrics that we will look at for sorting clusters, to try to tease out which clusters would be the most helpful for our users to correct first.  That is always something we are keeping in the backs of our heads: how can we provide a sorted list of the things most in need of human fixing?  So if you are interested in that sort of thing, stay tuned; I will probably talk about it on this blog.

If you have questions or comments about this post, please let me know via Twitter.

Metadata Interfaces: Search Dashboard

This is the next blog post in a series that discusses some of the metadata interfaces that we have been working on improving over the summer for the UNT Libraries Digital Collections.  You can catch up on those posts about our Item Views, Facet Dashboard, and Element Count Dashboard if you are curious.

In this post I’m going to talk about our Search Dashboard.  This dashboard is really the bread and butter of our whole metadata editing application.  About 99% of the time, a user who is doing some metadata work will log in and work with this interface to find the records that they need to create or edit.  The records that they can see and search are only the ones that they have privileges to edit.  In this post you will see what I see when I log in to the system: the nearly 1.9 million records that we are currently managing in our systems.

Let’s get started.

Search Dashboard

If you have read the other posts you will probably notice quite a bit of similarity between the interfaces; all of those other interfaces were based off of this search interface.  You can divide the dashboard into three primary sections.  On the left side there are facets that allow you to refine your view in a number of ways.  At the top of the right column is an area where you can search for a term or phrase you are interested in.  Finally, under the search box there is a result set of items and various ways to interact with those results.

By default all the records that you have access to are viewable if you haven’t refined your view with a search or a limiting facet.

Edit Interface Search Dashboard

The search section of the dashboard lets you find a specific record or set of records that you are interested in working with.  You can choose to search across all of the fields in the metadata record or just a specific metadata field using the dropdown next to where you enter your search term.  You can search single words, phrases, or unique identifiers for records if you have those.  Once you hit the search button you are on your way.

Search and View Options for Records

Once you have submitted your search you will get back a set of results.  I’ll go over these more in depth in a little bit.

Record Detail

You can sort your results in a variety of ways.  By default they are returned in Title order, but you can sort them by the date they were added to the system, the date the original item was created, the date the metadata record was last modified, the ARK identifier, and finally by a completeness metric.  You also have the option to change your view from the default list view to the grid view.

Sort Options

Here is a look at the grid view.  It presents a more visually compact view of the records you might be interested in working with.

Grid View

The image below is a detail of a record view.  We tried to pack as much useful information into each row as we could.  On the left part of the row we have the title, a thumbnail, and several links to either the edit or summary item view.  Following that we have the system, collection, and partner that the record belongs to.  We have the unique ARK identifier for the object, the date that it was added to the UNT Libraries’ Digital Collections, and the date the metadata was last modified.  Finally we have a green check if the item is visible to the public or a red X if the item is hidden from the public.

Record Detail

Facet Section

There are a number of different facets that a user can use to limit the records they are working with to a smaller subset.  The list is pretty long so I’ll first show you it in a single image and then go over some of the specifics in more detail below.

Facet Options

The first three facets are the system, collection and partner facets.  We have three systems that we manage records for with this interface, The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

Each digital item can belong to multiple collections and generally belongs to a single partner organization.  If you are interested in just working on the records for the KXAS-NBC 5 News Collection, you can limit your view of records by selecting that value from the Collections facet area.

System, Collections and Partners Facet Options

Next are the Resource Type and Visibility facets.  It is often helpful to limit to a specific resource type, like Maps, when you are doing your metadata editing so that you don’t see things that you aren’t interested in working with.  Likewise, for some kinds of metadata editing you want to focus primarily on items that are already viewable to the public and you don’t want the hidden records to get in the way.  You can do this with the Visibility facet.

Resource Type and Visibility Facet Options

Next we start getting into the new facet types that we added this summer to help identify records that need some metadata uplift.  We have the Date Validity, My Edits, and Location Data facets.

Date Validity is a facet that allows you to identify records that have dates that are not valid according to the Extended Date Time Format (EDTF).  Two different fields in a record are checked: the date field and the coverage field (which can contain dates).  If any of these values aren’t valid EDTF strings then we mark the whole record as having Invalid Dates.  You can use this facet to identify these records and go in and correct those values.
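As a rough illustration of the kind of check involved (our actual validation covers the full EDTF specification; this sketch only handles simple level-0 dates and intervals, so treat it as an under-approximation):

```python
import re

# Level-0 EDTF shapes: YYYY, YYYY-MM, YYYY-MM-DD, plus simple intervals
# like 1906/1975.  Real EDTF has many more features (uncertainty, seasons,
# open intervals, ...) that this sketch does not attempt.
EDTF_DATE = r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?"
EDTF_RE = re.compile(rf"^{EDTF_DATE}(/{EDTF_DATE})?$")

def has_invalid_dates(record_dates):
    """True if any date or coverage value fails the (simplified) EDTF check."""
    return any(not EDTF_RE.match(value) for value in record_dates)
```

A record flagged by a check like this is what lands in the Invalid Dates bucket of the facet.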

Next up is a facet for just the records that you have edited in the past.  This can be helpful for a number of reasons.  I use it from time to time to see if any of the records that I’ve edited have developed any issues, like dates that aren’t valid, since I last edited them.  It doesn’t happen often but it can be helpful.

Finally there is a section of Location Data facets.  These are helpful for identifying records that have, or don’t have, a Place Name, Place Point, or Place Box in the record, which is handy if you are working through a collection trying to add geographic information to the records.

Date Validity, My Edits, and Location Data Facet Options

The final set of facets are Recently Edited Records and Record Completeness.  The first, Recently Edited Records, is pretty straightforward: it is a listing of how many records have been edited in the past 24h, 48h, 7d, 30d, 180d, or 365d.  One note that causes a bit of confusion here: these are records that were edited by anyone in the given period of time.  It is often misunderstood as “your edits” in a given period, which isn’t true.  It is still very helpful, but it can give you some strange results if you think about it the other way.

The last facet is Record Completeness. We really have two categories: records that have a completeness of 1.0 (Complete Records) and records that are less than 1.0 (Incomplete Records).  This metric is calculated when the item is indexed in the system and is based on our notion of a minimally viable record.

Recently Edited Records and Record Completeness Facet Options
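The completeness score itself can be sketched roughly as the fraction of required fields that are populated. The field list below is an assumption for illustration; the actual UNTL calculation and its field weights differ.

```python
# Hypothetical sketch of a completeness score: the fraction of
# "minimally viable record" fields that are non-empty.  The real
# UNTL scoring uses its own field list and weighting.
REQUIRED_FIELDS = ["title", "description", "subject", "language",
                   "collection", "institution", "resourceType",
                   "format", "meta"]

def completeness(record):
    present = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return present / len(REQUIRED_FIELDS)

record = {field: ["some value"] for field in REQUIRED_FIELDS}
print(completeness(record))        # 1.0 -> "Complete Records" facet
del record["description"]
print(completeness(record) < 1.0)  # True -> "Incomplete Records" facet
```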

This finishes this post about the Search Dashboard for the UNT Libraries Digital Collections.  We have been working to build out this metadata environment for about the last eight years and have slowly refined it to fit the metadata creation and editing workflows that work for the widest number of folks here at UNT.  There are always improvements to make, and we have been steadily chipping away at them over time.

There are a few other things that we’ve been working on over the summer that I will post about in the next week or so, so stay tuned for more.

If you have questions or comments about this post, please let me know via Twitter.

Metadata Quality Interfaces: Element Count Dashboard

Next up in our review of the new metadata quality interfaces we have implemented this summer is our Element Count Dashboard.

The basics are that whenever we index metadata records in our Solr index, we count the number of instances of a given element, or a given element with a specific qualifier, and store those counts in the index.  This results in hundreds of fields holding the counts of element instances.
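A minimal sketch of that counting step might look like the following. The record shape and the generated field names (`description_count`, `description_physical_count`, and so on) are assumptions for illustration, not the exact names in our index.

```python
from collections import Counter

def element_counts(record):
    """Build per-element and per-qualifier count fields for a Solr doc.
    `record` maps an element name to a list of (qualifier, value) pairs;
    a missing qualifier is counted under "none".  (Sketch only; the
    production field names may differ.)"""
    doc = {}
    for element, instances in record.items():
        doc[f"{element}_count"] = len(instances)
        for qual, n in Counter(q or "none" for q, _ in instances).items():
            doc[f"{element}_{qual}_count"] = n
    return doc

record = {"description": [("content", "A photograph of downtown Denton."),
                          ("physical", "1 photograph : b&w.")]}
print(element_counts(record))
```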

We built an interface on top of these counts because we had a hunch that this information would help us identify problems in our metadata records.  It feels like I’m showing some things in our metadata that we probably don’t want to highlight, but it is all in the service of helping others understand.  So onward!

Element Count Dashboard

The dashboard is similar to other dashboards in the Edit system.  You have the ability to limit your view to just the collection, partner or system you are interested in working with.

Count Dashboard

From there you can select an element you are interested in viewing counts for.  In the example below I am interested in looking at the Description element or field.

Select an Element to View Counts

Once your selection is made you are presented with the number of instances of the description field per record.  This is a little more helpful if you know that in our metadata world, a nice clean record will generally have two description fields: one for a content description and one for a physical description of the item. More than two is usually strange, and none at all is usually bad.

Counts for Description Elements

To get a clearer view you can see the detail below.  This again is for the top level Description element where we like to have two descriptions.

Detail of Description Counts

You can also limit to a specific qualifier.  In the example below you see the counts of Description elements with a content qualifier.  The 1,667 records that have two Description elements with a content qualifier are pretty strange.  We should probably fix those.

Detail of Description Counts for Content Qualifier

Next we limit to just the physical description qualifier. You will see that there are a bunch of records that don’t have any sort of physical description, and then 76 that have two. We should fix both of those record sets.

Detail of Description Counts for Physical Qualifier

Because of the way we index things, we can also get at the Description elements that have neither a content nor a physical qualifier.  These are identified with a qualifier value of none.  You can see that 1,861,356 records have zero Description elements with a none qualifier.  That’s awesome.  You can also see 52 records with one element, and 261 with two elements, that are missing qualifiers.  That’s not awesome.

Detail of Description Counts for None Qualifier

I’m hoping you are starting to see how this kind of interface could be useful to drill into records that might look a little strange.  When you identify something strange, all you have to do is click on the number and you are taken directly to the records that match what you’ve asked for.  In the example below we see all 76 records that have two physical descriptions, because this is something we are interested in correcting.

Records with Multiple Duplicate Physical Qualifiers

If you open up a record to edit you will see that yes, in fact there are two Physical Descriptions in this record. It looks like the first one should actually be a Content Description.

Example of two physical descriptions that need to be fixed

Once we change that value we can hit the Publish button and be on our way fixing other metadata records.  The counts will update about thirty seconds later to reflect the corrections that you have made.

Fixed Physical and Content Descriptions

Even more of a good thing.

Because I think this is a little different than other interfaces you might be used to, it might be good to see another example.

This time we are looking at the Creator element in the Element Count Dashboard.

Creator Counts

You will see that there are 112 different counts, ranging from zero all the way up to far too many creators on a single item (silly physics articles).

I was curious to see what the counts looked like for Creator elements that were missing a role qualifier.  These are identified by selecting the none value from the qualifier dropdown.

Creator Counts for Missing Qualifiers

You can see that the majority of our records don’t have Creator elements missing the role qualifier, but there are a number that do.  We can fix those.  If you wanted to look at the records that have five Creator elements without a role, you would end up getting to records that look like the one below.

Example of Multiple Missing Types and Roles

You will notice that when a record has a problem, there are often multiple things wrong with it. In this case, not only is role information missing for each of these Creator elements, but name type information is missing as well.  Once we fix those we can move along and edit some more.

And a final example.

I’m hoping you are starting to see how this interface could be useful.  Here is another example if you aren’t convinced yet.  We are completing a retrospective digitization of theses and dissertations here at UNT.  Not only is this a large amount of digitization, it is also quite a bit of metadata that we are adding to both the UNT Digital Library and our traditional library catalog.   Let’s look at some of those records.

You can limit your dashboard view to the collection you are interested in working on.  In this case we choose the UNT Theses and Dissertations collection.

Next up we take a look at the number of Creator elements per record. Theses and dissertations are generally authored by just one person.  It would be strange to see counts other than one.

Creator Counts for Theses and Dissertations Collection

It looks like there are 26 records that are missing Creator elements and a single record that, for some reason, has two Creator elements.  This is strange and we should take a look.

Below you will see the view of the 26 records that are missing a Creator element.  Sadly, at the time of writing seven of these are visible to the public, so that’s something we really need to fix in a hurry.

Example Theses that are Missing Creators

That’s it for this post about our Element Count Dashboard.  I hope that you find this sort of interface interesting.  I’d be interested to hear if you have interfaces like this for your digital library collections or if you think something like this would be useful in your metadata work.

If you have questions or comments about this post, please let me know via Twitter.

Metadata Quality Interfaces: Facet Dashboard

This is the second post in a series that discusses the new metadata interfaces we have been developing for the UNT Libraries’ Digital Collections metadata editing environment. The previous post was related to the item views that we have created.

This post discusses our facet dashboard in a bit of depth.  Let’s get started.

Facet Dashboard

A little bit of background is in order so that you can better understand the data we are working with in our metadata system.  The UNT Libraries uses a locally-extended Dublin Core metadata element set. In addition to extending the element set to include things like collection, partner, degree, citation, note, and meta fields, we also qualify many of the fields. A qualifier usually specifies what type of value is represented: a Subject could be a Keyword or an LCSH value; a Creator could be an author or a photographer.  Many of the fields can carry a single qualifier on each value.

When we index records in our Solr instance we store strings for each of these elements, and for each element-plus-qualifier combination, so we have fields we can facet on.  This results in facet fields for creator as well as the more specific creator_author or creator_photographer.  For fields where we expect a qualifier, we also capture when one is missing, in a field like creator_none.  This results in many hundreds of fields in our Solr index, but we do it for good reason: to be able to get at the data in ways that are helpful for metadata maintainers.
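The indexing step described above might be sketched like this. The exact field-naming convention (`creator`, `creator_author`, `creator_none`) is taken from the examples in the text; the record shape is an assumption.

```python
def facet_fields(element, instances):
    """Build the facet string fields for one element.  `instances` is a
    list of (qualifier, value) pairs; a value with no qualifier is
    captured in an <element>_none field.  (Sketch with illustrative
    field names based on the examples above.)"""
    doc = {element: [value for _, value in instances]}
    for qualifier, value in instances:
        doc.setdefault(f"{element}_{qualifier or 'none'}", []).append(value)
    return doc

print(facet_fields("creator", [("author", "Smith, John"),
                               (None, "D & H Photo")]))
```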

The first view we created around this data was our facet dashboard.  The image below shows what you get when you go to this view.

Default Facet Dashboard

On the left side of the screen you are presented with facets that you can make use of to limit and refine the information you are interested in viewing.  I’m currently looking at all of the records from all partners and all collections.  This is a bit over 1.8 million records.

The next step is to decide which field you are interested in seeing the facet values for.  In this case I am choosing the Creator field.

Selecting a field to view facet values

After you make a selection you are presented with a paginated view of all of the creator values in the dataset (289,440 unique values in this case). These are sorted alphabetically, so the first values are generally the ones that start with punctuation.

In addition to the string value you are presented the number of records in the system that have that given value.

All Creator Values

Because there can be many, many pages of results, it is sometimes helpful to jump directly to a subset of the values.  This can be accomplished with the “Begins With” dropdown in the left menu.  I’m choosing to look at only facet values that start with the letter D.

Limit to a specific letter
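Solr supports this kind of prefix limiting out of the box, so a request like the following could plausibly back the “Begins With” dropdown. The field name `creator_facet` is illustrative, not necessarily what our index uses.

```python
# Sketch of the Solr query parameters behind a "Begins With" filter,
# using Solr's standard faceting parameters.
params = {
    "q": "*:*",
    "rows": 0,                    # we only want the facet counts, not docs
    "facet": "true",
    "facet.field": "creator_facet",
    "facet.prefix": "D",          # only facet values starting with "D"
    "facet.sort": "index",        # alphabetical rather than by count
    "facet.limit": 100,           # one page of values
}
```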

After making a selection you are presented with the facets that start with the letter D instead of the whole set.  This makes it a bit easier to target just the values you are looking for.

Creator Values Starting with D

Sometimes when you are looking at the facet values, you are trying to identify values that sort next to each other but differ only a little bit. One thing that makes this easier is a button that highlights just the whitespace within the strings themselves.

Highlight Whitespace Button

Once you click this button, the whitespace is highlighted in green.  This highlighting, in combination with a monospace font, makes it easier to see when values differ only in the amount of whitespace.

Highlighted Whitespace

Once you have identified a value that you want to change the next thing to do is just click on the link for that facet value.

Identified Value to Correct

You are taken to a new tab in your browser that has just the records that have the selected value.  In this case there was just one record with “D & H Photo” that we wanted to edit.

Record with Identified Value

We have convenient highlighting of visited rows on the facet dashboard so you know which values you have already clicked on.

Highlighted Reminder of Selected Value

In addition to just seeing all of the values for the creator field you can also limit your view to a specific qualifier by selecting the qualifier dropdown when it is available.

Select an Optional Qualifier

You can also look at items that don’t have a given value; for example, Creator values that don’t have a name type designated.  These are identified with a qualifier value of none-type.

Creator Values Without a Designated Type

You get just the 900+ values in the system that don’t have a name type designated.

All of this can be performed on any of the elements or any of the qualified elements of the metadata records.

While this is a useful first step in getting metadata editors directly to both the field values and their record counts in the form of facets, it can be improved upon.  This view still requires users to scan a very long list to identify values that should be collapsed because they are just different ways of expressing the same thing, with differences in spacing or punctuation. It is only possible to spot these values when they are located near each other alphabetically.  That is a problem for a field like a name field, which can hold inverted or non-inverted forms of names.  So there is room for improvement in these interfaces for our users.

Our next interface to talk about is our Count Dashboard.  But that will be in another post.

If you have questions or comments about this post, please let me know via Twitter.




Updating Metadata Interfaces: Item Views

As we get started with a new school year it is good to look back on all of the work that we accomplished over the summer.

There are a few reasons we are interested in improving our metadata entry systems. First, as we continue to add records and approach our two-millionth item in the system, it is clear that effective management of metadata is important. We also see an increase in the resources being allocated to editing metadata in our digital library systems: we are at a point where more people use the backend systems for non-MARC metadata than work with the catalog and MARC-based metadata. Our metadata workers spend more and more time in this system, so it is important that we make it better and let them complete their tasks more easily.  This improves quality, costs, and our workers’ sanity.

This blog post is just a quick summary of some of the changes that we have made around item records in the UNT Libraries Digital Collection’s Edit System.

Dashboard View

Not much has really changed with the edit dashboard other than making room for the views that I’m going to talk about later in this post.  Historically, clicking either the title or the thumbnail of a record took the editor to the edit view.  In fact, edit was pretty much the only view of an item’s record.

While editing or creating a record is the primary activity in this system, there are many times when you want to look at records in different ways.  This previously wasn’t possible without a little URL hacking.

Now when you click on the thumbnail, title, or Summary button you are taken to the summary page that I will talk about next.  If you click the Edit button you are taken to the edit view.

Edit Dashboard

Record Summary View

We wanted to add a new landing page for an item in our editing system so that a user could just view a record instead of always having to jump into the edit window.  There are a number of reasons for this.  First, the edit view isn’t the easiest place to see what is going on in a metadata record: it is designed for editing fields and does not give you a succinct view of the record.  Second, opening a record in the edit view locks it for a period of time.  Even if you just open it and leave it alone, it will be locked in the system for about half an hour.  This doesn’t cause too many issues, but it isn’t ideal. Finally, having only an edit view resulted in a high number of “edits” that really weren’t edits at all; users were just hitting Publish in order to clear out the record and close it.  Because we version all of our metadata changes, this just adds versions that don’t represent much in the way of change between records.

We decided to introduce a summary view for each record.  This gives us a place to provide an overview of the state of an item record, a succinct metadata view, and a logical home for links to other important record views.

The image below is the top portion of a summary view for a metadata record in the system. I will go through some of the details below.

Record Summary

The top portion of the summary view gives a large image that represents the item.  This is usually the first page of the publication, the front of a photograph or map (we scan the backs of everything), or a thumbnail view of a moving image item.  Next to that you will see a large title to easily identify the title of the item.

Below that is a quick link to edit the record in the edit view.  If the item is visible to the public, you can jump to the live view of the record with the “View in the Digital Library” link. Next is a “View Item” link that takes you to a viewer that lets you page through the object even if it isn’t online yet; this item view is used during the creation of metadata records.  Finally there is a “View History” link that takes you to an overview of the item’s history, so you can see when things changed and who changed them.

Below these are some quick visual indicators: whether the item is public, whether it is currently unlocked and able to be edited by the user, whether the metadata has a completeness score of 1.0 (a minimally viable record), and whether all of the dates in the item are valid Extended Date Time Format dates.

This is followed by the number of unique editors that have edited the record, and the username of the last editor of the record.  Finally the date the item was last edited, and when it was added to the system are shown.

Record Interactions

Since we keep the version history of all changes to a metadata record over time, we wanted to give an idea of the lifecycle of the record.  A record can go back and forth between hidden and public states, and sometimes back to hidden.  We decided a simple timeline would be a good way to visualize these different states over time.

Record Timeline

The final part of the summary view is the succinct metadata display.  This is helpful for getting a quick overview of a record.  The layout is consistent across fields and records of different types, and it will print to about a page if you need it on paper (something that you really do need to do from time to time).

Succinct Record Display

History View

We have had a history view for each item for a number of years, but until this summer it was only available if you knew to add /history/ to the end of a URL in the edit system.  The new summary page gave us a logical place to link to it.

The only modification we’ve made to the page is a bit of coloring when a change results in a size difference in the record: blue for growth and orange for a reduction in size. There are a few more changes I would like us to make to the history view, which we will probably work on in the fall.  The main thing I want to add is more information about what changed in the record, which fields for instance.  That is very helpful when trying to track down oddities in records.

Record History

Title Case Helper

This last set of features is pretty small but is actually a big help when needed.  We work with quite a bit of harvested data and metadata from different systems that we add to our digital collections. These datasets sometimes take different views of when to capitalize and when not to. We have a collection from the federal government that has all of the titles and names in uppercase.  Locally we tend to recommend a fairly standard title case for things like titles, and names also tend to follow this pattern.

We added some JavaScript helpers that identify when a title, creator, contributor, or publisher value is in uppercase and present the user with a warning message.  We actually flag values where more than 50% of the letters are capitals. The warning doesn’t keep a person from saving the record; it just gives them a visual clue that there is something they might want to change.

Title Case Warnings
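The heuristic itself is simple. Here it is re-implemented in Python as a sketch (the production helper is JavaScript): warn when more than half of the letters in a value are capitals.

```python
def mostly_uppercase(value, threshold=0.5):
    """Return True when more than `threshold` of the letters in the
    value are capitals -- the warning heuristic described above.
    (Sketch; the production JavaScript may differ in details.)"""
    letters = [c for c in value if c.isalpha()]
    if not letters:
        return False
    capitals = sum(1 for c in letters if c.isupper())
    return capitals / len(letters) > threshold

print(mostly_uppercase("ANNUAL REPORT OF THE COMPTROLLER"))  # True
print(mostly_uppercase("Annual Report of the Comptroller"))  # False
```

Counting only letters means digits and punctuation don’t skew the ratio for values like call numbers or dates.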

After the warning, we wanted to make it easier for users to convert a string from uppercase to title case. If you haven’t tried this recently, it is actually pretty time consuming and generally results in retyping the value instead of changing it letter by letter. We decided that a button that automatically converts the value into title case would save quite a bit of time.  The image below shows where this TC button is located for the title, creator, contributor, and publisher fields.

Creator Title Case Detail

Once you click the button it will change the main value to something that resembles title case.  It has some logic to deal with short words that are generally not capitalized like: and, of, or, the.

Corrected Title Case Detail
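The conversion logic can be sketched as follows. The small-word list here is an assumption for illustration; the actual helper is JavaScript and its word list may differ.

```python
# Words that stay lowercase unless they start the string; this list
# is an assumption, not the exact one the TC button uses.
SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in",
               "of", "on", "or", "the", "to"}

def title_case(value):
    """Sketch of the TC-button logic: lowercase everything, then
    capitalize each word except known small words mid-string.  Note
    that abbreviations lose their capitalization, the limitation
    mentioned below."""
    words = value.lower().split()
    return " ".join(
        word if (word in SMALL_WORDS and i > 0) else word.capitalize()
        for i, word in enumerate(words)
    )

print(title_case("MAP OF THE CITY OF DENTON"))  # Map of the City of Denton
```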

This saves quite a bit of time but isn’t perfect.  Abbreviations in the string lose their capitalization, so you sometimes have to touch things up after hitting the TC button.  Even so, it helps with a pretty fiddly task.

That covers many of the changes that we made this summer to the item and record views in our system.  There are a few more interfaces that we added to our edit system that I will try to cover in the next week or so.

If you have questions or comments about this post, please let me know via Twitter.

How do metadata records change over time?

Since September 2009, the UNT Libraries has been versioning the metadata edits that happen in the digital library system that powers The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.  In those eight years the collection has grown from a modest 66,000 digital objects to the 1,814,000 digital objects that we manage today.  We’ve always tried to think of the metadata in our digital library as a constantly changing dataset; just how much it changes is something we don’t always pay attention to.

In 2014 a group of us worked on a few papers about metadata change at a fairly high level in the repository. How Descriptive Metadata Changes in the UNT Libraries Collections: A Case Study reported on the analysis of almost 700,000 records that were in the repository at that time.  Another study, Exploration of Metadata Change in a Digital Repository, was presented the following year, in 2015, by colleagues in the UNT College of Information; it used a smaller sample of records to answer a few more questions about what changes in descriptive metadata at the UNT Libraries.

It has been a few years since these studies so it is time again to take a look at our metadata and do a little analysis to see if anything pops out.

Metadata Edit Dataset

The dataset we are using for this analysis was generated on May 4th, 2017 by creating a copy of all of the metadata records and their versions to a local filesystem for further analysis. The complete dataset is for 1,811,640 metadata records.

Of those 1,811,640 metadata records, 683,933 had been edited at least once since they were loaded into the repository.  That means 62% of the records have just one instance (no changes) in the system, and the other 38% have at least one edit.

Records Edited in Dataset

We store all of our metadata on the filesystem as XML files using a local metadata format we call UNTL.  When a record is edited, the old version of the record is renamed with a version number and the new version of the record takes its place as the current version of a record. This has worked pretty well over the years for us and allows us to view previous versions of metadata records through a metadata history screen in our metadata system.

UNT Metadata History Interface
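The rename-on-edit scheme described above can be sketched in a few lines. The exact version-suffix convention here (`file.0`, `file.1`, …) is an assumption for illustration.

```python
import os
import tempfile

def write_versioned(path, content):
    """Sketch of the rename-on-edit scheme: the current file keeps its
    name; on edit, the old copy is renamed with the next free version
    number, and the new content takes its place.  (The actual suffix
    convention in our system may differ.)"""
    if os.path.exists(path):
        version = 0
        while os.path.exists(f"{path}.{version}"):
            version += 1
        os.rename(path, f"{path}.{version}")
    with open(path, "w") as handle:
        handle.write(content)

with tempfile.TemporaryDirectory() as tmp:
    record = os.path.join(tmp, "metadc58589.untl.xml")
    write_versioned(record, "<metadata>version 1</metadata>")
    write_versioned(record, "<metadata>version 2</metadata>")
    print(sorted(os.listdir(tmp)))  # current file plus one numbered version
```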

This metadata history view is helpful for tracking down strange things that happen in metadata systems from time to time.  Because some records are edited multiple times (like in the example screenshot above) we end up with a large number of metadata edits that we can look at over time.

After staging all of the metadata records on a local machine, I wrote a script that compares two versions of a record and outputs which elements changed. While this sounds like a pretty straightforward thing to do, there are some fiddly bits to watch out for that I will probably cover in a separate blog post. Most of these have to do with XML as a serialization format and some questions of interpretation.  As a quick example, think about these three notations.

<title />
<title></title>
<title qualifier='officialtitle'></title>

When comparing fields, should those three examples all mean the same thing as far as a metadata record is concerned?  But like I said, something to get into in a later post.
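A minimal sketch of such a comparison, choosing one possible answer (treat all empty-title notations as equivalent by ignoring empty elements), might look like this. The record markup is simplified; the real script handles many more edge cases.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def element_values(xml_text):
    """Collect (qualifier, text) pairs per top-level element, treating
    empty elements as absent -- so all three empty-title notations
    above compare as equal.  (Sketch only.)"""
    values = defaultdict(set)
    for element in ET.fromstring(xml_text):
        text = (element.text or "").strip()
        if text:
            values[element.tag].add((element.get("qualifier"), text))
    return values

def changed_elements(old_xml, new_xml):
    """Return the names of elements whose value sets differ."""
    old, new = element_values(old_xml), element_values(new_xml)
    return sorted(tag for tag in set(old) | set(new)
                  if old.get(tag, set()) != new.get(tag, set()))

r1 = "<metadata><title>Old title</title><subject>dogs</subject></metadata>"
r2 = "<metadata><title>New title</title><subject>dogs</subject></metadata>"
print(changed_elements(r1, r2))  # ['title']
```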

Once I had my script to compare two records, the next step was to create pairs of record versions and iterate over all of those pairs.  This resulted in 1,332,936 edit events that I could look at.  I created a JSON document for each edit event and loaded them into Solr for later analysis.  Here is what one of these records looks like.

{
  "change_citation": 0,
  "change_collection": 0,
  "change_contributor": 1,
  "change_coverage": 0,
  "change_creator": 0,
  "change_date": 0,
  "change_degree": 0,
  "change_description": 0,
  "change_format": 0,
  "change_identifier": 1,
  "change_institution": 0,
  "change_meta": 1,
  "change_note": 0,
  "change_primarySource": 0,
  "change_publisher": 0,
  "change_relation": 0,
  "change_resourceType": 0,
  "change_rights": 0,
  "change_source": 0,
  "change_subject": 0,
  "change_title": 0,
  "collections": […],
  "completeness_change": 0,
  "content_length_change": 12,
  "creation_to_edit_seconds": 123564535,
  "edit_number": 1,
  "elements_changed": 3,
  "id": "metadc58589_2015-10-16T11:02:09Z",
  "institution": […],
  "metadata_creation_date": "2011-11-16T07:33:14Z",
  "metadata_edit_date": "2015-10-16T11:02:09Z",
  "metadata_editor": "htarver",
  "r1_ark": "ark:/67531/metadc58589",
  "r1_completeness": 0.9830508474576272,
  "r1_content_length": 2108,
  "r1_record_length": 2351,
  "r2_ark": "ark:/67531/metadc58589",
  "r2_completeness": 0.9830508474576272,
  "r2_content_length": 2120,
  "r2_record_length": 2543,
  "record_length_change": 192,
  "systems": "DC"
}

Some of the fields don’t mean much for now, but the main fields we want to look at are the change_* fields.  These represent the 21 metadata elements that we use here in the UNTL metadata format.  Here they are in a more compact view.

  • title
  • creator
  • contributor
  • publisher
  • date
  • description
  • subject
  • primarySource
  • coverage
  • source
  • citation
  • relation
  • collection
  • institution
  • rights
  • resourceType
  • format
  • identifier
  • degree
  • note
  • meta

You may notice that these elements include the 15 Dublin Core elements plus six other fields that we’ve found useful to have in our element set.

The first thing I wanted to answer was which of these 21 fields was edited the most across the 1.3 million record edits that we have.

Metadata Element Changes

You can see that the meta field changes in almost 100% of the edits.  That is because whenever you edit a record, the values for the most recent metadata editor and the edit time are updated, so this element should change with every edit.

I have to admit that I was surprised that the description field was the most edited descriptive field.  403,713 (30%) of the edits changed the description field in some way. This is followed by title at 304,396 (23%) and subject at 272,703 (20%).

There are a number of other things that I will be doing with this dataset as I move forward. In addition to which fields changed, I should be able to look at how many fields change per edit on average.  I then want to see if there are any noticeable differences across subsets, like specific editors or collections.

So if you are interested in metadata change stay tuned.

If you have questions or comments about this post, please let me know via Twitter.

Compressibility of the DPLA Creator Field by Hub

This is the second post in a series of posts exploring the metadata from the Digital Public Library of America.

In the first post I introduced the idea of using compressibility of a field as a measure of quality.

In this post I want to look specifically at the dc.creator field in the DPLA metadata dataset.

DC.Creator Overview

The first thing to do is to give you an overview of the creator field in the DPLA metadata dataset.

As I mentioned in the last post, there are a total of 15,816,573 records in the dataset I’m working with.  These records are contributed by a wide range of institutions across the US through Hubs.  There are 32 hubs present in the dataset, plus 102 records that for one reason or another aren’t associated with a hub and have “None” for the hub name.

In the graph below you can see how the number of records are distributed across the different hubs.

Total Records by Hub

These are similar numbers to what you see in the more up-to-date numbers on the DPLA Partners page.

The next chart shows how the number of records per hub and the number of records with creator values compare.

Total Records and Records with Creators by Hub

You should expect that the red columns in the chart above will most often be shorter than the blue columns.

Below is a slightly different way of looking at the same data: this time it is the percentage of records that contain a creator.

Records with Creator to Total Records

You see that a few of the hubs have almost 100% of their records with a creator, while others have a very low percentage of records with creators.

Looking at the number of records that have a creator value and the total number of names, you can see that some hubs, like hathitrust, have pretty much a one-to-one name-to-record ratio, while others, like nara, have multiple names per record.

Total Creators and Name Instances

To get an even better sense of this you can look at the average creators/names per record. In this chart you see that david_rumsey has 2.49 creators per record, followed by nara at 2.03, bhl at 1.78, and internet_archive at 1.70. Quite a few hubs (14) have very close to one name per record on average.

Average Names Per Record

The next thing to look at is the number of unique names per hub.  The hathitrust hub sticks out again with the most unique names for a hub in the DPLA.

Unique Creators by Hub

Looking at the ratio between the number of creator instances and the number of unique names, you can see something interesting happening with the nara hub.  I put the chart below on a logarithmic scale so you can see things a little better.  Notice that nara has a 1,387:1 ratio of creator instances to unique creators.

Creator to Unique Ratio

One way to interpret this is that hubs with a higher ratio have more records sharing the same name/creator value among records.


Now that we have an overview of the creator field as a whole we want to turn our attention to the compressibility of each of the fields.

I decided to compare the results of four different hashing algorithms: lowercase hash, normalize hash, fingerprint hash, and aggressive fingerprint hash. Below is a table that shows the number of unique values for the field after each of the values has been hashed.  You will notice that as you read from left to right the numbers go down.  This reflects the increasing aggressiveness of the hashing algorithms being used.

Hub Unique Names Lowercase Hash Normalize Hash Fingerprint Hash Aggressive Fingerprint Hash
artstor 7,552 7,547 7,550 7,394 7,304
bhl 44,936 44,927 44,916 44,441 42,960
cdl 47,241 46,983 47,209 45,681 44,676
david_rumsey 8,861 8,843 8,859 8,488 8,375
digital-commonwealth 32,028 32,006 32,007 31,783 31,568
digitalnc 31,016 30,997 31,006 30,039 29,730
esdn 22,401 22,370 22,399 21,940 21,818
georgia 21,821 21,792 21,821 21,521 21,237
getty 2,788 2,787 2,787 2,731 2,724
gpo 29,900 29,898 29,898 29,695 29,587
harvard 4,865 4,864 4,855 4,845 4,829
hathitrust 876,773 872,702 856,703 838,848 780,433
il 16,014 15,971 15,983 15,569 15,409
indiana 6,834 6,825 6,832 6,692 6,650
internet_archive 105,381 105,302 104,820 102,390 99,729
kdl 3,098 3,096 3,098 3,083 3,066
mdl 69,617 69,562 69,609 69,013 68,756
michigan 2,725 2,715 2,723 2,676 2,675
missouri-hub 5,160 5,154 5,160 5,070 5,039
mwdl 49,836 49,724 49,795 48,056 47,342
nara 1,300 1,300 1,300 1,300 1,249
None 21 21 21 21 21
nypl 24,406 24,406 24,388 23,462 23,130
pennsylvania 10,350 10,318 10,349 10,056 9,914
scdl 11,976 11,823 11,973 11,577 11,368
smithsonian 67,941 67,934 67,826 67,242 65,705
the_portal_to_texas_history 28,686 28,653 28,662 28,154 28,066
tn 2,561 2,556 2,561 2,487 2,464
uiuc 3,524 3,514 3,522 3,470 3,453
usc 10,085 10,061 10,071 9,872 9,785
virginia 3,732 3,732 3,732 3,731 3,681
washington 12,674 12,642 12,669 12,184 11,659
wisconsin 19,973 19,954 19,960 19,359 19,127

Next I will work through each of the hashing algorithms and look at the compressibility of each field after the given algorithm has been applied.
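Compressibility here can be read as the fraction of unique values that a hashing algorithm eliminates. A minimal sketch (the function name is mine), using the hathitrust numbers from the table above:

```python
def compressibility(unique_values, unique_hashes):
    # Fraction of unique values removed by hashing (space-savings form).
    return 1 - unique_hashes / unique_values

# hathitrust: 876,773 unique names vs. 856,703 normalize-hash values
round(compressibility(876773, 856703), 3)  # about 0.023, i.e. 2.3%
```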

Lowercase Hash: This hashing algorithm converts all uppercase characters to lowercase and leaves lowercase characters unchanged.  The result is generally a very low amount of compressibility for each of the hubs.  You can see this in the chart below.

Lowercase Hash Compressibility

Normalize Hash: This hash just converts characters down to their ASCII equivalents.  For example, it converts gödel to godel.  The compressibility results of this hashing function are quite a bit different from the lowercase hash above.  You see that hathitrust has 2.3% compressibility of its creator names.
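In Python, a normalize hash along these lines can be sketched with the standard library (this is my approximation, not the exact code used in the analysis):

```python
import unicodedata

def normalize_hash(value):
    # Decompose accented characters, then drop the combining marks,
    # converting extended western characters to their ASCII equivalents.
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

normalize_hash("gödel")  # -> "godel"
```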

Normalize Hash Compressibility

Fingerprint Hash: This uses the algorithm that OpenRefine describes in depth here.  The algorithm incorporates both a lowercase step and a normalize step in its overall process.  You can see that there is a bit more consistency between the different compressibility values.

Fingerprint Hash Compressibility

Aggressive Fingerprint Hash: This algorithm takes the basic fingerprint algorithm described above and adds one more step: removing pieces of the name that are only numbers, such as dates.  This hashing function will most likely have more false positives than any of the previous algorithms, but it is interesting to look at the results.
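That extra step can be sketched as a token filter applied after the basic fingerprint (the function name and the example input are mine; the input imitates a fingerprinted name that still carries date tokens):

```python
def strip_numeric_tokens(fingerprinted):
    # Drop tokens that are purely digits, such as the year fragments
    # left behind after punctuation is removed from a date range.
    return " ".join(t for t in fingerprinted.split() if not t.isdigit())

strip_numeric_tokens("1858 1924 a abbott herdman sir w william")
# -> "a abbott herdman sir w william"
```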

Aggressive Fingerprint Hash Compressibility

This final chart puts together the four previous charts so they can be compared a bit easier.

All Compressibility


So now we’ve looked at the compressibility of the creator fields for each of the 32 hubs that make up the DPLA.

I’m not sure that I have any good takeaways so far in this analysis. I think there are a few other metrics that we should look at before we start saying if this information is or isn’t useful as a metric of metadata quality.

I do know that I was surprised by the compressibility of the hathitrust creators. This is especially interesting when you consider that the source for most of those records is MARC-based catalog records that in theory should be backed by some sort of authority record. Other hubs, especially the service hubs, tend not to have records based as heavily on authority records.  Not really groundbreaking, but interesting to see in the data.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Metadata Fun: Compression as a measure of data quality

This past week I had the opportunity to participate in an IMLS-funded workshop about managing local authority records, hosted by Cornell University at the Library of Congress.  It was two days of discussion about issues related to managing local and aggregated name authority records. This meeting got me thinking more about names in our digital library metadata, both locally (at UNT) and in aggregations (DPLA).

It has been a while since I worked on a project with the DPLA metadata dataset that they provide for bulk download so I figured it was about time to grab a copy and poke around a bit.

This time around I’m interested in looking at some indicators of metadata quality.  Loosely, it is a measure of how well a set of metadata conforms to itself.  Specifically, I want to look at how name values from the dc.creator, dc.contributor, and dc.publisher fields compare with each other.

I’ll give a bit of an overview to get us started.

Say we had these four values in a set of metadata for the dc.creator of an awesome movie.

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we sort these values, make them unique, and then count the instances, we will get the following.

1  Alexander Johan Hjalmar Skarsgard
1  Alexander Johan Hjalmar Skarsgård
1  Skarsgard, Alexander Johan Hjalmar
1  Skarsgård, Alexander Johan Hjalmar

So we have 4 unique name strings in our dataset.

If we applied a normalization algorithm that turned the letter å into an a and then tried to make our data unique we would end up with the following.

2  Alexander Johan Hjalmar Skarsgard
2  Skarsgard, Alexander Johan Hjalmar

Now we have only two name strings in the dataset, each with an instance count of two.

We can measure the compression rate by taking the original number of instances and dividing it by this new number.  4/2 = 2 or a 2:1 compression rate.

Another way to do it is to get the amount of space saved with this compression.  This is just a different equation.  1 – 2/4 = 0.5 or a 50% space savings.
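Both measures above are one-liners; a quick sketch:

```python
def compression_ratio(before, after):
    # 4 unique values reduced to 2 gives 2.0, read as a 2:1 ratio.
    return before / after

def space_savings(before, after):
    # 4 unique values reduced to 2 gives 0.5, i.e. a 50% savings.
    return 1 - after / before
```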

If we apply an algorithm similar to the one that OpenRefine uses and calls a “fingerprint” we can get the following from our first four values.

4 alexander hjalmar johan skarsgard

Now we’ve gone from four values down to one for a 4:1 compression rate or we’ve created a 75% space savings.

Relation to Quality

When we go back to our first four examples, we can come to the opinion pretty quickly that these are most likely supposed to be the same name.

Alexander Johan Hjalmar Skarsgård
Skarsgård, Alexander Johan Hjalmar
Alexander Johan Hjalmar Skarsgard
Skarsgard, Alexander Johan Hjalmar

If we saw this in our databases we would want to clean these up.  They would most likely lead to poor faceting in our discovery interface.  If a user wanted to find other items that had a dc.creator of Skarsgård, Alexander Johan Hjalmar, it is possible that they wouldn’t find any of the other three items when they clicked on a link to show more.

If we can agree that reducing the number of “near matches” in the dataset is an improvement, we might be able to use these data compression measures as a way of identifying which parts of a digital library might have consistency problems.

That’s exactly what I’m proposing to do here.  I want to find out if we can use a number of different algorithms on the values of dc.creator, dc.contributor, and dc.publisher in the DPLA metadata set and see how much these values compress the data.

Preparing the Data

I’m going to start with the all.json.gz file from the DPLA’s bulk metadata download page.

This file is a very large json file containing 15,816,573 records from the April 2017 DPLA metadata dump.

The first thing that I want to do is reduce this dataset, which is 6.1GB compressed, to something a little more manageable.  I will start with the dc.creator information.  I will use a set of commands for the wonderful tool jq that gets me what I’m wanting.

jq -nc --stream '. | fromstream(1|truncate_stream(inputs)) | {provider: ._source.provider["@id"], id: ._source.id, creator: ._source.sourceResource.creator?}'

The command I used above will transform each of the records in the DPLA dataset into something that looks like this:

{"provider":"","id":"bcae15d47f2544caf0407b1e17bf97cd","creator":["Harlow, G","Rogers, J"]}
{"provider":"","id":"96cab3354d942e7ea2030f1452f5beb8","creator":["Drummond, S","Ridley, W"]}
{"provider":"","id":"e3ce5090d0a8b3c247c84d6f0d5ff16e","creator":["Barber, J.T","Cardon, A"]}

This is now a large file with one small snippet of JSON on each line.  I can write straightforward Python scripts to process these lines and do some of the heavy lifting for the analysis.
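One of those straightforward scripts might look like this (a sketch, assuming the field names shown in the jq output above):

```python
import json

def iter_creators(path):
    # Yield every creator string from a JSON-lines file of
    # {"provider": ..., "id": ..., "creator": [...]} records,
    # skipping records that have no creator value.
    with open(path) as fh:
        for line in fh:
            record = json.loads(line)
            creators = record.get("creator")
            if not creators:
                continue
            if isinstance(creators, str):
                creators = [creators]
            yield from creators
```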

For this first pass I’m interested in all of the dc.creators in the whole DPLA dataset to measure the overall compression.

Here is a short set of these values.

Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Committee on Merchant Marine and Fisheries
Herdman, W. A. Sir, (William Abbott), 1858-1924
United States. Committee on Merchant Marine and Fisheries
Henderson, Joseph C
Fancher Creek Nurseries
Roeding, George Christian, 1868-1928
Henry G. Gilbert Nursery and Seed Trade Catalog Collection
United States. Animal and Plant Health Inspection Service
United States. Bureau of Entomology and Plant Quarantine
United States. Plant Pest Control Branch
United States. Plant Pest Control Division

The full list is 10,413,292 lines long when I ignore record instances that don’t have any value for creator.

The next thing to do is sort that list and make it unique which leaves me 1,445,688 unique creators in the DPLA metadata dataset.

Compressing the Data

For the first pass through the data I am going to use the “fingerprint algorithm” that OpenRefine describes in depth here.

The basics are as follows (from OpenRefine’s documentation)

  • remove leading and trailing whitespace
  • change all characters to their lowercase representation
  • remove all punctuation and control characters
  • split the string into whitespace-separated tokens
  • sort the tokens and remove duplicates
  • join the tokens back together
  • normalize extended western characters to their ASCII representation (for example “gödel” → “godel”)

If you’re curious, the code that performs this is in OpenRefine is here.
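The steps above can be sketched in Python (my approximation of the algorithm, not OpenRefine's actual Java implementation):

```python
import re
import unicodedata

def fingerprint(value):
    value = value.strip().lower()              # trim whitespace, lowercase
    value = re.sub(r"[^\w\s]", "", value)      # drop punctuation/control chars
    decomposed = unicodedata.normalize("NFKD", value)
    value = decomposed.encode("ascii", "ignore").decode("ascii")  # to ASCII
    tokens = sorted(set(value.split()))        # tokenize, dedupe, sort
    return " ".join(tokens)                    # join back together

fingerprint("Skarsgård, Alexander Johan Hjalmar")
# -> "alexander hjalmar johan skarsgard"
```

Run against the four example values from earlier in the post, all of them collapse to that single hash.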

The next steps are to run this fingerprinting algorithm on each of the 1,445,688 creators, sort the resulting hash values, make them unique, and count the remaining lines.  This gives you the new number of unique creators based on the fingerprint algorithm.

I end up with 1,365,922 unique creator values based on the fingerprint.

That comes to a reduction of 5.52% of the unique values.

To give you an idea of what this looks like in practice, there are eleven different creator instances that have the fingerprint of “akademiia imperatorskaia nauk russia”.

  • Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskai͡a akademii͡a nauk (Russia)
  • Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)
  • Imperatorskaia akademiia nauk (Russia)
  • Imperatorskai͡a akademïi͡a nauk (Russia)
  • Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia)
  • Imperatorskai͡a akademīi͡a nauk (Russia)
  • Imperatorskaia akademīia nauk (Russia)

These 11 different versions of this name are distributed among five different DPLA Hubs.

Below is a table showing how the different versions are distributed across hubs.

Name Records bhl hathitrust internet_archive nypl smithsonian
Imperatorskai︠a︡ akademī︠ia︡ nauk (Russia) 1 0 1 0 0 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) 13 0 11 2 0 0
Imperatorskai͡a akademīi͡a nauk (Russia) 7 0 7 0 0 0
Imperatorskai͡a akademii͡a nauk (Russia) 3 0 3 0 0 0
Imperatorskai͡a͡ akademīi͡a͡ nauk (Russia) 1 0 1 0 0 0
Imperatorskaia akademīia nauk (Russia) 13 0 0 0 0 13
Imperatorskaia akademiia nauk (Russia) 4 0 0 0 4 0
Imperatorskai͡a akademïi͡a nauk (Russia) 1 0 1 0 0 0
Imperatorskai︠a︡ akademīi︠a︡ nauk (Russia) 11 0 11 0 0 0
Imperatorskai͡a akademīi͡a nauk (Russia) 13 0 13 0 0 0
Imperatorskaia akademīia nauk (Russia) 211 211 0 0 0 0

When you look at the table you will see that bhl, internet_archive, nypl, and smithsonian each have a preferred way of representing this name.  Hathitrust, however, has eight different ways of representing this single creator name in its dataset.

Next Steps

This post hopefully introduced the idea of using “field compressions” for name fields like dc.creator, dc.contributor, and dc.publisher as a way of looking at metadata quality in a dataset.

We calculated the amount of compression using OpenRefine’s fingerprint algorithm for the DPLA creator fields.  This ends up being 5.52% compression.

In the next few posts I will compare the different DPLA Hubs to see how they compare with each other.  I will probably play with a few different algorithms for creating the hash values I use.  Finally I will calculate a few metrics in addition to just the unique values (cardinality) of the field.

If you have questions or comments about this post,  please let me know via Twitter.