
Metadata Quality Interfaces: Cluster Dashboard (OpenRefine Clustering Baked Right In)

This is the last of the updates from our summer’s activities in creating new metadata interfaces for the UNT Libraries Digital Collections. If you are interested in the others in this series, you can view the past few posts on this blog, where I talk about our facet, count, search, and item interfaces.

This time I am going to talk a bit about our Cluster Dashboard. This interface took a little bit longer than the others to complete. Because of this, we are just rolling it out this week, but it is still before Autumn, so I’m calling it a Summer interface.

I warn you that there are going to be a bunch of screenshots here, so if you don’t like those, you probably won’t like this post.

Cluster Dashboard

For a number of years I have been using OpenRefine to work with spreadsheets of data before we load them into our digital repository. This tool has a number of great features that help you get an overview of the data you are working with, as well as identify some problem areas that you should think about cleaning up. The feature that I have always felt was the most interesting is its data clustering interface. The idea of this interface is that you choose a facet (dimension, column) of your data and then group like values together. There are a number of ways of doing this grouping, and for an in-depth discussion of those algorithms I will point you to the wonderful OpenRefine Clustering documentation.

OpenRefine is a wonderful tool for working with spreadsheets (and a whole bunch of other types of data), but there are a few challenges that you run into when you are working with data from our digital library collections. First of all, our data generally isn’t rectangular; it doesn’t fit easily into a spreadsheet. We have some records with one creator, and some records with dozens of creators. There are ways to work with these multiple values, but things get complicated. The bigger challenge is that while many systems can export their data as a spreadsheet, very few of them (our system included) have a way of importing changes back in a spreadsheet format. You could pull data from the system and clean it up in OpenRefine, but when you were ready to put it back, there was no way to get that nice clean data into the system. You could still use OpenRefine to identify records to change and then go back into the system to change those records by hand, but that is far from ideal.

So how did we overcome this? We wanted to use the OpenRefine clustering but couldn’t get data easily back into our system.  Our solution?  Bake the OpenRefine clustering right into the system.  That’s what this post is about.

The first thing you see when you load up the Cluster Dashboard is a quick bit of information about how many records, collections, and partners your clustering will draw values from. This is helpful for understanding the scope of what you are clustering, both because it explains why generating clusters might take a while, and because it is generally better to run these clustering tools over the largest sets of data you can, since that pulls in variations from many different records. Other than that, you are presented with a pretty standard dashboard interface from the UNT Libraries’ Edit System. You can limit to subsets of records with the facets on the left side, and the number of items you cluster over will change accordingly.

Cluster Dashboard

The next thing you will see is a little help box below the clustering stats. It explains how to use the Cluster Dashboard and gives a little more information about how the different algorithms work. Metadata folks generally like to know the fine details of how the algorithms work, or at least to be able to find that information if they want it later.

Cluster Dashboard Help

The first thing you do is select a field/element/facet that you are interested in clustering. In the example below I’m going to select the Contributor field.

Choosing an Element to Cluster

Once you make a selection you can further limit it to a qualifier; in this case you could limit it to just the Contributors that are organizations, or Contributors that are Composers. As I said above, using more data generally works better, so we will just run the algorithms over all of the values. Next you have the option of choosing an algorithm for your clustering. We recommend that people start with the default Fingerprint algorithm because it is a great starting point. I will discuss the other algorithms later in this post.

Choosing an Algorithm

After you select your algorithm, you hit submit and things start working. You are given a screen with a spinner that tells you the clusters are generating.

Generating Clusters

Depending on your dataset size and the number of unique values of the selected element, you could get your results back in a second or in dozens of seconds. The general flow after you hit submit is to query the Solr backend for all of the facet values and their counts. These values are then processed with the chosen algorithm, which creates a “key” for each value. Another way to think about it is that the values are placed into buckets that group similar values together. Some calculations are performed on the clusters, and then the results are cached by the system for about ten minutes. After you wait for the clusters to generate the first time, they come back much more quickly for the next ten minutes.
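To make that flow concrete, here is a minimal sketch of the grouping step in Python. It assumes you already have the (value, count) pairs from a Solr facet query and a keying function like the ones described later in this post; the names are illustrative rather than our production code.

```python
from collections import defaultdict

def build_clusters(facet_values, key_func):
    """Group (value, record_count) pairs into clusters by key collision.

    facet_values: list of (value, count) tuples from a Solr facet query.
    key_func: function that reduces a value to its cluster key.
    """
    buckets = defaultdict(list)
    for value, count in facet_values:
        buckets[key_func(value)].append((value, count))
    # Only keys that collect two or more distinct values are interesting.
    return {key: members for key, members in buckets.items() if len(members) > 1}
```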

In the screen below you can see the results of this first clustering.  I will go into detail about the values and options you have to work with the clusters.

Contributor Clusters with Fingerprint Key Collision Hashing

The first thing you might want to do is sort the clusters in a different way. By default they are sorted by the value of the cluster key. Sometimes this ordering makes sense; sometimes it isn’t obvious why things appear in the order they do. We thought about displaying the key itself but found that it was distracting in the interface.

Different ways of sorting clusters

One of the ways that I like to sort the clusters is by the number of cluster Members.  The image below shows the clusters with this sort applied.

Contributor Field sorted by Members

Here is a more detailed view of a few clusters. You can see that the name of the Russian composer Shostakovich has been grouped into a cluster of 14 members, which together represent 125 different records in the system with a Contributor element for this composer. Next to each Member Value you will see a number in parentheses: the number of records that use that variation of the value.

Contributor Cluster Detail

You can also sort based on the number of records that a cluster contains. This brings the most frequently used values to the top. Generally there are a large number of records that share one value and then a few records with a competing value, which is usually pretty easy to fix.

Contributor Element sorted by Records

Sorting by the Average Length Variation can help find values that are strange duplications of themselves.  Repeated phrases, a double copy and paste, strange things like that come to the surface.

Contributor Element sorted by Average Length Variation

Finally sorting by Average Length is helpful if you want to work with the longest or shortest values that are similar.

Contributor Element sorted by Average Length
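If you are curious how those sort options work, each one is a simple computation over a cluster. The sketch below is a plausible simplified version; in particular, treating “Average Length Variation” as the standard deviation of the member value lengths is my shorthand for the idea, not necessarily the exact formula in production.

```python
from statistics import mean, pstdev

def cluster_stats(members):
    """Sort metrics for one cluster of (value, record_count) members."""
    lengths = [len(value) for value, _ in members]
    return {
        "members": len(members),                        # distinct values in the cluster
        "records": sum(count for _, count in members),  # records using any of them
        "avg_length": mean(lengths),
        # Spread of the value lengths; a big spread hints at values that
        # are strange duplications of themselves.
        "length_variation": pstdev(lengths),
    }
```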

Different Algorithms

I’m going to go through the different algorithms that we currently have in production. Our hope is that as time moves forward we will introduce new algorithms, or slight variations of existing ones, to really get at some of the oddities of the data in the system. First up is the Fingerprint algorithm, which is a direct clone of the default fingerprint algorithm used by OpenRefine.

Contributor Element Clustered using Fingerprint Key Collision
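Concretely, the fingerprint keyer boils down to a few lines. This Python sketch approximates the behavior described in the OpenRefine documentation (trim, lowercase, replace punctuation with whitespace, fold characters to ASCII, then deduplicate and sort the tokens); treat it as illustrative rather than a copy of our production code.

```python
import re
import unicodedata

def fingerprint(value):
    """Approximation of OpenRefine's default fingerprint keyer."""
    value = value.strip().lower()
    # Replace punctuation with whitespace, as the default keyer does.
    value = re.sub(r"[^\w\s]", " ", value)
    # Fold accented characters to their closest ASCII equivalents.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Split into tokens, deduplicate, sort, and rejoin on single spaces.
    return " ".join(sorted(set(value.split())))
```

With this keyer, “Shostakovich, Dmitrii Dmitrievich, 1906-1975” and “Shostakovich, Dmitrii Dmitrievich 1906-1975” both reduce to “1906 1975 dmitrievich dmitrii shostakovich”, which is why they land in the same cluster.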

A small variation we introduced is Fingerprint-NS (No Space): instead of replacing punctuation with a whitespace character, it removes the punctuation without adding whitespace. This groups F.B.I. with FBI, where the default Fingerprint algorithm wouldn’t group them together. This small variation surfaces different clusters. We had to keep reminding ourselves when we created these algorithms that there isn’t such a thing as “best” or “better”; they are just “different”.

Contributor Element Clustered using Fingerprint (No Space) Key Collision

One thing that is really common for names in bibliographic metadata is dates: birth, death, flourished, and so on. We have a variation of the Fingerprint algorithm that removes all numbers in addition to punctuation. We call this one Fingerprint-ND (No Dates). It is helpful for grouping names that are missing dates with versions of the name that have dates. In the second cluster below I pointed out an instance of Mozart’s name that wouldn’t have been grouped by the default Fingerprint algorithm. Remember: different, not better or best.

Contributor Element Clustered using Fingerprint (No Dates) Key Collision
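Both variants are tiny deltas on the base keyer. Sketched in the same style, reusing the fingerprint function from above (the exact composition in our production code may differ):

```python
import re

def fingerprint_ns(value):
    """No Space variant: strip punctuation without inserting whitespace,
    so "F.B.I." and "FBI" collide on the key "fbi"."""
    return fingerprint(re.sub(r"[^\w\s]", "", value))

def fingerprint_nd(value):
    """No Dates variant: drop all digits before keying, so "Mozart,
    Wolfgang Amadeus, 1756-1791" and "Mozart, Wolfgang Amadeus" collide."""
    return fingerprint(re.sub(r"\d", "", value))
```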

From there we branch out into a few simpler algorithms. The Caseless algorithm just lowercases all of the values, so you see clusters whose members differ only in upper or lower case.

Contributor Element Clustered using Caseless (lowercase) Key Collision

Next up is the ASCII algorithm, which tries to group together values that differ only in diacritics. For instance, the names Jose and José would be grouped together.

Contributor Element Clustered using ASCII Key Collision

The final algorithm is a simple whitespace normalization called Normalize Whitespace; it collapses consecutive whitespace characters in order to group values.

Contributor Element Clustered using Normalized Whitespace Key Collision
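For completeness, here are plausible one-line keyers for these three simple algorithms, in the same style as the sketches above:

```python
import re
import unicodedata

def caseless(value):
    """Group values that differ only in upper or lower case."""
    return value.lower()

def ascii_key(value):
    """Group values that differ only in diacritics: José and Jose collide."""
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")

def normalize_whitespace(value):
    """Group values that differ only in runs of whitespace."""
    return re.sub(r"\s+", " ", value.strip())
```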

You may have noticed that the number of clusters went down dramatically from the Fingerprint algorithms to Caseless, ASCII, and Normalize Whitespace. This is why we generally want people to start with the Fingerprint algorithms: they will be useful most of the time.

Other Example Elements

Here are a few more examples from other fields.  I’ve gone ahead and sorted them by Members (High to Low) because I think that’s the best way to see the value of this interface.  First up is the Creator field.

Creator Element clustered with Fingerprint algorithm and sorted by Members

Next up is the Subject field. We have so, so many ways of saying “OU Football”.

Subject Element clustered with Fingerprint algorithm and sorted by Members

The real power of this interface is when you start fixing things. In the example below I want to focus on the value “Football (O U )”. I do this by clicking the link for that Member Value.

Subject Element Cluster Detail

You are taken directly to a result set that has the records for that selected value.  In this case there are two records with “Football (O U )”.

Selected Records

All you have to do at this point is open up a record, make the edit, and publish that record back. Many of you will say, “Yeah, but wouldn’t some sort of batch editing be faster here?” And I will answer, “Absolutely, we are going to look into how we would do that!” (but it is a non-trivial activity due to how we manage and store metadata, so sadface 🙁 )

Subject Value in the Record

There you have it: the Cluster Dashboard and how it works. The hope is to empower our metadata creators and metadata managers to better understand, and if needed clean up, the values in our metadata records. By doing so we are improving the ability for people to connect different records based on common values between the records.

As we move forward we will introduce a number of other algorithms that we can use to cluster values. There are also some other metrics that we will look at for sorting clusters, to try to tease out which clusters would be the most helpful for our users to correct first. That is always something we keep in the back of our heads: how can we provide a sorted list of the things most in need of human fixing? If you are interested in that sort of thing, stay tuned; I will probably talk about it on this blog.

If you have questions or comments about this post,  please let me know via Twitter.

Metadata Interfaces: Search Dashboard

This is the next blog post in a series that discusses some of the metadata interfaces that we have been working on improving over the summer for the UNT Libraries Digital Collections.  You can catch up on those posts about our Item Views, Facet Dashboard, and Element Count Dashboard if you are curious.

In this post I’m going to talk about our Search Dashboard. This dashboard is really the bread and butter of our whole metadata editing application. About 99% of the time, a user doing metadata work will log in and work with this interface to find the records they need to create or edit. The records they can see and search are only the ones they have privileges to edit. In this post you will see what I see when I log in to the system: the nearly 1.9 million records that we are currently managing.

Let’s get started.

Search Dashboard

If you have read the other posts, you will probably notice quite a bit of similarity between the interfaces; all of them were based on this search interface. You can divide the dashboard into three primary sections. On the left side there are facets that allow you to refine your view in a number of ways. At the top of the right column is an area where you can search for a term or phrase in the records you are interested in. Finally, under the search box there is a result set of items and various ways to interact with those results.

By default all the records that you have access to are viewable if you haven’t refined your view with a search or a limiting facet.

Edit Interface Search Dashboard

The search section of the dashboard lets you find a specific record or set of records that you are interested in working with. You can choose to search across all of the fields in the metadata record or just a specific field, using the dropdown next to where you enter your search term. You can search single words, phrases, or unique identifiers for records if you have those. Once you hit the search button you are on your way.

Search and View Options for Records

Once you have submitted your search you will get back a set of results.  I’ll go over these more in depth in a little bit.

Record Detail

You can sort your results in a variety of ways. By default they are returned in Title order, but you can sort them by the date they were added to the system, the date the original item was created, the date the metadata record was last modified, the ARK identifier, and finally by a completeness metric. You also have the option to change from the default list view to the grid view.

Sort Options

Here is a look at the grid view.  It presents a more visually compact view of the records you might be interested in working with.

Grid View

The image below is a detail of a record view. We tried to pack as much useful information into each row as we could. On the left part of the row we have the title, a thumbnail, and several links to either the edit view or the summary item view. Following that we have the system, collection, and partner that the record belongs to, the unique ARK identifier for the object, the date it was added to the UNT Libraries’ Digital Collections, and the date the metadata was last modified. Finally, there is a green check if the item is visible to the public or a red X if the item is hidden from the public.

Record Detail

Facet Section

There are a number of different facets that you can use to limit the records you are working with to a smaller subset. The list is pretty long, so I’ll first show it in a single image and then go over some of the specifics in more detail below.

Facet Options

The first three facets are the system, collection, and partner facets. We manage records for three systems with this interface: The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

Each digital item can belong to multiple collections and generally belongs to a single partner organization. If you are interested in just working on the records of the KXAS-NBC 5 News Collection, you can limit your view of records by selecting that value from the Collections facet area.

System, Collections and Partners Facet Options

Next are the Resource Type and Visibility facets. It is often helpful to limit to a specific resource type, like Maps, when you are doing your metadata editing so that you don’t see things you aren’t interested in working with. Likewise, for some kinds of metadata editing you want to focus primarily on items that are already viewable to the public and don’t want the hidden records to get in the way. You can do this with the Visibility facet.

Resource Type and Visibility Facet Options

Next we start getting into the new facet types that we added this summer to help identify records that need some metadata uplift.  We have the Date Validity, My Edits, and Location Data facets.

Date Validity is a facet that allows you to identify records containing dates that are not valid according to the Extended Date Time Format (EDTF). Two different fields in a record are checked: the date field and the coverage field (which can contain dates). If any of these values isn’t a valid EDTF string, we mark the whole record as having Invalid Dates. You can use this facet to identify these records and go in and correct those values.
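To give a flavor of the check, here is a deliberately tiny sketch. Real EDTF validation handles intervals, unspecified digits, approximate dates, and more, so a full validator (such as the edtf-validate package) is the right tool; the field handling here is also simplified, since coverage mixes dates with place names.

```python
import re

# Only the simplest EDTF level 0 dates: YYYY, YYYY-MM, YYYY-MM-DD.
# Real EDTF is much richer, so treat this as a stand-in for a full validator.
EDTF_SIMPLE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

def has_invalid_dates(record):
    """True if any date value in the record fails the (simplified) check.

    record: dict with "date" and "coverage" lists; coverage mixes dates
    with place names, so we only check values that start with a digit.
    """
    values = list(record.get("date", []))
    values += [v for v in record.get("coverage", []) if v[:1].isdigit()]
    return any(not EDTF_SIMPLE.match(v) for v in values)
```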

Next up is a facet for just the records that you have edited in the past. This can be helpful for a number of reasons. I use it from time to time to see if any of the records I’ve edited have developed issues, such as dates that are no longer valid, since I last edited them. It doesn’t happen often, but it can be helpful.

Finally there is a section for Location Data. This set of facets is helpful for identifying records that have, or don’t have, a Place Name, Place Point, or Place Box in the record. That is handy if you are working through a collection trying to add geographic information to the records.

Date Validity, My Edits, and Location Data Facet Options

The final set of facets are Recently Edited Records and Record Completeness. The first, Recently Edited Records, is pretty straightforward: it is a listing of how many records in the system have been edited in the past 24h, 48h, 7d, 30d, 180d, or 365d. One note that causes a bit of confusion here: these are records edited by anyone in the given period of time. The facet is often misunderstood as “your edits” in a given period, which isn’t true. It is still very helpful, but it can give you some strange results if you read it the other way.
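Under the hood, a facet like this is nothing more exotic than a handful of Solr date-math facet queries over a last-modified timestamp. A sketch, with an illustrative field name:

```python
# Solr date-math range queries over the record's last-modified timestamp.
# "metadata_modified" is an illustrative field name. Note the queries count
# edits by anyone, which is why this facet is not "your edits".
RECENT_BUCKETS = [
    "metadata_modified:[NOW-1DAY TO NOW]",     # 24h
    "metadata_modified:[NOW-2DAYS TO NOW]",    # 48h
    "metadata_modified:[NOW-7DAYS TO NOW]",    # 7d
    "metadata_modified:[NOW-30DAYS TO NOW]",   # 30d
    "metadata_modified:[NOW-180DAYS TO NOW]",  # 180d
    "metadata_modified:[NOW-365DAYS TO NOW]",  # 365d
]

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.query": RECENT_BUCKETS,  # one count comes back per query
}
```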

The last facet is Record Completeness. We really have two categories: records that have a completeness of 1.0 (Complete Records) and records that are less than 1.0 (Incomplete Records). This metric is calculated when the item is indexed in the system and is based on our notion of a minimally viable record.
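In spirit, the calculation is a presence check over the fields we consider part of a minimally viable record, normalized so a complete record scores 1.0. The field list and the equal weighting in this sketch are illustrative; the real metric may differ in its details.

```python
# Illustrative only: the real metric's field list and weights may differ.
MINIMALLY_VIABLE = ["title", "description", "language", "collection",
                    "institution", "resource_type", "format", "subject"]

def completeness(record):
    """1.0 when every minimally viable field has a value, else < 1.0."""
    present = sum(1 for field in MINIMALLY_VIABLE if record.get(field))
    return present / len(MINIMALLY_VIABLE)
```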

Recently Edited Records and Record Completeness Facet Options

This finishes this post about the Search Dashboard for the UNT Libraries Digital Collections. We have been building out this metadata environment for about the last eight years and have slowly refined it toward the metadata creation and editing workflows that work for the widest number of folks here at UNT. There are always improvements that we can make, and we have been steadily chipping away at those over time.

There are a few other things that we’ve been working on over the summer that I will post about in the next week or so, so stay tuned for more.

If you have questions or comments about this post,  please let me know via Twitter.

Metadata Quality Interfaces: Element Count Dashboard

Next up in our review of the new metadata quality interfaces we have implemented this summer is our Element Count Dashboard.

The basics are that whenever we index metadata records in our Solr index, we count the number of instances of each element, and of each element with a specific qualifier, and store those counts in the index. This results in hundreds of fields holding the counts of element instances.
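In code, that indexing step amounts to something like the following sketch, which turns one record into a pile of count fields for the Solr document (the field-naming scheme here is illustrative):

```python
def element_count_fields(record):
    """Build Solr count fields for one record at index time.

    record: dict mapping element names to lists of instances, where each
    instance is a dict that may carry a "qualifier" key.
    """
    counts = {}
    for element, instances in record.items():
        counts[f"{element}_count"] = len(instances)
        per_qualifier = {}
        for instance in instances:
            qualifier = instance.get("qualifier") or "none"
            per_qualifier[qualifier] = per_qualifier.get(qualifier, 0) + 1
        for qualifier, n in per_qualifier.items():
            counts[f"{element}_{qualifier}_count"] = n
    return counts
```

A record with one content description and one physical description would come out with something like description_count=2, description_content_count=1, and description_physical_count=1, which is the shape of data the dashboard queries against.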

We built an interface on top of these counts because we had a hunch that we would be able to use this information to help us identify problems in our metadata records. It feels like I’m showing some things in our metadata that we probably don’t want to highlight, but it is all in the service of helping others understand. So onward!

Element Count Dashboard

The dashboard is similar to other dashboards in the Edit system.  You have the ability to limit your view to just the collection, partner or system you are interested in working with.

Count Dashboard

From there you can select an element you are interested in viewing counts for.  In the example below I am interested in looking at the Description element or field.

Select an Element to View Counts

Once your selection is made, you are presented with the number of instances of the Description field per record. This is a little more helpful if you know that in our metadata world a nice clean record will generally have two Description fields: one for a content description and one for a physical description of the item. More than two is usually strange, and zero is usually bad.

Counts for Description Elements

To get a clearer view you can see the detail below.  This again is for the top level Description element where we like to have two descriptions.

Detail of Description Counts

You can also limit to a specific qualifier. In the example below you see the counts of Description elements with a content qualifier. The 1,667 records that have two Description elements with a content qualifier are pretty strange. We should probably fix those.

Detail of Description Counts for Content Qualifier

Next we limit to just the physical description qualifier. You will see that there are a bunch of records that don’t have any sort of physical description, and then 76 that have two. We should fix both of those record sets.

Detail of Description Counts for Physical Qualifier

Because of the way we index things, we can also get at the Description elements that have neither a content nor a physical qualifier. These are identified with a value of none for the qualifier. You can see that there are 1,861,356 records that have zero Description elements with a none qualifier. That’s awesome. You can also see 52 records that have one such element and 261 that have two elements missing qualifiers. That’s not awesome.

Detail of Description Counts for None Qualifier

I’m hoping you are starting to see how this kind of interface can be used to drill into records that look a little strange. When you identify something strange, all you have to do is click on the number and you are taken directly to the records that match what you’ve asked for. In the example below we are seeing all 76 of the records that have two physical descriptions, because this is something we are interested in correcting.

Records with Multiple Duplicate Physical Qualifiers

If you open up a record to edit you will see that yes, in fact there are two Physical Descriptions in this record. It looks like the first one should actually be a Content Description.

Example of two physical descriptions that need to be fixed

Once we change that value we can hit the Publish button and be on our way fixing other metadata records.  The counts will update about thirty seconds later to reflect the corrections that you have made.

Fixed Physical and Content Descriptions

Even more of a good thing.

Because I think this is a little different than other interfaces you might be used to, it might be good to see another example.

This time we are looking at the Creator element in the Element Count Dashboard.

Creator Counts

You will see that there are 112 different counts, ranging from zero all the way up to way, way too many creators on a single item (silly physics articles).

I was curious to see what the counts looked like for Creator elements that were missing a role qualifier.  These are identified by selecting the none value from the qualifier dropdown.

Creator Counts for Missing Qualifiers

You can see that the majority of our records don’t have Creator elements missing the role qualifier, but there are a number that do. We can fix those. If you wanted to look at the records that have five Creator elements without a role, you would end up with records that look like the one below.

Example of Multiple Missing Types and Roles

You will notice that when a record has a problem, there are often multiple things wrong with it. In this case, not only is role information missing for each of these Creator elements, but name type information is missing as well. Once we fix those we can move along and edit some more.

And a final example.

I’m hoping you are starting to see how this interface could be useful. Here is another example if you aren’t convinced yet. We are completing a retrospective digitization of theses and dissertations here at UNT. Not only is this a lot of digitization, it is also quite a bit of metadata that we are adding to both the UNT Digital Library and our traditional library catalog. Let’s look at some of those records.

You can limit your dashboard view to the collection you are interested in working on.  In this case we choose the UNT Theses and Dissertations collection.

Next up we take a look at the number of Creator elements per record. Theses and dissertations are generally authored by just one person, so it would be strange to see counts other than one.

Creator Counts for Theses and Dissertations Collection

It looks like there are 26 records that are missing Creator elements and a single record that for some reason has two Creator elements. This is strange, and we should take a look.

Below you will see the view of the 26 records that are missing a Creator element. Sadly, at the time of writing, seven of these are visible to the public, so that’s something we really need to fix in a hurry.

Example Theses that are Missing Creators

That’s it for this post about our Element Count Dashboard.  I hope that you find this sort of interface interesting.  I’d be interested to hear if you have interfaces like this for your digital library collections or if you think something like this would be useful in your metadata work.

If you have questions or comments about this post,  please let me know via Twitter.