This is the last of the updates from our summer’s activities in creating new metadata interfaces for the UNT Libraries Digital Collections. If you are interested in the others in this series you can view the past few posts on this blog where I talk about our facet, count, search, and item interfaces.
This time I am going to talk a bit about our Cluster Dashboard. This interface took a little bit longer than the others to complete. Because of this, we are just rolling it out this week, but it is before Autumn so I’m calling it a Summer interface.
I warn you that there are going to be a bunch of screenshots here, so if you don’t like those, you probably won’t like this post.
For a number of years I have been using OpenRefine for working with spreadsheets of data before we load them into our digital repository. This tool has a number of great features that help you get an overview of the data you are working with, as well as identify some problem areas that you should think about cleaning up. The feature that I have always felt was the most interesting is its data clustering interface. The idea of this interface is that you choose a facet (dimension, column) of your data and then group like values together. There are a number of ways of doing this grouping and for an in-depth discussion of those algorithms I will point you to the wonderful OpenRefine Clustering documentation.
OpenRefine is a wonderful tool for working with spreadsheets (and a whole bunch of other types of data) but there are a few challenges that you run into when you are working with data from our digital library collections. First of all, our data generally isn’t rectangular. It doesn’t easily fit into a spreadsheet. We have some records with one creator and some records with dozens of creators. There are ways to work with these multiple values but things get complicated. The bigger challenge is that while many systems can generate a spreadsheet of their data for exporting, very few of them (our system included) have a way of importing changes back into the system in a spreadsheet format. This means that you could pull data from the system and clean it up in OpenRefine, but when you were ready to put it back you would find there wasn’t a way to get that nice clean data back into the system. One way to use OpenRefine was to identify the records that needed changes and then go back into the system and change those records there. But that is far from ideal.
So how did we overcome this? We wanted to use the OpenRefine clustering but couldn’t get data easily back into our system. Our solution? Bake the OpenRefine clustering right into the system. That’s what this post is about.
The first thing you see when you load up the Cluster Dashboard is a quick bit of information about how many records, collections, and partners you are going to be working on values from. This is helpful to let you know the scope of what you are clustering, both to understand why it might take a while to generate clusters and because it is generally better to run these clustering tools over the largest sets of data that you can, since that pulls in variations from many different records. Other than that you are presented with a pretty standard dashboard interface from the UNT Libraries’ Edit System. You can limit to subsets of records with the facets on the left side and the number of items you cluster over will change accordingly.
The next thing that you will see is a little help box below the clustering stats. This is a help interface that helps to explain how to use the clustering dashboard and a little more information about how the different algorithms work. Metadata folks generally like to know the fine details about how the algorithms work, or at least be able to find that information if they want to know it later.
The first thing you do is select a field/element/facet that you are interested in clustering. In the example below I’m going to select the Contributor field.
Once you make a selection you can further limit it to a qualifier. In this case you could limit it to just the Contributors that are organizations, or Contributors that are Composers. As I said above, using more data generally works better so we will just run the algorithms over all of the values. You next have the option of choosing an algorithm for your clustering. We recommend that people start with the default Fingerprint algorithm because it is a great starting point. I will discuss the other algorithms later in this post.
After you select your algorithm, you hit submit and things start working. You are given a screen that will have a spinner that tells you the clusters are generating.
Depending on your dataset size and the number of unique values of the selected element, you could get your results back in a second or in dozens of seconds. The general flow of data after you hit submit is to query the Solr backend for all of the facet values and their counts. These values are then processed with the chosen algorithm, which creates a “key” for each value. Another way to think about it is that the values are placed into a bucket that groups similar values together. There are some calculations that are performed on the clusters and then they are cached for about ten minutes by the system. After you wait for the clusters to generate the first time, they are much quicker for the next ten minutes.
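To make the key-and-bucket idea a bit more concrete, here is a minimal sketch in Python. It is not the code from our system. The function name `cluster_values`, the sample values, and the counts are all made up for illustration, and a plain dictionary stands in for the facet values and counts that would come back from Solr.

```python
from collections import defaultdict


def cluster_values(facet_counts, key_func):
    """Group facet values that share a key; keep only buckets with 2+ members."""
    buckets = defaultdict(list)
    for value, count in facet_counts.items():
        # Values that produce the same key land in the same bucket.
        buckets[key_func(value)].append((value, count))
    return {key: members for key, members in buckets.items() if len(members) > 1}


# Hypothetical facet values and record counts (a stand-in for a Solr facet response).
facet_counts = {
    "Shostakovich, Dmitri": 40,
    "shostakovich, dmitri": 3,
    "Mozart, Wolfgang Amadeus": 12,
}

# Any function that turns a value into a key can be plugged in here.
clusters = cluster_values(facet_counts, key_func=str.lower)
print(clusters)
```

A bucket with only one member isn't a cluster, so the sketch drops those. Swapping in a different `key_func` is all it takes to try a different algorithm.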
In the screen below you can see the results of this first clustering. I will go into detail about the values and options you have to work with the clusters.
The first thing that you might want to do is sort the clusters in a different way. By default they are sorted by the value of the cluster key. Sometimes this order makes sense, and sometimes it isn’t obvious why something is in a given position. We thought about displaying the key but found that it was also distracting in the interface.
One of the ways that I like to sort the clusters is by the number of cluster Members. The image below shows the clusters with this sort applied.
Here is a more detailed view of a few clusters. You can see that the name of the Russian composer Shostakovich has been grouped into a cluster of 14 members. This represents 125 different records in the system with a Contributor element for this composer. Next to each Member Value you will see a number in parentheses. This is the number of records that use that variation of the value.
You can also sort based on the number of records that a cluster contains. This brings up the most frequently used values. Generally there are a large number that have a value and then a few records that have a competing value. Usually pretty easy to fix.
Sorting by the Average Length Variation can help find values that are strange duplications of themselves. Repeated phrases, a double copy and paste, strange things like that come to the surface.
Finally sorting by Average Length is helpful if you want to work with the longest or shortest values that are similar.
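Here is a rough sketch of how a few of these sorts could be computed. The cluster data and function names are invented for illustration, and each cluster is assumed to be a list of (value, record count) pairs. I have left out Average Length Variation because its exact formula isn't spelled out in this post.

```python
# Hypothetical clusters: key -> list of (member value, number of records).
clusters = {
    "key-a": [("Value One", 5), ("value one", 1)],
    "key-b": [("Another Value", 2), ("another value", 2), ("ANOTHER VALUE", 1)],
}


def member_count(members):
    """Number of distinct value variations in the cluster."""
    return len(members)


def record_count(members):
    """Total number of records across all variations in the cluster."""
    return sum(count for _, count in members)


def average_length(members):
    """Mean character length of the member values."""
    return sum(len(value) for value, _ in members) / len(members)


# Sort cluster keys from high to low on each metric.
by_members = sorted(clusters, key=lambda k: member_count(clusters[k]), reverse=True)
by_records = sorted(clusters, key=lambda k: record_count(clusters[k]), reverse=True)
by_length = sorted(clusters, key=lambda k: average_length(clusters[k]), reverse=True)
```

With this sample data, sorting by members puts "key-b" first (three variations) while sorting by records puts "key-a" first (six records).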
I’m going to go through the different algorithms that we currently have in production. Our hope is that as time moves forward we will introduce new algorithms or slight variations of algorithms to really get at some of the oddities of the data in the system. First up is the Fingerprint algorithm. This is a direct clone of the default fingerprint algorithm used by OpenRefine.
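For readers who haven't looked at how a fingerprint key is built, here is a small sketch based on the steps described in the OpenRefine clustering documentation. It is an approximation and not our production code. It assumes punctuation is replaced with a space, it only handles ASCII punctuation via `string.punctuation`, and the function name `fingerprint` is just for illustration.

```python
import string
import unicodedata


def fingerprint(value):
    """Build an order-insensitive, case-insensitive key for a value."""
    # Trim and lowercase.
    value = value.strip().lower()
    # Replace ASCII punctuation with a space (assumption for this sketch).
    value = "".join(" " if ch in string.punctuation else ch for ch in value)
    # Fold accented characters to their closest ASCII form.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Split into tokens, drop duplicates, sort, and join back together.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


print(fingerprint("Shostakovich, Dmitri"))
print(fingerprint("  Dmitri   SHOSTAKOVICH "))
```

Both calls produce the same key, so the two values would end up in the same cluster.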
A small variation we introduced is Fingerprint-NS (No Space). Instead of replacing punctuation with a whitespace character, it just removes the punctuation without adding whitespace. This would group F.B.I with FBI where the other Fingerprint algorithm wouldn’t group them together. This small variation surfaces different clusters. When we created the algorithms we had to keep reminding ourselves that there wasn’t such a thing as “best” or “better”; they were just “different”.
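A sketch of the no-space idea might look like the following. As before this is an illustration and not our actual implementation: the name `fingerprint_ns` is made up, and only ASCII punctuation is handled.

```python
import string
import unicodedata


def fingerprint_ns(value):
    """Fingerprint-style key where punctuation is deleted, not replaced by a space."""
    value = value.strip().lower()
    # Drop punctuation entirely so "f.b.i" collapses to "fbi".
    value = "".join(ch for ch in value if ch not in string.punctuation)
    # Fold accented characters to their closest ASCII form.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Split into tokens, drop duplicates, sort, and join back together.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


print(fingerprint_ns("F.B.I"))
print(fingerprint_ns("FBI"))
```

Both values come out as the key "fbi". With punctuation replaced by spaces, "F.B.I" would instead break into three single-letter tokens and never match "FBI".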
One thing that is really common for names in bibliographic metadata is that they have many dates. Birth, death, flourished, and so on. We have a variation of the Fingerprint algorithm that removes all numbers in addition to punctuation. We call this one Fingerprint-ND (No Dates). This is helpful for grouping names that are missing dates with versions of the name that have dates. In the second cluster below I pointed out an instance of Mozart’s name that wouldn’t have been grouped with the default Fingerprint algorithm. Remember, different, not better or best.
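Here is a sketch of what a no-dates key could look like. The function name `fingerprint_nd` and the choice to replace digits and punctuation with spaces are assumptions for this example, and the sample name strings are only illustrative.

```python
import string
import unicodedata


def fingerprint_nd(value):
    """Fingerprint-style key that also ignores digits, such as birth and death dates."""
    value = value.strip().lower()
    # Replace ASCII punctuation and digits with a space (assumption for this sketch).
    value = "".join(
        " " if ch in string.punctuation or ch.isdigit() else ch for ch in value
    )
    # Fold accented characters to their closest ASCII form.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Split into tokens, drop duplicates, sort, and join back together.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


print(fingerprint_nd("Mozart, Wolfgang Amadeus, 1756-1791"))
print(fingerprint_nd("Mozart, Wolfgang Amadeus"))
```

The dated and undated forms of the name produce the same key, which is the grouping this variation is meant to surface.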
From there we branch out into a few simpler algorithms. The Caseless algorithm just lowercases all of the values and you can see clusters that only differ in ways that are related to upper case or lower case values.
Next up is the ASCII algorithm which tries to group together values that only differ in diacritics. So for instance the names Jose and José would be grouped together.
The final algorithm is just a whitespace normalization called Normalize Whitespace. It collapses consecutive whitespace characters to group values.
You may have noticed that the number of clusters went down dramatically from the Fingerprint algorithms to the Caseless, ASCII, or Normalize Whitespace algorithms. We generally want people to start with the Fingerprint algorithms because they will be useful most of the time.
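These three simpler algorithms are easy to sketch as key functions. The function names are made up for this example, and the exact rules in our system may differ. In particular, I'm assuming that whitespace normalization collapses runs of whitespace to a single space and trims the ends.

```python
import unicodedata


def caseless_key(value):
    """Key that ignores only upper/lower case differences."""
    return value.lower()


def ascii_key(value):
    """Key that ignores diacritics by folding characters to ASCII."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def whitespace_key(value):
    """Key that collapses runs of whitespace and trims the ends."""
    return " ".join(value.split())


print(caseless_key("OU Football"))
print(ascii_key("José"))
print(whitespace_key("Football  (O U )"))
```

Each of these changes only one thing about the value, which is why they tend to produce far fewer clusters than the Fingerprint family.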
Other Example Elements
Here are a few more examples from other fields. I’ve gone ahead and sorted them by Members (High to Low) because I think that’s the best way to see the value of this interface. First up is the Creator field.
Next up is the Subject field. We have so, so many ways of saying “OU Football”.
The real power of this interface is when you start fixing things. In the example below I’m wanting to focus in on the value “Football (O U )”. I do this by clicking the link for that Member Value.
You are taken directly to a result set that has the records for that selected value. In this case there are two records with “Football (O U )”.
All you have to do at this point is open up a record, make the edit and publish that record back. Many of you will say “yeah but wouldn’t some sort of batch editing be faster here?” And I will answer “absolutely, we are going to look into how we would do that!” (but it is a non-trivial activity due to how we manage and store metadata, so sadface 🙁 )
There you have it, the Cluster Dashboard and how it works. The hope is to empower our metadata creators and metadata managers to better understand and, if needed, clean up the values in our metadata records. By doing so we are improving the ability for people to connect different records based on common values between the records.
As we move forward we will introduce a number of other algorithms that we can use to cluster values. There are also some other metrics that we will look at for sorting records to try and tease out “which clusters would be the most helpful to our users to correct first”. That is always something we are keeping in the back of our heads: how can we provide a sorted list of the things that are most in need of human fixing? So if you are interested in that sort of thing stay tuned, I will probably talk about it on this blog.
If you have questions or comments about this post, please let me know via Twitter.