How do metadata records change over time?

Since September 2009 the UNT Libraries has been versioning the metadata edits that happen in the digital library system that powers The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.  In those eight years the collection has grown from a modest size of 66,000 digital objects to the 1,814,000 digital objects that we manage today.  We’ve always tried to think of the metadata in our digital library as a constantly changing dataset.  Just how much it changes we don’t always pay attention to.

In 2014 a group of us worked on a few papers about metadata change at a fairly high level in the repository. How Descriptive Metadata Changes in the UNT Libraries Collections: A Case Study  this paper reported out on the analysis of almost 700,000 records that were in the repository at that time.  Another study Exploration of Metadata Change in a Digital Repository was presented the following year in 2015 by some colleagues in the UNT College of Information that used a smaller sample of records to answer a few more questions about what changes in descriptive metadata at the UNT Libraries.

It has been a few years since these studies so it is time again to take a look at our metadata and do a little analysis to see if anything pops out.

Metadata Edit Dataset

The dataset we are using for this analysis was generated on May 4th, 2017 by creating a copy of all of the metadata records and their versions to a local filesystem for further analysis. The complete dataset is for 1,811,640 metadata records.

Of those 1,811,640 metadata records, 683,933 had been edited at least once since they were loaded into the repository.  There are 62% of the records that have just one instance (no changes) in the system and another 38% that have at least one edit.

Records Edited in Dataset

We store all of our metadata on the filesystem as XML files using a local metadata format we call UNTL.  When a record is edited, the old version of the record is renamed with a version number and the new version of the record takes its place as the current version of a record. This has worked pretty well over the years for us and allows us to view previous versions of metadata records through a metadata history screen in our metadata system.

UNT Metadata History Interface

This metadata history view is helpful for tracking down strange things that happen in metadata systems from time to time.  Because some records are edited multiple times (like in the example screenshot above) we end up with a large number of metadata edits that we can look at over time.

After staging all of the metadata records on a local machine I wrote a script that would compare two different records and output which elements in the record changed. While this sounds like a pretty straight forward thing to do, there are some fiddly bits that you need to watch out for that I will probably cover in a separate blog post. Most of these have to do with XML as a serialization format and some of the questions on how you interpret different things.  As a quick example think about these three notations.

<title></title>
<title />
<title qualifier='officialtitle'></title>

When comparing fields should those three examples all mean the same thing as far as a metadata record is concerned?  But like I said something to get into for a later post.

Once I had my script to compare two records, the next step was to create pairs of records to compare and then iterate over all of those record pairs.  This resulted in 1,332,936 edit events that I could look at.  I created a JSON document for each of these edit events and then loaded this document into Solr for some later analysis.  Here is what one of these records looks like.

{
  "change_citation": 0,
  "change_collection": 0,
  "change_contributor": 1,
  "change_coverage": 0,
  "change_creator": 0,
  "change_date": 0,
  "change_degree": 0,
  "change_description": 0,
  "change_format": 0,
  "change_identifier": 1,
  "change_institution": 0,
  "change_meta": 1,
  "change_note": 0,
  "change_primarySource": 0,
  "change_publisher": 0,
  "change_relation": 0,
  "change_resourceType": 0,
  "change_rights": 0,
  "change_source": 0,
  "change_subject": 0,
  "change_title": 0,
  "collections": [
    "NACA",
    "TRAIL"
  ],
  "completeness_change": 0,
  "content_length_change": 12,
  "creation_to_edit_seconds": 123564535,
  "edit_number": 1,
  "elements_changed": 3,
  "id": "metadc58589_2015-10-16T11:02:09Z",
  "institution": [
    "UNTGD"
  ],
  "metadata_creation_date": "2011-11-16T07:33:14Z",
  "metadata_edit_date": "2015-10-16T11:02:09Z",
  "metadata_editor": "htarver",
  "r1_ark": "ark:/67531/metadc58589",
  "r1_completeness": 0.9830508474576272,
  "r1_content_length": 2108,
  "r1_record_length": 2351,
  "r2_ark": "ark:/67531/metadc58589",
  "r2_completeness": 0.9830508474576272,
  "r2_content_length": 2120,
  "r2_record_length": 2543,
  "record_length_change": 192,
  "systems": "DC"
}

Some of the fields don’t mean much now but the main fields we want to look at are the change_* fields.  These represent the 21 metadata elements that we have use here for the UNTL metadata format.  Here they are in a more compact view.

  • title
  • creator
  • contributor
  • publisher
  • date
  • description
  • subject
  • primarySource
  • coverage
  • source
  • citation
  • relation
  • collection
  • institution
  • rights
  • resourceType
  • format
  • identifier
  • degree
  • note
  • meta

You may notice that these elements include the 15 Dublin Core elements plus six other fields that we’ve found useful to have in our element set.

The first thing I wanted to answer was which of these 21 fields was edited the most in the 1.3 million records edits that we have.

Metadata Element Changes

You can see that the meta field in the records changes almost 100% of the time.  That is because whenever you edit a record the values of the most recent metadata editor and the edit time change so the values of the elements should change each edit.

I have to admit that I was surprised that the description field was the most edited field in the metadata edits.  There were 403,713 (30%) of the edits that had the description field change in some way. This is followed by title at 304,396 (23%) and subject at 272,703 (20%).

There are a number of other things that I will be doing with this dataset as I move forward. In addition to what fields changed I should be able to look at how many field on average change in records.  I then want to see if there are any noticeable differences when you look at different subsets like specific editors, or collections.

So if you are interested in metadata change stay tuned.

If you have questions or comments about this post,  please let me know via Twitter.