Metadata Edit Events: Part 5 – Identifying an average metadata editing time.

This is the fifth post in a series related to metadata edit events for the UNT Libraries’ Digital Collections from January 1, 2014 to December 31, 2014. If you are interested in the previous posts in this series, they covered the when, what, who, and first steps of duration.

In this post we are going to try and come up with the “average” amount of time spent on metadata edits in the dataset.

The first thing I wanted to do was to figure out which of the values mentioned in the previous post about duration buckets I could ignore as noise in the dataset.

As a reminder, the duration data for a metadata edit event starts when a user opens a metadata record in the edit system and finishes when they submit the record back to the system as a publish event. The duration is the difference, in seconds, between those two timestamps.
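Just to make that concrete, here is a tiny sketch in R of how a single duration could be derived from the two timestamps. The timestamp values and variable names are made up for illustration and aren’t from the production system.

    # Hypothetical open and publish timestamps for one edit event
    open_time    <- as.POSIXct("2014-03-10 14:02:11", tz = "UTC")
    publish_time <- as.POSIXct("2014-03-10 14:05:28", tz = "UTC")

    # Duration is the difference in seconds between the two timestamps
    duration <- as.numeric(difftime(publish_time, open_time, units = "secs"))
    duration  # 197 seconds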

There are a number of factors that can cause the duration data to vary wildly: a user can have a number of tabs open at the same time while only working in one of them, they may open a record and then walk off without editing that record, or they could be using a browser automation tool like Selenium that automates the metadata edits and therefore pushes the edit time down considerably.

In doing some tests of my own editing skills, it isn’t unreasonable to have edits that are four or five seconds in duration if you are going in to change a known value from a simple dropdown. For example, adding a language code to a photograph that you know should be “no-language” doesn’t take much time at all.

My gut feeling based on the data in the previous post was to say that edits with a duration of over one hour should be considered outliers. This would remove 844 events from the total 94,222 edit events, leaving me with 93,378 (99%) of the events. This seemed like a logical first step, but I was curious whether there were other ways of approaching it.
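As a rough sketch of that first cut, the R snippet below counts how many events fall at or under a one hour ceiling. It assumes the per-event durations are already loaded into a numeric vector that I’m calling durations, which is a name I made up for this post rather than anything from our systems.

    # durations is assumed to be a numeric vector of all 94,222 edit
    # durations in seconds (a hypothetical name for illustration)
    ceiling_secs <- 3600                       # one hour
    kept    <- sum(durations <= ceiling_secs)  # events at or under the ceiling
    removed <- length(durations) - kept        # events dropped as outliers
    kept / length(durations)                   # fraction of events retained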

I had a chat with the UNT Libraries’ Director of Research & Assessment Jesse Hamner and he suggested a few methods for me to look at.

IQR for calculating outliers

I took a stab at using the Interquartile Range of the dataset as the basis for identifying the outliers.  With a little bit of R I was able to find the following information about the duration dataset.

 Min.   :     2.0  
 1st Qu.:    29.0  
 Median :    97.0  
 Mean   :   363.8  
 3rd Qu.:   300.0  
 Max.   :431644.0  

With that I have a Q1 of 29 and a Q3 of 300, which gives me an IQR of 271.

So the fences for outliers are Q1 - 1.5 × IQR on the low end and Q3 + 1.5 × IQR on the high end.

With these numbers, that says values under -377.5 or over 706.5 seconds should be considered outliers.
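In R that calculation looks roughly like the following, again leaning on the hypothetical durations vector from earlier.

    # IQR fences, using the same quartiles reported by summary() above
    q1  <- unname(quantile(durations, 0.25))   # 29 in this dataset
    q3  <- unname(quantile(durations, 0.75))   # 300 in this dataset
    iqr <- q3 - q1                             # 271

    lower_fence <- q1 - 1.5 * iqr              # -377.5
    upper_fence <- q3 + 1.5 * iqr              # 706.5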

Note: I’m pretty sure there are some different ways of dealing with IQR and datasets that end at zero, so that’s something to investigate.

For me the key here is that I’ve come up with 706.5 seconds as the ceiling for a valid event duration based on this method. That’s roughly 11 minutes and 47 seconds. If I limit the dataset to edit events that are under 707 seconds, I am left with 83,239 records. That is now just 88% of the dataset, with 12% being considered outliers. This seemed like too many records to ignore, so after talking with my resident expert in the library I had a new method.

Two Standard Deviations

I took a look at what the timings would look like if I based my outlier cutoffs on standard deviations. Edit events that are under 1,300 seconds (21 min 40 sec) in duration amount to 89,547, which is 95% of the values in the dataset. I also wanted to see what trimming just the top 2.5% of the dataset would look like. Edit durations under 2,100 seconds (35 minutes) result in 91,916 usable edit events for calculations, which is right at 97.6%.
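I’m not claiming this is exactly how those cutoffs shake out of the data, but one way to find ceilings that keep roughly 95% and 97.5% of the events is to ask for those quantiles of the durations directly.

    # Ceilings that retain roughly 95% and 97.5% of the (hypothetical)
    # durations vector
    quantile(durations, probs = c(0.95, 0.975))
    # In this dataset those land near 1,300 and 2,100 seconds respectively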

Comparing the methods

The following table takes the four duration ceilings that I tried (IQR, 95% and 97.5%, and the gut-feeling one hour) and makes them a bit more readable. The total number of duration events in the dataset before limiting is 94,222.

Duration Ceiling (sec)   Events Remaining   Events Removed   % Remaining
707                      83,239             10,983           88%
1,300                    89,547             4,675            95%
2,100                    91,916             2,306            97.6%
3,600                    93,378             844              99%
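A table like this could also be rebuilt straight from the durations vector with something along these lines, which again is just a sketch rather than the exact code I used.

    # Rebuilding the comparison from the hypothetical durations vector
    ceilings  <- c(707, 1300, 2100, 3600)
    remaining <- sapply(ceilings, function(x) sum(durations < x))
    removed   <- length(durations) - remaining

    data.frame(ceiling       = ceilings,
               remaining     = remaining,
               removed       = removed,
               pct_remaining = round(100 * remaining / length(durations), 1))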

Just for kicks I calculated the average time spent editing records across the datasets that remained for the various cutoffs, to get an idea of how the ceilings changed things.

Duration Ceiling (sec)   Events Included   Events Ignored   Mean (sec)   Stddev (sec)   Sum (sec)     Average Edit Duration   Total Edit Hours
707                      83,239            10,983           140.03       160.31         11,656,340    2:20                    3,238
1,300                    89,547            4,675            196.47       260.44         17,593,387    3:16                    4,887
2,100                    91,916            2,306            233.54       345.48         21,466,240    3:54                    5,963
3,600                    93,378            844              272.44       464.25         25,440,348    4:32                    7,067
431,644 (no ceiling)     94,222            0                363.76       2,311.13       34,274,434    6:04                    9,521

In the table above you can see what the different duration ceilings do to the data analyzed. I calculated the mean of the various datasets and their standard deviations (really the Solr statsComponent did that). I converted those means into minutes and seconds in the “Average Edit Duration” column, and the final column is the number of person-hours that were spent editing metadata in 2014 based on the various datasets.
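As a small worked example of those last two conversions, here is the 2,100 second row redone by hand, with the mean and sum copied from the table above.

    # Convert a mean duration in seconds to minutes:seconds, and a total in
    # seconds to person hours (values from the 2,100 second ceiling row)
    mean_secs <- 233.54
    sum_secs  <- 21466240

    sprintf("%d:%02d", as.integer(mean_secs %/% 60),
                       as.integer(round(mean_secs %% 60)))  # "3:54"
    round(sum_secs / 3600)                                  # 5963 hours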

Going forward I will be using 2,100 seconds as my duration ceiling and ignoring the edit events that took longer than that. Next I’m going to do a little work on figuring out the costs associated with metadata creation in our collections for the last year, so check back for the next post in this series.

As always feel free to contact me via Twitter if you have questions or comments.