Monthly Archives: April 2015

Creator and Use Data for the UNT Scholarly Works Repository

I was asked a question last week about the most “used” item in our UNT Scholarly Works Repository, which led to a discussion of the most “used” creator across that same collection.  I spent a few minutes pulling this data together and thought it would make a good post and a chance to try writing some step-by-step instructions.

Here are the things that I was interested in.

  1. What creator has the most items where they are an author or co-author in the UNT Scholarly Works Repository?
  2. What is the most used item in the repository?
  3. What author has the highest “average item usage”?
  4. How do these lists compare?

To answer these questions there are a number of steps to go through to get to the final data.  This post will walk through each of those steps.

  1. Get a list of the item identifiers in the collection
  2. Grab the stats and metadata for each of the identifiers
  3. Convert metadata and stats into a format that can be processed
  4. Add up uses per item, per author, sort and profit.

So here we go.

Downloading the identifiers

We have a number of APIs for each collection in our digital library.  These are very simple APIs compared to some of those offered by other systems; in many cases our primary API consists of technologies like OAI-PMH, OpenSearch, and simple text lists or JSON files.  Here is the documentation for the APIs available for the UNT Scholarly Works Repository.  For this project the API I’m interested in is the identifiers list.  If you go to the URL http://digital.library.unt.edu/explore/collections/UNTSW/identifiers/ you can get all of the public identifiers for the collection.

Here is the wget command that I use to grab this file and save it as a file called untsw.arks:

[vphill]$ wget http://digital.library.unt.edu/explore/collections/UNTSW/identifiers/ -O untsw.arks

Now that we have this file we can quickly get a count for the total number of items we will be working with by using the wc command.

[vphill]$ wc -l untsw.arks
3731 untsw.arks

We can quickly see that there are 3,731 identifiers in this file.

Next up we want to adjust that arks file a bit to get at just the name part of each ark; locally we call these meta_ids, or ids for short.  I will use the sed command to remove the ark:/67531/ part of each line and save the result as a new file.  Here is that command:

sed "s/ark:/67531///" untsw.arks > untsw.ids

Now we have a file untsw.ids that looks like this:

metadc274983
metadc274993
metadc274992
metadc274991
metadc274998
metadc274984
metadc274980
metadc274999
metadc274985
metadc274995

We will use this file to now grab the metadata and usage stats for each item.

Downloading Stats and Metadata

For this step we will make use of an undocumented API for our system; internally it is called the “resource_object”.  For a given item, say http://digital.library.unt.edu/ark:/67531/metadc274983/, if you append resource_object.json you will get the JSON representation of the resource object we use for all of our templating in the system: http://digital.library.unt.edu/ark:/67531/metadc274983/resource_object.json is the resulting URL.  Depending on the size of the object, this resource object can be quite large because it carries a lot of data.

Two pieces of data that are important to us are the usage stats and the metadata for the item itself.  We will make use of wget again to grab this info, along with a quick loop to help automate the process a bit more.  Before we grab all of these files we want to create a folder called “data” to store the content in.

[vphill]$ mkdir data
[vphill]$ for i in `cat untsw.ids` ; do wget -nc "http://digital.library.unt.edu/ark:/67531/$i/resource_object.json" -O data/$i.json ; done

What this does: first we create a directory called data with the mkdir command.

Next we loop over all of the lines in the untsw.ids file by using the cat command to read the file.  On each iteration of the loop, the variable $i will contain a new meta_id from the file.

On each iteration we use wget to grab the resource_object.json and save it to a file in the data directory, named using the meta_id with .json appended to the end.

I’ve added the -nc (“no clobber”) option to wget so that if you have to restart this step it won’t try to re-download items that have already been downloaded.

This step can take a few minutes depending on the size of the collection you are pulling.  I think it took about 15 minutes for my 3,731 items in the UNT Scholarly Works Repository.
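Before converting anything, it can be worth spot checking one of the downloaded files.  Below is a minimal sketch that opens a single resource object (using the first id from the list above) and prints the values we will pull out in the next step; the key names match what the scripts in the next section expect.

#spot_check.py
import json

# read one of the downloaded resource objects from the data directory
data = json.loads(open("data/metadc274983.json").read())

print(data["meta_id"])                          # metadc274983
print(data["stats"]["total"])                   # total usage count for the item
print(data["desc_MD"]["title"][0]["content"])   # the item's title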

Converting the Data

For this next section I have three bits of code that I use to get at the data inside the JSON files we downloaded into the “data” folder.  I suggest creating a “code” folder (using mkdir again) so that we can place the following Python scripts into it.  The names for these files are as follows: get_creators.py, get_usage.py, and reducer.py.

#get_creators.py

import sys
import json

if len(sys.argv) != 2:
    print "usage: %s <untl resource object json file>" % sys.argv[0]
    exit(-1)

filename = sys.argv[1]
data = json.loads(open(filename).read())

total_usage = data["stats"]["total"]
meta_id = data["meta_id"]

metadata = data["desc_MD"].get("creator", [])

creators = []
for i in metadata:
    creators.append(i["content"]["name"].replace("\t", " "))

for creator in creators:
   out = "t".join([meta_id, creator, str(total_usage)])
   print out.encode('utf-8')

Copy the above text into a file inside your “code” folder called get_creators.py

#get_usage.py
import sys
import json

if len(sys.argv) != 2:
    print "usage: %s <untl resource object json file>" % sys.argv[0]
    exit(-1)

filename = sys.argv[1]
data = json.loads(open(filename).read())

total_usage = data["stats"]["total"]
meta_id = data["meta_id"]
title = data["desc_MD"]["title"][0]["content"].replace("t", " ")

out = "t".join([meta_id, str(total_usage), title])
print out.encode("utf-8")

Copy the above text into a file inside your “code” folder called get_usage.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    # current_word - string containing a word (the key)
    # group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()

Copy the above text into a file inside your “code” folder called reducer.py

Now that we have these three scripts,  I want to loop over all of the JSON files in the data directory and pull out information from them.  First we use the get_usage.py script and redirect the output of that script to a file called usage.txt

[vphill]$ for i in data/*.json ; do python code/get_usage.py "$i" ; done > usage.txt

Here is what that file looks like when you look at the first ten lines.

metadc102275 447 Feeling Animal: Pet-Making and Mastery in the Slave's Friend
metadc102276 48 An Extensible Approach to Interoperability Testing: The Use of Special Diagnostic Records in the Context of Z39.50 and Online Library Catalogs
metadc102277 114 Using Assessment to Guide Strategic Planning
metadc102278 323 This Side of the Border: The Mexican Revolution through the Lens of American Photographer Otis A. Aultman
metadc102279 88 Examining MARC Records as Artifacts That Reflect Metadata Utilization Decisions
metadc102280 155 Genetic Manipulation of a "Vacuolar" H+ -PPase: From Salt Tolerance to Yield Enhancement under Phosphorus-Deficient Soils
metadc102281 82 Assessing Interoperability in the Networked Environment: Standards, Evaluation, and Testbeds in the Context of Z39.50
metadc102282 67 Is It Really That Bad? Verifying the extent of full-text linking problems
metadc102283 133 The Hunting Behavior of Black-Shouldered Kites (Elanus Caeruleus Leucurus) in Central Chile
metadc102284 199 Ecological theory and values in the determination of conservation goals: examples from temperate regions of Germany, United States of America, and Chile

It is a tab-delimited file with three fields: the meta_id, the usage count, and finally the title of the item.

The next thing we want to do is create another list of creators and their usage data.  We do that in a similar way as in the previous step.  The command below should get you where you want to go.

[vphill]$ for i in data/* ; do python code/get_creators.py "$i" ; done > creators.txt

Here is a sample of what this file looks like.

metadc102275 Keralis, Spencer D. C. 447
metadc102276 Moen, William E. 48
metadc102276 Hammer, Sebastian 48
metadc102276 Taylor, Mike 48
metadc102276 Thomale, Jason 48
metadc102276 Yoon, JungWon 48
metadc102277 Avery, Elizabeth Fuseler 114
metadc102278 Carlisle, Tara 323
metadc102279 Moen, William E. 88
metadc102280 Gaxiola, Roberto A. 155

Here again you have a tab-delimited file with the meta_id, the name, and the usage count for that name in that item.  You can see that there are five entries for the item metadc102276 because there were five creators for that item.

Looking at the Data

The final step (and the thing that we’ve been waiting for) is to actually do some work with this data.  This is easy to do with a few standard unix/linux command line tools.  The work below will make use of wc, sort, uniq, cut, and head.

Most used items

The first thing that we can do with the usage.txt file is to see which items were used the most.  With the following command we can get at this data.

[vphill]$ sort -t$'\t' -k 2nr usage.txt | head

We need to sort the usage.txt file by the second column, with the data treated as numeric and in reverse order (largest to smallest).  The sort command above uses the -t option to say that we want to treat the tab character as the field delimiter instead of the default whitespace, and the -k option says to sort on the second column as a number in reverse order (2nr).  We pipe this output to the head program, which takes the first ten results and spits them out.  We should end up with something that looks like the following (formatted as a table for easier reading).

meta_id usage title
metadc30374 5,153 Appendices To: The UP/SP Merger: An Assessment of the Impacts on the State of Texas
metadc29400 5,075 Remote Sensing and GIS for Nonpoint Source Pollution Analysis in the City of Dallas’ Eastern Watersheds
metadc33126 4,691 Research Consent Form: Focus Groups and End User Interviews
metadc86949 3,712 The First World War: American Ideals and Wilsonian Idealism in Foreign Policy
metadc33128 3,512 Summary Report of the Needs Assessment
metadc86874 2,986 Synthesis and Characterization of Nickel and Nickel Hydroxide Nanopowders
metadc86872 2,886 Depression in college students: Perceived stress, loneliness, and self-esteem
metadc122179 2,766 Cross-Cultural Training and Success Versus Failure of Expatriates
metadc36277 2,564 What’s My Leadership Color?
metadc29807 2,489 Bishnoi: An Eco-Theological “New Religious Movement” In The Indian Desert

Creators with the most uses

The next thing we want to do is look at the creators that had the most collective uses across the entire dataset.  For this we use the creators.txt file and grab only the name and usage fields.  We then sort by the name field so that all entries for a given name are grouped together.  We use the reducer.py script to add up the uses for each name (the input must be sorted before this step) and then we pipe the result to the sort program again.  Here is the command.

[vphill]$ cut -f 2,3 creators.txt | sort | python code/reducer.py | sort -t$'\t' -k 2nr | head

Hopefully there are portions of the above command that are recognizable from the previous example (sorting by the second column and head) with some new things thrown in.  Again I’ve converted the output to a table for easier viewing.

Creator Total Aggregated Uses per Creator
Murray, Kathleen R. 24,600
Mihalcea, Rada, 1974- 23,960
Cundari, Thomas R., 1964- 20,903
Phillips, Mark Edward 20,023
Acree, William E. (William Eugene) 18,930
Clower, Terry L. 14,403
Alemneh, Daniel Gelaw 13,069
Weinstein, Bernard L. 13,008
Moen, William E. 12,615
Marshall, James L., 1940- 8,692

Publications Per Creator

Another thing that is helpful is to pull the number of publications per creator, which we can do easily with our creators.txt list.

Here is the command we will want to use.

[vphill]$ cut -f 2 creators.txt | sort | uniq -c | sort -nr | head

This command should be familiar from previous examples; the new piece that I’ve added is uniq with the -c option to count the unique instances of each name.  I then sort on that count in reverse order (highest to lowest) and take the top ten results.

The output will look something like this

 267 Acree, William E. (William Eugene)
 161 Phillips, Mark Edward
 114 Alemneh, Daniel Gelaw
 112 Cundari, Thomas R., 1964-
 108 Mihalcea, Rada, 1974-
 106 Grigolini, Paolo
  90 Falsetta, Vincent
  87 Moen, William E.
  86 Dixon, R. A.
  85 Spear, Shigeko

To keep up with the formatted tables, here are the top ten most prolific creators in the UNT Scholarly Works Repository.

Creators Items
Acree, William E. (William Eugene) 267
Phillips, Mark Edward 161
Alemneh, Daniel Gelaw 114
Cundari, Thomas R., 1964- 112
Mihalcea, Rada, 1974- 108
Grigolini, Paolo 106
Falsetta, Vincent 90
Moen, William E. 87
Dixon, R. A. 86
Spear, Shigeko 85

Average Use Per Item

A bonus exercise is to combine each creator’s total use count with the number of items they have in the repository to calculate their average use per item.  I did that for the top ten creators by overall use, and you can see how that shows some interesting things too (a quick sketch of this calculation appears below the table).

Name Total Aggregate Uses Items Use Per Item Ratio
Murray, Kathleen R. 24,600 65 378
Mihalcea, Rada, 1974- 23,960 108 222
Cundari, Thomas R., 1964- 20,903 112 187
Phillips, Mark Edward 20,023 161 124
Acree, William E. (William Eugene) 18,930 267 71
Clower, Terry L. 14,403 54 267
Alemneh, Daniel Gelaw 13,069 114 115
Weinstein, Bernard L. 13,008 49 265
Moen, William E. 12,615 87 145
Marshall, James L., 1940- 8,692 71 122

It is interesting to see that Murray, Kathleen R. has both the highest aggregate uses and the highest use per item ratio.  Other authors, like Acree, William E. (William Eugene), who have many publications would drop a bit in rank if you ordered by use per item ratio.
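For reference, here is a rough sketch of how this table could be generated straight from the creators.txt file with a short Python script instead of combining the two command line outputs by hand (the file name average_use.py is just a suggestion).

#average_use.py
from collections import defaultdict

uses = defaultdict(int)    # total uses per creator
items = defaultdict(int)   # number of items per creator

# creators.txt is tab delimited: meta_id, creator, usage
for line in open("creators.txt"):
    meta_id, creator, usage = line.rstrip("\n").split("\t")
    uses[creator] += int(usage)
    items[creator] += 1

# sort creators by total aggregate uses, highest first, and print the top ten
ranked = sorted(uses, key=lambda name: uses[name], reverse=True)
for creator in ranked[:10]:
    ratio = uses[creator] / float(items[creator])
    print("%s\t%d\t%d\t%d" % (creator, uses[creator], items[creator], ratio))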

Conclusion

Depending on what side of the fence you sit on, this post either demonstrates remarkable flexibility in the way you can get at data in a system, or it will make you want to tear your hair out because there isn’t a pre-built interface for these reports in the system.  I’m of the camp that the way we’ve done things is a feature and not a bug, but again many will have a different view.

How do you go about getting this data out of your systems?  Is the process much easier,  much harder or just about the same?

As always feel free to contact me via Twitter if you have questions or comments.

Extended Date Time Format (EDTF) use in the DPLA: Part 3, Date Patterns

 

Date Values

I wanted to take a look at the date values that had made their way into the DPLA dataset from the various Hubs.  The first thing I was curious about was how many unique date strings are present in the dataset; it turns out that there are 280,592 unique date strings.

Here are the top ten date strings, their instance counts, and whether each string is a valid EDTF string.

Date Value Instances Valid EDTF
[Date Unavailable] 183,825 FALSE
1939-1939 125,792 FALSE
1960-1990 73,696 FALSE
1900 28,645 TRUE
1935 – 1945 27,143 FALSE
1909 26,172 TRUE
1910 26,106 TRUE
1907 25,321 TRUE
1901 25,084 TRUE
1913 24,966 TRUE

It looks like “[Date Unavailable]” is a value used by the New York Public Library to denote that an item does not have an available date.  It should be noted that NYPL also has 377,664 items in the DPLA that have no date value present at all, so this isn’t a default behavior for items without a date; most likely it is the practice of a single division to denote unknown or missing dates this way.  The value “1939-1939” is used heavily by the University of Southern California. Libraries and seems to come from a single set of WPA Census Cards in their collection.  The value “1960-1990” is used primarily for items from the J. Paul Getty Trust.

Date Length

I was also curious about the length of the dates in the dataset.  I was sure that I would find large numbers of date strings that were four digits in length (1923), ten digits in length (1923-03-04), and other lengths for common, highly used date formats.  I also figured that there would be date strings that were either shorter than four digits or longer than one would expect for a date string.  Here are some example date strings for both.

Top ten date strings shorter than four characters

Date Value Instances
* 968
昭和3 521
昭和2 447
昭和4 439
昭和5 391
昭和9 388
昭和6 382
昭和7 366
大正4 323
昭和8 322

I’m not sure what “*” means for a date value, but the other values seem to be Japanese versions of four digit dates (this is what Google Translate tells me).  There are 14,402 records that have date strings shorter than four characters, with a total of 522 unique date strings present.
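If you want to reproduce this kind of length breakdown, a minimal sketch is below; it assumes the date values have been pulled out into a file with one date string per line (the file name dates.txt is hypothetical).

from collections import Counter

# tally how many date strings there are of each length
lengths = Counter()
for line in open("dates.txt"):
    lengths[len(line.rstrip("\n"))] += 1

for length, count in sorted(lengths.items()):
    print("%s\t%s" % (length, count))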

Top ten date strings longer than fifty characters.

Date Value Instances
Miniature repainted: 12th century AH/AD 18th (Safavid) 35
Some repainting: 13th century AH/AD 19th century (Safavid 25
11th century AH/AD 17th century-13th century AH/AD 19th century (Safavid (?)) 15
1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939 13
10th century AH/AD 16th century-12th century AH/AD 18th century (Ottoman) 10
late 11th century AH/AD 17th century-early 12th century AH/AD 18th century (Ottoman) 8
5th century AH/AD 11th century-6th century AH/AD 12th century (Abbasid) 7
4th quarter 8th century AH/AD 14th century (Mamluk) 5
L’an III de la République française … [1794-1795] 5
Began with 1st rept. (112th Congress, 1st session, published June 24, 2011) 3

There are 1,033 items with 894 unique values that are over fifty characters in length.  The longest “date string” is 193 characters, with a value of “chez W. Innys, J. Brotherton, R. Ware, W. Meadows, T. Meighan, J. & P. Knapton, J. Brindley, J. Clarke, S. Birt, D. Browne, T. Dongman, J. Shuckburgh, C. Hitch, J. Hodges, S. Austen, A. Millar,” which appears to be a misplacement of another field’s data.

Here is the distribution of these items with date strings with fifty characters in length or more.

Hub Name Items with Date Strings 50 Characters or Longer
United States Government Printing Office (GPO) 683
HathiTrust 172
ARTstor 112
Mountain West Digital Library 31
Smithsonian Institution 25
University of Illinois at Urbana-Champaign 3
J. Paul Getty Trust 2
Missouri Hub 2
North Carolina Digital Heritage Center 2
Internet Archive 1

It seems that a large portion of these 50+ character date strings are present in the Government Printing Office records.

Date Patterns

Another way of looking at dates that I experimented with for this project was to convert each date string into what I’m calling a “date pattern”.  For this I take an input string, say “1940-03-22”, and map it to 0000-00-00.  I convert all digits to zero, convert all letters to the letter a, and leave all characters that are not alphanumeric as they are.

Below is the function that I use for this.

def get_date_pattern(date_string):
    pattern = []
    if date_string is None:
        return None
    for c in date_string:
        if c.isalpha():
            pattern.append("a")
        elif c.isdigit():
            pattern.append("0")
        else:
            pattern.append(c)
    return "".join(pattern)

By applying this function to all of the date strings in the dataset I’m able to take a look at what overall date patterns (and also features) are being used throughout the dataset, and ignore the specific values.
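As a concrete example, the tallies below can be produced by running every date string through the function above and counting the resulting patterns.  This sketch assumes get_date_pattern is in scope and, as before, that the date values live in a file with one value per line.

from collections import Counter

# count how many date strings share each pattern
pattern_counts = Counter()
for line in open("dates.txt"):
    pattern_counts[get_date_pattern(line.rstrip("\n"))] += 1

# show the ten most common patterns and their instance counts
for pattern, count in pattern_counts.most_common(10):
    print("%s\t%s" % (pattern, count))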

There are a total of 74 different date patterns for date strings that are valid EDTF.  For the date strings that are not valid EDTF, there are a total of 13,643 different date patterns.  I’ve pulled the top ten date patterns for both valid EDTF and not valid EDTF date strings and presented them below.

Valid EDTF Date Patterns

Valid EDTF Date Pattern Instances Example
0000 2,114,166 2004
0000-00-00 1,062,935 2004-10-23
0000-00 107,560 2004-10
0000/0000 55,965 2004/2010
0000? 13,727 2004?
[0000-00-00..0000-00-00] 4,434 [2000-02-03..2001-03-04]
0000-00/0000-00 4,181 2004-10/2004-12
0000~ 3,794 2003~
0000-00-00/0000-00-00 3,666 2003-04-03/2003-04-05
[0000..0000] 3,009 [1922..2000]

You can see that the basic date formats yyyy, yyyy-mm-dd, and yyyy-mm are very popular in the dataset.  Following those, intervals are used in the format of yyyy/yyyy, as are uncertain dates with yyyy?.

Non-Valid EDTF Date Patterns

Non-Valid EDTF Date Pattern Instances Example
0000-0000 1,117,718 2005-2006
00/00/0000 486,485 03/04/2006
[0000] 196,968 [2006]
[aaaa aaaaaaaaaaa] 183,825 [Date Unavailable]
00 aaa 0000 143,423 22 Jan 2006
0000 – 0000 134,408 2000 – 2005
0000-aaa-00 116,026 2003-Dec-23
0 aaa 0000 62,950 3 Jan 2000
0000] 58,459 1933]
aaa 0000 43,676 Jan 2000

Many of the date strings represented by these patterns could be “cleaned up” by simple transforms if that was of interest.  I would imagine that converting 0000-0000 to 0000/0000 would be a fairly lossless transform that would suddenly make over a million items valid EDTF.  Converting the format 00/00/0000 to 0000-00-00 is also a straightforward transform if you know whether 00/00 is mm/dd (US) or dd/mm (non-US).  Removing the brackets around four digit years [0000] seems to be another easy fix that would convert a large number of dates.  Of the top ten non-valid EDTF date patterns, it might be possible to convert nine of them into valid EDTF date strings with simple transformations.  This would give the DPLA 2,360,113 additional dates that are valid EDTF date strings.  The values for the date pattern [aaaa aaaaaaaaaaa], with a date string value of [Date Unavailable], might benefit from being removed from the dataset altogether in order to reduce some of the noise in the field.
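Here is a quick sketch of what a few of those transforms might look like.  This is purely illustrative and was not run against the DPLA data; a real cleanup pass would want to review each rule, especially the mm/dd versus dd/mm question.

import re

def normalize_date(date_string):
    # 2005-2006 -> 2005/2006 (hyphenated range to EDTF interval)
    if re.match(r"^\d{4}-\d{4}$", date_string):
        return date_string.replace("-", "/")
    # [2006] -> 2006 (strip the brackets around a four digit year)
    if re.match(r"^\[\d{4}\]$", date_string):
        return date_string.strip("[]")
    # 03/04/2006 -> 2006-03-04 (only safe if you know the values are mm/dd/yyyy)
    match = re.match(r"^(\d{2})/(\d{2})/(\d{4})$", date_string)
    if match:
        return "%s-%s-%s" % (match.group(3), match.group(1), match.group(2))
    # anything else is left alone
    return date_string

print(normalize_date("2005-2006"))   # 2005/2006
print(normalize_date("[2006]"))      # 2006
print(normalize_date("03/04/2006"))  # 2006-03-04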

Common Patterns Per Hub

One last thing that I wanted to do was to see if there are any commonalities between the Hubs when you look at their most frequently used date patterns.  Below I’ve created tables for both valid EDTF date patterns and non-valid EDTF date patterns.

Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000 0000-00 0000? 0000/0000 0000-00-00
Biodiversity Heritage Library 0000 -0000 0000/0000 0000-00 0000?
David Rumsey 0000
Digital Commonwealth 0000-00-00 0000-00 0000 0000-00-00a00:00:00a
Digital Library of Georgia 0000-00-00 0000-00 0000/0000 0000 0000-00-00/0000-00-00
Harvard Library 0000 00aa 000a aaaa
HathiTrust 0000 0000-00 0000? -0000 00aa
Internet Archive 0000 0000-00-00 0000-00 0000? 0000/0000
J. Paul Getty Trust 0000 0000?
Kentucky Digital Library 0000
Minnesota Digital Library 0000 0000-00-00 0000? 0000-00 0000-00-00?
Missouri Hub 0000-00-00 0000 0000-00 0000/0000 0000?
Mountain West Digital Library 0000-00-00 0000 0000-00 0000? 0000-00-00a00:00:00a
National Archives and Records Administration 0000 0000?
North Carolina Digital Heritage Center 0000-00-00 0000 0000-00 0000/0000 0000?
Smithsonian Institution 0000 0000? 0000-00-00 0000-00 00aa
South Carolina Digital Library 0000-00-00 0000 0000-00 0000?
The New York Public Library 0000-00-00 0000-00 0000 -0000 0000-00-00/0000-00-00
The Portal to Texas History 0000-00-00 0000 0000-00 [0000-00-00..0000-00-00] 0000~
United States Government Printing Office (GPO) 0000 0000? aaaa -0000 [0000, 0000]
University of Illinois at Urbana-Champaign 0000 0000-00-00 0000? 0000-00
University of Southern California. Libraries 0000-00-00 0000/0000 0000 0000-00 0000-00/0000-00
University of Virginia Library 0000-00-00 0000 0000-00 0000? 0000?-00

I tried to color code the five most common EDTF date patterns from above in the following image.

Color-coded date patterns per Hub.

I’m not sure whether that makes it clear where the common date patterns fall.

Non Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000-0000 aa. 0000 aaaaaaa 0000a aa. 0000-0000
Biodiversity Heritage Library 0000-0000 0000 – 0000 0000- 0000-00 [0000-0000]
David Rumsey
Digital Commonwealth 0000-0000 aaaaaaa 0000-00-00-0000-00-00 0000-00-0000-00 0000-0-00
Digital Library of Georgia 0000-0000 0000-00-00 0000-00- 00 aaaaa 0000 0000a
Harvard Library 0000a-0000a a. 0000 0000a 0000-0000 0000 – a. 0000
HathiTrust [0000] 0000-0000 0000] [a0000] a0000
Internet Archive 0000-0000 0000-00 0000- [0—] [0000]
J. Paul Getty Trust 0000-0000 a. 0000-0000 a. 0000 [000-] [aa. 0000]
Kentucky Digital Library
Minnesota Digital Library 0000 – 0000 0000-00 – 0000-00 0000-0000 0000-00-00 – 0000-00-00 0000 – 0000?
Missouri Hub a0000 0000-00-00 aaaaaaaa 00, 0000 aaaaaaa 00, 0000 aaaaaaaa 0, 0000
Mountain West Digital Library 0000-0000 aa. 0000-0000 aa. 0000 0000? – 0000? 0000 aa
National Archives and Records Administration 00/00/0000 00/0000 a’aa. 0000′-a’aa. 0000′ a’00/0000′-a’00/0000′ a’00/00/0000′-a’00/00/0000′
North Carolina Digital Heritage Center 0000-0000 00000000 00000000-00000000 aa. 0000-0000 aa. 0000
Smithsonian Institution 0000-0000 00 aaa 0000 0000-aaa-00 0 aaa 0000 aaa 0000
South Carolina Digital Library 0000-0000 0000 – 0000 0000- 0000-00-00 0000-0-00
The New York Public Library 0000-0000 [aaaa aaaaaaaaaaa] 0000 – 0000 0000-00-00 – 0000-00-00 0000-
The Portal to Texas History a. 0000 [0000] 0000 – 0000 [aaaaaaa 0000 aaa 0000] a.0000 – 0000
United States Government Printing Office (GPO) [0000] 0000-0000 [0000?] aaaaa aaaa 0000 00aa-0000
University of Illinois at Urbana-Champaign 0-00-00 a. 0000 00/00/00 0-0-00 00-00-00
University of Southern California. Libraries 0000-0000 aaaaa 0000/0000 aaaaa 0000-00-00/0000-00-00 0000a aaaaa 0000-0000
University of Virginia Library aaaaaaa aaaa a0000 aaaaaaa 0000 aaa 0000? aaaaaaa 0000 aaa 0000 00–?

With the non-valid EDTF Date Patterns you can see where some of the date patterns are much more common across the various Hubs than others.

I hope you have found these posts interesting.  If you’ve worked with metadata, especially aggregated metadata, you will no doubt recognize much of this from your own datasets.  If you are new to this area or haven’t really worked with the wide range of date values that you can come in contact with in large metadata collections, have no fear: it is getting better.  The EDTF is a very good specification for cultural heritage institutions to adopt for their digital collections.  It helps to provide both a machine and human readable format for encoding and notating the complex dates we have to work with in our field.

If there is another field that you would like me to take a look at in the DPLA dataset,  please let me know.

As always feel free to contact me via Twitter if you have questions or comments.

 

Extended Date Time Format (EDTF) use in the DPLA: Part 2, EDTF use by Hub

This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America.  For more background on this topic take a look at the first post in this series.

EDTF Use by Hub

In the previous post I looked at the overall usage of the EDTF in the DPLA and found that it didn’t matter that much if a Hub was classified as a Content Hub or a Service Hub when it came to looking at the availability of dates in item records in the system.  Content Hubs had 83% of their records with some date value and Service Hubs had 81% of their records with date values.

Looking overall at the dates that were present,  there were 51% that were valid EDTF strings and 49% that were not valid EDTF strings.

One of the things that should be noted is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF.  For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.

Hub Name Items With Date % of total items with date present Valid EDTF Valid EDTF % Not Valid EDTF Not Valid EDTF %
ARTstor 49,908 88.6% 26,757 53.6% 23,151 46.4%
Biodiversity Heritage Library 29,000 21.0% 22,734 78.4% 6,266 21.6%
David Rumsey 48,132 100.0% 48,132 100.0% 0 0.0%
Digital Commonwealth 118,672 95.1% 14,731 12.4% 103,941 87.6%
Digital Library of Georgia 236,961 91.3% 188,263 79.4% 48,687 20.5%
Harvard Library 6,957 65.8% 6,910 99.3% 47 0.7%
HathiTrust 1,881,588 98.2% 1,295,986 68.9% 585,598 31.1%
Internet Archive 194,454 93.1% 185,328 95.3% 9,126 4.7%
J. Paul Getty Trust 92,494 99.8% 6,319 6.8% 86,175 93.2%
Kentucky Digital Library 87,061 68.1% 87,061 100.0% 0 0.0%
Minnesota Digital Library 39,708 98.0% 33,201 83.6% 6,507 16.4%
Missouri Hub 34,742 83.6% 32,192 92.7% 2,550 7.3%
Mountain West Digital Library 634,571 73.1% 545,663 86.0% 88,908 14.0%
National Archives and Records Administration 553,348 78.9% 10,218 1.8% 543,130 98.2%
North Carolina Digital Heritage Center 214,134 82.1% 163,030 76.1% 51,104 23.9%
Smithsonian Institution 675,648 75.3% 44,860 6.6% 630,788 93.4%
South Carolina Digital Library 52,328 68.9% 42,128 80.5% 10,200 19.5%
The New York Public Library 791,912 67.7% 47,257 6.0% 744,655 94.0%
The Portal to Texas History 424,342 88.8% 416,835 98.2% 7,505 1.8%
United States Government Printing Office (GPO) 148,548 99.9% 17,894 12.0% 130,654 88.0%
University of Illinois at Urbana-Champaign 14,273 78.8% 11,304 79.2% 2,969 20.8%
University of Southern California. Libraries 269,880 89.6% 114,293 42.3% 155,573 57.6%
University of Virginia Library 26,072 86.4% 21,798 83.6% 4,274 16.4%

Turning this into a graph helps things show up a bit better.

EDTF info for each of the DPLA Hubs

There are a number of things that can be teased out of here.  First, there are a few Hubs that have 100% or nearly 100% of their dates conforming to EDTF already, notably the David Rumsey Hub and the Kentucky Digital Library, both at 100%.  Harvard at 99% and the Portal to Texas History at 98% are also notable.  On the other end of the spectrum we have the National Archives and Records Administration with 98% of their dates being not valid, the New York Public Library with 94%, and the J. Paul Getty Trust at 93%.

Use of EDTF Level Features

The EDTF has the notion of feature levels: Level 0, Level 1, and Level 2.  Level 0 covers the basic date features such as date, date and time, and intervals.  Level 1 adds features like uncertain/approximate dates, unspecified digits, extended intervals, years exceeding four digits, and seasons.  Level 2 adds partial uncertain/approximate dates, partial unspecified dates, sets, multiple dates, masked precision, and extensions of the extended intervals and of years exceeding four digits.  Finally, Level 2 lets you qualify seasons.  For a full list of the features please take a look at the draft specification at the Library of Congress.

When I was preparing the dataset I also tested the dates to see which feature level they matched.  After starting the analysis I noticed a few bugs in my testing code and added them as issues on the GitHub site for the ExtendedDateTimeFormat Python module available here.  Even with the bugs, which falsely identified one feature as both a Level 0 and Level 1 feature and another feature as both Level 1 and Level 2, I was able to come up with usable data for further analysis.  Because of these bugs there are a few Hubs in the list below that differ slightly in the number of valid EDTF items from the numbers presented earlier in this post.

Hub Name valid EDTF items valid-level0 % Level0 valid-level1 % Level1 valid-level2 % Level2
ARTstor 26,757 26,726 99.9% 31 0.1% 0 0.0%
Biodiversity Heritage Library 22,734 22,702 99.9% 32 0.1% 0 0.0%
David Rumsey 48,132 48,132 100.0% 0 0.0% 0 0.0%
Digital Commonwealth 14,731 14,731 100.0% 0 0.0% 0 0.0%
Digital Library of Georgia 188,274 188,274 100.0% 0 0.0% 0 0.0%
Harvard Library 6,910 6,822 98.7% 83 1.2% 5 0.1%
HathiTrust 1,295,990 1,292,079 99.7% 3,662 0.3% 249 0.0%
Internet Archive 185,328 185,115 99.9% 212 0.1% 1 0.0%
J. Paul Getty Trust 6,319 6,308 99.8% 11 0.2% 0 0.0%
Kentucky Digital Library 87,061 87,061 100.0% 0 0.0% 0 0.0%
Minnesota Digital Library 33,201 26,055 78.5% 7,146 21.5% 0 0.0%
Missouri Hub 32,192 32,190 100.0% 2 0.0% 0 0.0%
Mountain West Digital Library 545,663 542,388 99.4% 3,274 0.6% 1 0.0%
National Archives and Records Administration 10,218 10,003 97.9% 215 2.1% 0 0.0%
North Carolina Digital Heritage Center 163,030 162,958 100.0% 72 0.0% 0 0.0%
Smithsonian Institution 44,860 44,642 99.5% 218 0.5% 0 0.0%
South Carolina Digital Library 42,128 42,079 99.9% 49 0.1% 0 0.0%
The New York Public Library 47,257 47,251 100.0% 6 0.0% 0 0.0%
The Portal to Texas History 416,838 402,845 96.6% 6,302 1.5% 7,691 1.8%
United States Government Printing Office (GPO) 17,894 16,165 90.3% 875 4.9% 854 4.8%
University of Illinois at Urbana-Champaign 11,304 11,275 99.7% 29 0.3% 0 0.0%
University of Southern California. Libraries 114,307 114,307 100.0% 0 0.0% 0 0.0%
University of Virginia Library 21,798 21,558 98.9% 236 1.1% 4 0.0%

Looking at the top 25% of the data,  you get the following.

EDTF Level Use by Hub

Obviously the majority of dates in the DPLA that are valid EDTF comply with Level 0, which includes standard dates like years (1900), year and month (1900-03), year, month, and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals of any of those dates (yyyy, yyyy-mm, yyyy-mm-dd) in the format of 2004-02/2014-03-23.

There are a number of Hubs that are making use of Level 1 and Level 2 features, with the most notable being the Minnesota Digital Library, which makes use of Level 1 features in 21.5% of their item records.  The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of their total) and GPO in 854 of their item records (4.8%).

I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.

Feel free to contact me via Twitter if you have questions or comments.

Extended Date Time Format (EDTF) use in the DPLA: Part 1

I’ve got a new series of posts that I’ve been wanting to do for a while now to try to get a better understanding of the use of the Extended Date Time Format (EDTF) in cultural heritage organizations, and more specifically whether those date formats are making their way into the Digital Public Library of America (DPLA).  Before I get started with the analysis of the over eight million records in the DPLA, I wanted to give a little background on the EDTF format itself and some of the work that has happened in this area in the past.

A Bitter Harvest

One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library specific to metadata harvesting.  One text that I remember in particular was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers,” which talked about many of the issues they ran into in trying to deal with dates from a variety of service providers.  This same challenge was also echoed by projects like the National Science Digital Library (NSDL), OAIster, and other aggregations of metadata over the years.

One thing that came out of many of these aggregation projects,  and something that many of us are dealing with today is the fact that “dates are hard”.

Extended Date Time Format

A few years ago an initiative was started to address some of the challenges that we have in representing dates for the types of things that we work with in cultural heritage organizations. This initiative was named the Extended Date Time Format (EDTF) which has a site at the Library of Congress and which resulted in a draft specification for a profile or extension to ISO 8601.

An example of what this documents is how to represent some of the following date concepts in a machine-readable way.

Commonly Used Dates

 Date Feature Example Item Format Example Date
Year Book with publication year YYYY 1902
Month Monthly journal issue YYYY-MM 1893-05
Day Letter YYYY-MM-DD 1924-03-03
Time Born-digital photo YYYY-MM-DDTHH:MM:SS 2003-12-27T11:09:08
Interval Compiled court documents YYYY/YYYY 1887/1889
Season Seasonal magazine issue YYYY-SS 1957-23
Decade WWII poster YYYu 194u
Approximate Map “circa 1886” YYYY~ 1886~

Some Complex Dates

Example Item Kind of Date Format Example Date
Photo taken at some point during an event August 6-9, 1992 One of a Set [YYYY..YYYY] [1992-08-06..1992-08-09]
Hand-carved object, “circa 1870s” Extended Interval (L1) YYYY~/YYYY~ 1870~/1879~
Envelope with a partially-legible postmark Unspecified “u” in place of digit(s) 18uu-08-1u
Map possibly created in 1607 or 1630 One of a Set, Uncertain [YYYY, YYYY] [1607?, 1630?]

The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).

Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time.  As a followup to that presentation Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.

EDTF use in the DPLA

When the DPLA introduced its Metadata Application Profile (MAP) I noticed that there was mention of the EDTF as one of the ways that dates could be expressed.  In the 3.1 profile it was mentioned both in the dpla:SourceResource.date property syntax schema and in the edm:TimeSpan class for all of its properties.  In the 4.0 profile it changed a bit, with the EDTF removed from the dpla:SourceResource.date property as a syntax schema and from the edm:TimeSpan “Original Source Date” property, but kept for the edm:TimeSpan “Begin” and “End” properties.

Because of this mention, and the knowledge that the Portal to Texas History, which is a Service Hub, is contributing records with EDTF dates, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.

  • How many date values in the DPLA are valid EDTF values?
  • How are these valid EDTF values distributed across the Hubs?
  • What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
  • What are the most common date format patterns used in the DPLA?

With these questions in mind I started the analysis.

Preparing the Dataset

I used the same dataset that I had created for some previous work that consisted of a data download from the DPLA of their Bulk Metadata Download in February 2015. This download contained 8,xxx,xxx records that I used for the analysis.

I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on GitHub) to classify each date present in each record as either valid EDTF or not valid.  Additionally I tested which level of EDTF each value conformed to.  Finally I identified the pattern of each of the date fields by converting all digits in the string to 0, converting all alpha characters to a, and leaving all non-alphanumeric characters as they are.

This resulted in the following fields being indexed for each date

Field Value
date 2014-04-04
date_valid_edtf true
date_level0_feature true
date_level1_feature false
date_level2_feature false
date_pattern 0000-00-00
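A rough sketch of how each of these fields can be produced for a single date value is below.  The import path and function names for the validation module are my best guess at its interface and should be treated as assumptions (check the GitHub repository for the actual names); the date_pattern line inlines the digits-to-0, letters-to-a conversion described above.

# The imported names below are assumptions about the validation module's
# interface -- check the repository for the actual module and function names.
from edtf_validate.valid_edtf import is_valid, isLevel0, isLevel1, isLevel2

def index_fields(date_string):
    return {
        "date": date_string,
        "date_valid_edtf": is_valid(date_string),
        "date_level0_feature": isLevel0(date_string),
        "date_level1_feature": isLevel1(date_string),
        "date_level2_feature": isLevel2(date_string),
        # digits to 0, letters to a, everything else left alone
        "date_pattern": "".join(
            "0" if c.isdigit() else ("a" if c.isalpha() else c)
            for c in date_string),
    }

print(index_fields("2014-04-04"))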

For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.

After several hours of indexing these values into Solr,  I was able to start answering some of the questions mentioned above.

Date usage in the DPLA

The first thing that I looked at was how many of the records in the DPLA dataset had dates vs the records that were missing dates.  Of the 8,012,390 items in my copy of the DPLA dataset,  6,624,767 (83%) had a date value with 1,387,623 (17%) missing any date information.

I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.

Hub Name Items Items With Date Items With Date % Items Missing Date Items Missing Date %
ARTstor 56,342 49,908 88.6% 6,434 11.4%
Biodiversity Heritage Library 138,288 29,000 21.0% 109,288 79.0%
David Rumsey 48,132 48,132 100.0% 0 0.0%
Digital Commonwealth 124,804 118,672 95.1% 6,132 4.9%
Digital Library of Georgia 259,640 236,961 91.3% 22,679 8.7%
Harvard Library 10,568 6,957 65.8% 3,611 34.2%
HathiTrust 1,915,159 1,881,588 98.2% 33,571 1.8%
Internet Archive 208,953 194,454 93.1% 14,499 6.9%
J. Paul Getty Trust 92,681 92,494 99.8% 187 0.2%
Kentucky Digital Library 127,755 87,061 68.1% 40,694 31.9%
Minnesota Digital Library 40,533 39,708 98.0% 825 2.0%
Missouri Hub 41,557 34,742 83.6% 6,815 16.4%
Mountain West Digital Library 867,538 634,571 73.1% 232,967 26.9%
National Archives and Records Administration 700,952 553,348 78.9% 147,604 21.1%
North Carolina Digital Heritage Center 260,709 214,134 82.1% 46,575 17.9%
Smithsonian Institution 897,196 675,648 75.3% 221,548 24.7%
South Carolina Digital Library 76,001 52,328 68.9% 23,673 31.1%
The New York Public Library 1,169,576 791,912 67.7% 377,664 32.3%
The Portal to Texas History 477,639 424,342 88.8% 53,297 11.2%
United States Government Printing Office (GPO) 148,715 148,548 99.9% 167 0.1%
University of Illinois at Urbana-Champaign 18,103 14,273 78.8% 3,830 21.2%
University of Southern California. Libraries 301,325 269,880 89.6% 31,445 10.4%
University of Virginia Library 30,188 26,072 86.4% 4,116 13.6%
Presence of Dates by Hub Name

I was surprised by the high percentage of records with dates for many of the Hubs in the DPLA; the only Hub that had more records without dates than with dates was the Biodiversity Heritage Library.  There were some Hubs, notably David Rumsey, HathiTrust, the J. Paul Getty Trust, and the Government Printing Office, that have dates for more than 98% of their items in the DPLA.  This is most likely because of the kinds of data they are providing, or the fact that dates are required to identify which items can be shared (HathiTrust).

When you look at Content-Hubs vs Service-Hubs you see the following.

Hub Type Items Items With Date Items With Date % Items Missing Date Items Missing Date %
Content-Hub 5,736,178 4,782,214 83.4% 953,964 16.6%
Service-Hub 2,276,176 1,842,519 80.9% 433,657 19.1%

It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.

Valid EDTF Dates

I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid based on the EDTF specification.  It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.

EDTF Valid vs Not Valid

So the split is pretty close.

One of the things that should be mentioned is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF.  For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and also at some of the EDTF features used by Hubs in the DPLA.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 6 – Average Edit Duration by Facet

This is the sixth post in a series of posts related to metadata edit events for the UNT Libraries’ Digital Collections from January 1, 2014 to December 31, 2014.  If you are interested in the previous posts in this series,  they talked about the when, what, who, duration based on time buckets and finally calculating the average edit event time.

In the previous post I was able to come up with what I’m using as the edit event duration ceiling for the rest of this analysis.  This means that the rest of the analysis in this post will ignore the events that took longer than 2,100 seconds.  That leaves us with 91,916 valid events to analyze (97.6% of the original dataset) after removing the 2,306 events with a duration of over 2,100 seconds.

Editors

The table below shows the stats for our top ten editors once I’ve ignored events over 2,100 seconds.

username min max edit events duration sum mean stddev
htarver 2 2,083 15,346 1,550,926 101.06 132.59
aseitsinger 3 2,100 9,750 3,920,789 402.13 437.38
twarner 5 2,068 4,627 184,784 39.94 107.54
mjohnston 3 1,909 4,143 562,789 135.84 119.14
atraxinger 3 2,099 3,833 1,192,911 311.22 323.02
sfisher 5 2,084 3,434 468,951 136.56 241.99
cwilliams 4 2,095 3,254 851,369 261.64 340.47
thuang 4 2,099 3,010 770,836 256.09 397.57
mphillips 3 888 2,669 57,043 21.37 41.32
sdillard 3 2,052 2,516 1,599,329 635.66 388.3

You can see that many of these users have very short edit times for their lowest edits and all but one have edit times for the maximum that approach the duration ceiling.  The average amount of time spent per edit event ranges from 21 seconds to 10 minutes and 35 seconds.

I know that for user mphillips (me) the bulk of the work I tend to do in the edit system is fixing quick mistakes like missing language codes, editing dates that aren’t in Extended Date Time Format (EDTF), or hiding and un-hiding records.  Other users such as sdillard have been working exclusively on a project to create metadata for a collection of Texas Patents that we are describing in the Portal.
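In my case these per-editor numbers came out of Solr’s statsComponent, but the same figures could be computed from the raw events with a little Python.  The sketch below assumes the edit events are available as (username, duration_seconds) pairs.

from collections import defaultdict

def stats_by_user(events, ceiling=2100):
    # group durations by username, ignoring anything over the duration ceiling
    durations = defaultdict(list)
    for username, duration in events:
        if duration <= ceiling:
            durations[username].append(duration)

    table = {}
    for username, values in durations.items():
        mean = sum(values) / float(len(values))
        stddev = (sum((v - mean) ** 2 for v in values) / float(len(values))) ** 0.5
        table[username] = {
            "min": min(values),
            "max": max(values),
            "edit events": len(values),
            "duration sum": sum(values),
            "mean": mean,
            "stddev": stddev,
        }
    return table

# example: stats_by_user([("htarver", 45), ("htarver", 120), ("mphillips", 15)])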

Collections

The most heavily edited collections and their statistics are presented below.

Collection Code Collection Name min max edit events duration sum mean stddev
ABCM Abilene Library Consortium 2 2,083 8,418 1,358,606 161.39 240.36
JBPC Jim Bell Texas Architecture Photograph Collection 3 2,100 5,335 2,576,696 482.98 460.03
JJHP John J. Herrera Papers 3 2,095 4,940 1,358,375 274.97 346.46
ODNP Oklahoma Digital Newspaper Program 5 2,084 3,946 563,769 142.87 243.83
OKPCP Oklahoma Publishing Company Photography Collection 4 2,098 5,692 869,276 152.72 280.99
TCO Texas Cultures Online 3 2,095 5,221 1,406,347 269.36 343.87
TDNP Texas Digital Newspaper Program 2 1,989 7,614 1,036,850 136.18 185.41
TLRA Texas Laws and Resolutions Archive 3 2,097 8,600 1,050,034 122.1 172.78
TXPT Texas Patents 2 2,099 6,869 3,740,287 544.52 466.05
TXSAOR Texas State Auditor’s Office: Reports 3 1,814 2,724 428,628 157.35 142.94
UNTETD UNT Theses and Dissertations 5 2,098 4,708 1,603,857 340.67 474.53
UNTPC University Photography Collection 3 2,096 4,408 1,252,947 284.24 340.36

This data is a little easier to see with a graph.

Average edit duration per collection

Here is my interpretation of what I see in these numbers based on personal knowledge of these collections.

The collections with the highest average duration are the TXPT and JBPC collections; these are followed by the UNTETD, UNTPC, TCO, and JJHP collections.  The first two, Texas Patents (TXPT) and the Jim Bell Texas Architecture Photograph Collection (JBPC), are examples of collections that were having metadata records created for the first time via our online editing system.  These collections generally required more investigation (either by reading the patent or researching the photograph) and therefore took more time on average to create the records.

Two of the others, the UNT Theses and Dissertations Collection (UNTETD) and the University Photography Collection (UNTPC), involved an amount of copy cataloging for the creation of the metadata, either from existing MARC records or from local finding aids.  The John J. Herrera Papers (JJHP) involved, I believe, working with an existing finding aid, and I know that there was a two-step process of creating the record and then publishing it as un-hidden in a separate event, which lowers the average time considerably.  I don’t know enough about the Texas Cultures Online (TCO) work in 2014 to be able to comment there.

On the other end of the spectrum you have collections like ABCM, ODNP, OKPCP, and TDNP, projects that averaged a much shorter amount of time per record.  For these there were many small edits to the records that were typically completed one field at a time.  For some of these it might have just involved fixing a consistent typo, adding the record to a collection, or hiding or un-hiding it from public view.

This raises a question for me: is it possible to detect the “kind” of edits that are being made based on their average edit times?  That’s something to look at.

Partner Institutions

And now the ten partner institutions that had the most metadata edit events.

Partner Code Partner Name min max edit events duration sum mean stddev
UNTGD UNT Libraries Government Documents Department 2 2,099 21,342 5,385,000 252.32 356.43
OKHS Oklahoma Historical Society 4 2,098 10,167 1,590,498 156.44 279.95
UNTA UNT Libraries Special Collections 3 2,099 9,235 2,664,036 288.47 362.34
UNT UNT Libraries 2 2,098 6,755 2,051,851 303.75 458.03
PCJB Private Collection of Jim Bell 3 2,100 5,335 2,576,696 482.98 460.03
HMRC Houston Metropolitan Research Center at Houston Public Library 3 2,095 5,127 1,397,368 272.55 345.62
HPUL Howard Payne University Library 2 1,860 4,528 544,420 120.23 113.97
UNTCVA UNT College of Visual Arts + Design 4 2,098 4,169 1,015,882 243.68 364.92
HSUL Hardin-Simmons University Library 3 2,020 2,706 658,600 243.39 361.66
HIGPL Higgins Public Library 2 1,596 1,935 131,867 68.15 118.5

Again presented as a simple chart.

Average edit duration per partner.

It is easy to see the difference between the Private Collection of Jim Bell (PCJB), with an average of 482 seconds or roughly 8 minutes per edit, and the Higgins Public Library (HIGPL), which had an average of 68 seconds, or just over one minute.  In the first case, with the Private Collection of Jim Bell (PCJB), we were actively creating records for the first time for these items, and the average of eight minutes seems to track with what one would imagine it takes to create a metadata record for a photograph.  The Higgins Public Library (HIGPL) collection is a newspaper collection that had a single change in the physical description made to all of the items in that partner’s collection.  Other partners fall between these two extremes and have similar characteristics, with the lower edit averages happening for a partner’s content that is either being edited in a small way or being hidden or un-hidden from view.

Resource Type

The final way we will slice the data for this post is by looking at the stats for the top ten resource types.

resource type min max count sum mean stddev
image_photo 2 2,100 30,954 7,840,071 253.28 356.43
text_newspaper 2 2,084 11,546 1,600,474 138.62 207.3
text_leg 3 2,097 8,604 1,050,103 122.05 172.75
text_patent 2 2,099 6,955 3,747,631 538.84 466.25
physical-object 2 2,098 5,479 1,102,678 201.26 326.21
text_etd 5 2,098 4,713 1,603,938 340.32 474.4
text 3 2,099 4,196 1,086,765 259 349.67
text_letter 4 2,095 4,106 1,118,568 272.42 326.09
image_map 3 2,034 3,480 673,707 193.59 354.19
text_report 3 1,814 3,339 465,168 139.31 145.96
Average edit duration for the top ten resource types

The resource type that really stands out in this graph is text_patent, at 538 seconds per record.  These items belong to the Texas Patents collection; they were loaded into the system with very minimal records and we have been working to add new metadata to these resources.  The roughly nine minutes per record seems to be very standard for the amount of work that is being done with these records.

The text_leg collection is one that I wanted to take another quick look at.

If we calculate the statistics for the users that edited records in this collection we get the following data.

username min max count sum mean stddev
bmonterroso 3 1,825 890 85,254 95.79 163.25
htarver 9 23 5 82 16.4 5.64
mjohnston 3 1,909 3,309 329,585 99.6 62.08
mphillips 5 33 30 485 16.17 7.68
rsittel 3 1,436 654 22,168 33.9 88.71
tharden 3 2,097 1,143 213,817 187.07 241.2
thuang 4 1,812 2,573 398,712 154.96 227.7

Again you really see it with the graph.

Average edit duration for users who edited records that were the text_leg resource type

In this you see that there were a few users (htarver, mphillips, rsittel) who brought down the average duration because they had very quick edits, while the rest of the editors averaged either right around 100 seconds per edit or around two minutes per edit.

I think that there is more to do with these numbers; calculating the total edit duration for a given metadata record as edits are performed on it will be something of interest for a later post.  So check back for the next post in this series.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 5 – Identifying an average metadata editing time.

This is the fifth post in a series of posts related to metadata edit events for the UNT Libraries’ Digital Collections from January 1, 2014 to December 31, 2014.  If you are interested in the previous posts in this series,  they talked about the when, what, who, and first steps of duration.

In this post we are going to try to come up with the “average” amount of time spent on metadata edits in the dataset.

The first thing I wanted to do was to figure out which of the values mentioned in the previous post about duration buckets I could ignore as noise in the dataset.

As a reminder, the duration for a metadata edit event starts when a user opens a metadata record in the edit system and ends when they submit the record back to the system as a publish event.  The duration is the difference in seconds between those two timestamps.

There are a number of factors that can cause the duration data to vary wildly: a user can have a number of tabs open at the same time while only working in one of them, they may open a record and then walk off without editing it, or they could be using a browser automation tool like Selenium that automates the metadata edits and therefore pushes the edit time down considerably.

In doing some tests of my own editing skills it isn’t unreasonable to have edits that are four or five seconds in duration if you are going in to change a known value from a simple dropdown. For example adding a language code to a photograph that you know should be “no-language” doesn’t take much time at all.

My gut feeling based on the data in the previous post was that edits with a duration of over one hour should be considered outliers.  This would remove 844 events from the total of 94,222 edit events, leaving me with 93,378 (99%) of the events.  This seemed like a logical first step, but I was curious if there were other ways of approaching it.

I had a chat with the UNT Libraries’ Director of Research & Assessment Jesse Hamner and he suggested a few methods for me to look at.

IQR for calculating outliers

I took a stab at using the Interquartile Range of the dataset as the basis for identifying the outliers.  With a little bit of R I was able to find the following information about the duration dataset.

 Min.   :     2.0  
 1st Qu.:    29.0  
 Median :    97.0  
 Mean   :   363.8  
 3rd Qu.:   300.0  
 Max.   :431644.0  

With that I have a Q1 of 29 and a Q3 of 300, which gives me an IQR of 271.

So the range for outliers is Q1 - 1.5 × IQR on the low end and Q3 + 1.5 × IQR on the high end.

With those numbers, values under -377.5 or over 706.5 should be considered outliers.
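For anyone who would rather stay in Python than use R, here is a quick sketch of the same calculation; it assumes the durations have been written out to a file with one duration in seconds per line.

import numpy

# one edit event duration (in seconds) per line
durations = [int(line) for line in open("durations.txt")]

q1, q3 = numpy.percentile(durations, [25, 75])
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr    # anything below this is an outlier
high_fence = q3 + 1.5 * iqr   # anything above this is an outlier

print("Q1=%s Q3=%s IQR=%s fences=(%s, %s)" % (q1, q3, iqr, low_fence, high_fence))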

Note: I’m pretty sure there are some different ways of dealing with the IQR for datasets that are bounded at zero, so that’s something to investigate.

For me the key here is that I’ve come up with 706.5 seconds as the ceiling for a valid event duration based on this method.  That’s 11 minutes and 47 seconds.  If I limit the dataset to edit events that are under 707 seconds I am left with 83,239 records.  That is just 88% of the dataset, with 12% considered outliers.  This seemed like too many records to ignore, so after talking with my resident expert in the library I had a new method.

Two Standard Deviations

I took a look at what the timings would look like if I based my outliers on standard deviations.  Edit events that are under 1,300 seconds (21 min 40 sec) in duration amount to 89,547, which is 95% of the values in the dataset.  I also wanted to see what cutting only 2.5% of the dataset would look like: edit durations under 2,100 seconds (35 minutes) result in 91,916 usable edit events for calculations, which is right at 97.6%.

Comparing the methods

The following table takes the four duration ceilings that I tried (IQR, 95%, 97.5%, and the gut-feeling one hour) and makes them a bit more readable.  The total number of duration events in the dataset before limiting is 94,222.

Duration Ceiling Events Remaining Events Removed % remaining
707 83,239 10,983 88%
1,300 89,547 4,675 95%
2,100 91,916 2,306 97.6%
3,600 93,378 844 99%

Just for kicks I calculated the average time spent on editing records across the datasets that remained for the various cutoffs to get an idea of how the ceilings changed things.

Duration Ceiling Events Included Events Ignored Mean Stddev Sum Average Edit Duration Total Edit Hours
707 83,239 10,983 140.03 160.31 11,656,340 2:20 3,238
1,300 89,547 4,675 196.47 260.44 17,593,387 3:16 4,887
2,100 91,916 2,306 233.54 345.48 21,466,240 3:54 5,963
3,600 93,378 844 272.44 464.25 25,440,348 4:32 7,067
431,644 94,222 0 363.76 2311.13 34,274,434 6:04 9,521

In the table above you can see what the different duration ceilings do to the data analyzed.  I calculated the mean of each of the datasets and their standard deviations (really, Solr’s statsComponent did that).  I converted those means into minutes and seconds in the “Average Edit Duration” column, and the final column is the number of person hours that were spent editing metadata in 2014 based on each dataset.

Going forward I will be using 2,100 seconds as my duration ceiling and ignoring the edit events that took longer than that.  I’m going to do a little work on figuring out the costs associated with metadata creation in our collections for the last year.  So check back for the next post in this series.

As always feel free to contact me via Twitter if you have questions or comments.