Monthly Archives: December 2014

A measure of metadata improvement in the UNT Libraries Digital Collections

The Portal to Texas History is working with the Digital Public Library of America as a Service Hub for the state of Texas.  As part of this work we are pursuing a number of projects to increase the number and improve the quality of metadata records in the Portal.

We have a number of student assistants within the UNT Libraries who are working to create metadata for items in our system that do not yet have complete metadata records.  In doing so we are able to make these items available to the general public.  I thought it might be interesting to write a bit about how we measure this work and show that we are in fact making more content available.

What is a hidden record?

We have two kinds of records that get loaded into our digital library systems:  records that are completely fleshed out and “live”, and minimal records that serve as placeholders until the full records are created.  The minimal records almost always go into the system in a hidden state, while the full records are most often loaded unhidden, or published.  There are situations where we load full records into the system as hidden records, but that is fairly rare.

How many hidden records?

When we started working on the Service Hubs project with DPLA we had 39,851 hidden metadata records out of 754,944 total metadata records in the system.  That is about 5.3% of the records in the system in a hidden state.

Why so many?

There are a few different categories that we can sort our hidden records into:

  • Items that are missing full metadata records.  This accounts for the largest percentage of hidden records.
  • Orphaned metadata records:  records belonging to partner institutions around the state that will most likely never be completed because something on the partner’s end fell through before the metadata records were finished.
  • Items that for one reason or another are marked as “duplicate” and are waiting to be purged from the access system.
  • Items in an embargoed state, either because the rights owner has placed an access embargo on them or because we haven’t yet been able to fully secure rights for them.

Together these make up all of the hidden items in our system.  Unfortunately we currently don’t have a great way of differentiating between these different kinds of hidden records.

How are you measuring progress?

One of the metrics that we are using to establish that we are in fact reducing the number of hidden items in the system is the ratio of hidden records to total records, tracked over time.  This gives us a way to show that we are making progress and continuing to reduce the ratio of hidden to unhidden records in the system.  The following table shows the data we’ve been collecting for this since August 2014.

Date         Total      Hidden   Percent Hidden
2014-08-04     754,944  39,851   5.28%
2014-09-02     816,446  43,238   5.30%
2014-10-14     907,816  38,867   4.28%
2014-11-05     937,470  44,286   4.72%
2014-12-14   1,014,890  41,264   4.07%
2015-01-11   1,053,607  42,514   4.04%

You can see that even though there have been a few month-to-month increases, the overall trend in the percentage of hidden records is downward.  The dataset, which is updated each month, is available as a Google Drive Spreadsheet.
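For reference, the percentage in the table is simply hidden records divided by total records.  A minimal Python sketch over the same figures:

# Percent of hidden records over time, computed from the figures in the
# table above.
counts = [
    ("2014-08-04", 754944, 39851),
    ("2014-09-02", 816446, 43238),
    ("2014-10-14", 907816, 38867),
    ("2014-11-05", 937470, 44286),
    ("2014-12-14", 1014890, 41264),
    ("2015-01-11", 1053607, 42514),
]

for date, total, hidden in counts:
    print("%s  %6d hidden / %9d total = %.2f%% hidden"
          % (date, hidden, total, 100.0 * hidden / total))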

There are several projects that we have loaded in a hidden state over the past few months, including over 7,000 Texas Patent records, 1,200 Texas State Auditor’s Reports, and 3,000 photographs from a personal photograph collection.  These hidden loads explain the large jumps in the numbers above.

Areas to improve

One of the things that we have started to think about (but don’t have any solid answers for yet) is a way of classifying the different states that a metadata record can have in our system, so that we can better understand why items are hidden or unhidden.  We recognize that our simple hidden/unhidden designation is lacking.  I would be interested in knowing how others are approaching this sort of issue and whether there is existing work to build upon.  If there is something out there do get in touch and let me know.
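As one possible direction (a hypothetical sketch only, not something we have implemented), the hidden/unhidden boolean could be replaced with an enumerated record state built from the categories described above:

# Hypothetical sketch only: an enumerated record state instead of a
# hidden/unhidden boolean, based on the categories described earlier.
RECORD_STATES = [
    ("live", "Full record, publicly visible"),
    ("minimal", "Placeholder record awaiting full metadata"),
    ("orphaned", "Partner project ended before metadata was completed"),
    ("duplicate", "Marked as a duplicate, awaiting purge"),
    ("embargoed", "Rights-holder embargo, or rights not yet secured"),
]

HIDDEN_STATES = {"minimal", "orphaned", "duplicate", "embargoed"}

def is_hidden(state):
    """A record is visible only in the 'live' state."""
    return state in HIDDEN_STATES

print(is_hidden("embargoed"))  # True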

Calculating a Use

In the previous post I talked a little about what a “use” is for our digital library and why we use it as a unit of measurement.  We’ve found that it is a very good indicator of engagement with our digital items.  It allows us to understand how our digital items are being used in ways that go beyond Google Analytics reports.

In this post I wanted to talk a little bit about how we calculate these metrics.

First of all, a few things about the setup that we have for delivering content.

We have a set of application servers running an Apache/Python/Django stack for receiving and delivering requests for digital items.  A few years ago we decided that it would be useful to proxy all of our digital object content through these application servers for delivery, so that we could adjust and possibly restrict content in some situations.  This means two things: one, that all traffic for requests and delivery of content goes in and out of these application servers; and two, that we can rely on the log files that Apache produces to get a good understanding of what is going on in our system.

We decided to base our metrics on the best practices of the COUNTER initiative whenever possible, to try to align our numbers with that community.

Each night at about 1:00 AM CT we start a process that aggregates all of the log files for the previous day from the different application servers.  These are collected on a single server that is responsible for calculating the daily uses.

We are using the NCSA extended/combined log format for the log files coming from our servers, with the domain name of the request prepended, because we operate multiple interfaces and domains from the same servers.  A typical set of log lines looks something like this:

texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74680/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74678/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.180.228.104 - - [21/Dec/2014:03:49:28 -0600] "GET /search/?q=%22Bowles%2C+Flora+Gatlin%2C+1881-%22&t=dc_creator&display=grid HTTP/1.0" 200 17900 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74662/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] "GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 "http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"
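To give a sense of what working with these lines involves, here is a rough Python sketch of parsing a domain-prefixed combined log line.  The regular expression and field names are illustrative, not our production code.

import re

# Combined (NCSA extended) log format, with the domain name prepended.
LOG_PATTERN = re.compile(
    r'(?P<domain>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] '
        '"GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 '
        '"http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" '
        '"Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"')

match = LOG_PATTERN.match(line)
if match:
    print(match.group("ip"), match.group("path"), match.group("status"))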


Here are the steps that we go through for calculating uses (a rough sketch in code follows the list).

  • Remove requests that are not for an ARK identifier; the request portion must start with /ark:/
  • Remove requests from known robots (this list, plus additional robots)
  • Remove requests with no user agent
  • Remove all requests that are not 200 or 304
  • Remove requests for thumbnails, raw metadata, and feedback URLs from the object, as they are generally noise
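Put together, the filtering pass might look something like this in Python.  The robot list, noise markers, and entry fields here are stand-ins for illustration, not our actual rules.

# Illustrative filter pass over parsed log entries; ROBOT_AGENTS and
# NOISE_MARKERS are sample values, not our production lists.
ROBOT_AGENTS = ("Googlebot", "Yahoo! Slurp", "bingbot", "Python-urllib")
NOISE_MARKERS = ("thumbnail", "metadata", "feedback")

def keep(entry):
    """Return True if a parsed log entry should count toward a use."""
    if not entry["path"].startswith("/ark:/"):
        return False                      # not a request for an ARK identifier
    if any(robot in entry["user_agent"] for robot in ROBOT_AGENTS):
        return False                      # known robot
    if not entry["user_agent"] or entry["user_agent"] == "-":
        return False                      # no user agent
    if entry["status"] not in ("200", "304"):
        return False                      # not a successful or cached response
    if any(marker in entry["path"] for marker in NOISE_MARKERS):
        return False                      # thumbnails, raw metadata, feedback
    return True

sample = {
    "path": "/ark:/67531/metapth74679/m1/1/high_res/?width=930",
    "user_agent": "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0",
    "status": "200",
}
print(keep(sample))  # True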

From the lines in the first example there would be only one line left after processing: the last line.

The lines that remain are sorted by date and then grouped by IP address.  A 30-minute time window is then run over the requests, using the IP address and ARK identifier together as the key to the window, which allows us to group requests from a single IP address for a single item into a single use.  These uses are then fed into a Django application where we store our stats data.
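A minimal sketch of that grouping step, assuming the filtered requests are (timestamp, IP, ARK) tuples and that each window is anchored at the first request that opens it (an assumption on my part):

from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)

def count_uses(requests):
    """Collapse requests into uses: one use per (ip, ark) key per
    30-minute window.  `requests` is an iterable of (timestamp, ip, ark)
    tuples, where timestamps are datetime objects."""
    window_start = {}   # (ip, ark) -> timestamp that opened the current window
    uses = []
    for timestamp, ip, ark in sorted(requests):
        key = (ip, ark)
        if key not in window_start or timestamp - window_start[key] > WINDOW:
            uses.append((timestamp, ip, ark))   # a new use begins
            window_start[key] = timestamp
    return uses

requests = [
    (datetime(2014, 12, 21, 3, 49), "68.229.222.120", "ark:/67531/metapth74679"),
    (datetime(2014, 12, 21, 3, 55), "68.229.222.120", "ark:/67531/metapth74679"),
    (datetime(2014, 12, 21, 5, 10), "68.229.222.120", "ark:/67531/metapth74679"),
]
print(len(count_uses(requests)))  # 2: the first two requests share one window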

There are of course many other specifics about the process of getting these stats processed and moved into our system.  I hope this post was helpful for explaining the kinds of things that we do and don’t count when calculating uses.


What is a use?

One of the metrics that we use for the various digital library systems that we run at work is the idea of an item “use”.

This post will hopefully explain a bit more about how a use is calculated and presented.

The different digital library systems that we operate (The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History) make use of Google Analytics to log and report on access to these systems.  Below is a screenshot of the Google Analytics data for the last month related to The Portal to Texas History.

[Image: Google Analytics screenshot for The Portal to Texas History]

From Google Analytics we are able to get a rough idea of the number of users, sessions, and pageviews as well as a whole host of information that is important for running a large website like a digital library.

There are a number of features of Google Analytics that we can take advantage of that allow us to understand how users are interacting with our systems and interfaces.

One of the challenges we have with this kind of analytics is that it only collects information when triggered by Javascript on the page.  This can happen when the page is loaded or when something is clicked on the page.  The reason this is sometimes not enough for our reporting is that much of the content in our various digital libraries is linked to directly by outside resources, either embedded in discussion forums or via links that send users directly to the PDF representation of an item.

A few years ago we decided to start accounting for this kind of usage in addition to the data that Google Analytics provides.  In order to do this we developed a set of scripts that run each night over the previous day’s worth of log files on the application servers that serve our digital library content.  These log files are aggregated to a single place, parsed, and then filtered to leave us with the information we are interested in for the day.  The resulting data are the unique uses that an item has had from a given IP address during a 30-minute window.  This allows us to report on uses of theses and dissertations that may be linked to directly from a Google search result, or of an image embedded in another site’s blog post that pertains to one of our digital libraries.

Once we have the data for a given object we are able to aggregate that usage information to the collection and partner to which the item belongs.  This allows us to show information about usage at the collection or partner level.  Finally, the item use information is aggregated at the system level so that you can see the information for The Portal to Texas History, the UNT Digital Library, or The Gateway to Oklahoma History in one place.
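A simple sketch of that roll-up, assuming per-item use counts are already in hand; the item-to-collection/partner mapping here is made up for illustration:

from collections import Counter

# Hypothetical item -> (collection, partner) mapping for illustration.
ITEM_MEMBERSHIP = {
    "ark:/67531/metapth74679": ("UNT Scholarly Works", "UNT Libraries"),
}

def roll_up(item_uses):
    """Aggregate per-item use counts to collection, partner, and system level."""
    collections, partners, system = Counter(), Counter(), 0
    for ark, uses in item_uses.items():
        collection, partner = ITEM_MEMBERSHIP[ark]
        collections[collection] += uses
        partners[partner] += uses
        system += uses
    return collections, partners, system

print(roll_up({"ark:/67531/metapth74679": 12}))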

[Image: Item page in the UNT Digital Library]

The above image shows how an end user can see the usage data for an item on the item’s About page.  It shows up in the “Usage” section, which displays total uses, uses in the last 30 days, and uses yesterday.

[Image: Usage statistics for an item in the UNT Digital Library]

If a user clicks on the stats tab they are taken to the item’s stats page.  There they can view the most recent 30 days, or select a month or year from the table below the graph.

[Image: Referral data for an item in the UNT Digital Library]

A user can view the referral traffic for a selected month or year by clicking on the referral tab.

[Image: Collection statistics for the UNT Scholarly Works Repository in the UNT Digital Library]

Each item use is also aggregated to the collection and partner level.

[Image: System statistics for the UNT Digital Library]

And finally, a user is able to view statistics for the entire system.  At this time we have usage data for the systems going back to 2009, when we switched over to our current architecture.

I will probably write another post detailing the specifics of what we do and don’t count when we are calculating a “use”.  So more later.