Calculating a Use.

In the previous post I talked a little about what a “use” is for our digital library and why we use it as a unit of measurement.  We’ve found it to be a very good indicator of engagement with our digital items, and it allows us to understand how those items are being used in ways that go beyond Google Analytics reports.

In this post I wanted to talk a little bit about how we calculate these metrics.

First of all, a few things about the setup that we have for delivering content.

We have a set of application servers running an Apache/Python/Django stack that receives and delivers requests for digital items.  A few years ago we decided that it would be useful to proxy all of our digital object content through these application servers for delivery, so that we would have access to adjust and possibly restrict content in some situations.  This means two things: first, all traffic for requests and delivery of content goes in and out of these application servers; and second, we can rely on the log files that Apache produces to get a good understanding of what is going on in our system.
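
To make that concrete, here is a minimal sketch of what serving object content through the Django application layer might look like.  This is illustrative only; the view name and the open_object_file helper are hypothetical, not our actual code.

# Minimal sketch (not our actual code) of serving object content through the
# Django application layer so that every delivery is recorded in the Apache logs.
from django.http import FileResponse, Http404

def deliver_file(request, ark, file_name):
    """Stream a file belonging to the object identified by `ark`."""
    try:
        # open_object_file is a hypothetical helper that locates the content
        # in backend storage and returns an open file handle.
        fh = open_object_file(ark, file_name)
    except FileNotFoundError:
        raise Http404("No such file for this object")
    # Returning the content from the view means the request and the response
    # both pass through these servers and show up in their access logs.
    return FileResponse(fh)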

We decided to base our metrics on the best practices of the COUNTER initiative whenever possible, so as to align our numbers with that community.

Each night at about 1:00 AM CT we start a process that aggregates all of the log files for the previous day from the different application servers.  These are collected on a server that is responsible for calculating the daily uses.
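
As a rough illustration of that aggregation step (the file paths and naming below are made up, not our actual layout), the nightly job essentially concatenates the previous day's log from each application server into one combined file for processing.

# Hypothetical sketch of the nightly aggregation step: pull yesterday's
# Apache log from each application server into one combined file.
# The path patterns below are made up for illustration.
import datetime
import glob
import shutil

LOG_PATTERN = "/data/incoming/app*/access.log-{date}"
COMBINED = "/data/uses/access-combined-{date}.log"

def aggregate_logs(day=None):
    day = day or (datetime.date.today() - datetime.timedelta(days=1))
    stamp = day.strftime("%Y%m%d")
    out_path = COMBINED.format(date=stamp)
    with open(out_path, "wb") as out:
        for log_path in sorted(glob.glob(LOG_PATTERN.format(date=stamp))):
            with open(log_path, "rb") as log:
                shutil.copyfileobj(log, out)
    return out_path

if __name__ == "__main__":
    aggregate_logs()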

We use the NCSA extended/combined log format for the logs coming from our servers.  We also record the domain name for each request because we operate multiple interfaces and domains from the same servers.  A typical set of log lines looks something like this:

texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74680/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74678/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.180.228.104 - - [21/Dec/2014:03:49:28 -0600] "GET /search/?q=%22Bowles%2C+Flora+Gatlin%2C+1881-%22&t=dc_creator&display=grid HTTP/1.0" 200 17900 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74662/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] "GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 "http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"
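
If you want to work with log lines like these, one way to pull out the fields is with a regular expression over the combined format with the virtual host at the front.  The pattern and field names below are my own sketch, not the exact code we use.

# Sketch of parsing a combined-format log line that has the virtual host
# prepended, as in the examples above. The regex is illustrative only.
import re

LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] '
        '"GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 '
        '"http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" '
        '"Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"')

fields = LOG_RE.match(line).groupdict()
print(fields["ip"], fields["path"], fields["status"], fields["agent"])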

Here are the steps that we go through for calculating uses (a rough code sketch of this filtering follows the list):

  • Remove requests that are not for an ARK identifier (the request portion must start with /ark:/)
  • Remove requests from known robots (this list, plus additional robots)
  • Remove requests with no user agent
  • Remove all requests with a status other than 200 or 304
  • Remove requests for thumbnails, raw metadata, feedback, and urls from the object, as they are generally noise
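
A condensed sketch of these filters, applied to parsed log records like the ones from the parsing example above, might look something like this.  The robot list and noise path fragments are short stand-ins for the fuller lists we actually maintain.

# Sketch of the filtering rules above applied to parsed log records.
# ROBOT_AGENTS and NOISE_PATHS are stand-ins for much fuller lists.
ROBOT_AGENTS = ("slurp", "googlebot", "bingbot", "python-urllib")
NOISE_PATHS = ("thumbnail", "metadata", "feedback")

def is_use_candidate(record):
    """Return True if a parsed log record should count toward a use."""
    path = record["path"]
    agent = record["agent"].lower()
    if not path.startswith("/ark:/"):
        return False                  # only requests for ARK identifiers
    if agent in ("", "-") or any(robot in agent for robot in ROBOT_AGENTS):
        return False                  # no user agent, or a known robot
    if record["status"] not in ("200", "304"):
        return False                  # only successful or not-modified responses
    if any(noise in path for noise in NOISE_PATHS):
        return False                  # thumbnails, raw metadata, feedback, etc.
    return True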

Of the lines in the example above, only one would be left after processing: the last line.

The lines that remain are sorted by date and then grouped by IP address.  A 30-minute time window is run over the requests, using the IP address and ARK identifier as the key to the window, which allows us to group requests from a single IP address for a single item into a single use.  These uses are then fed into a Django application where we store our stats data.
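
The grouping works roughly like the sketch below, where records that have already been filtered and sorted by time are collapsed into uses whenever the same IP address requests the same ARK within the 30-minute window.  The field and function names are illustrative, and the window handling is my simplification of what we actually run.

# Sketch of collapsing filtered requests, sorted by time, into uses:
# repeated requests from the same IP for the same ARK within a
# 30-minute window are counted as a single use.
from datetime import timedelta

WINDOW = timedelta(minutes=30)

def count_uses(records):
    """records: dicts with 'ip', 'ark', and 'time' (a datetime), sorted by time."""
    window_start = {}  # (ip, ark) -> time of the request that opened the current window
    uses = []
    for rec in records:
        key = (rec["ip"], rec["ark"])
        opened = window_start.get(key)
        if opened is None or rec["time"] - opened > WINDOW:
            uses.append(rec)              # a new use of this item from this IP
            window_start[key] = rec["time"]
    return uses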

There are of course many other specifics about the process of getting these stats processed and moved into our system.  I hope this post was helpful in explaining the kinds of things that we do and don’t count when calculating uses.