User Session Analysis: Connections Between Collections, Type, Institutions

I’ve been putting off some analysis that a few of us at the UNT Libraries have wanted to do with the log files of the UNT Libraries Digital Collections.  This post (and probably a short series to follow) is an effort to get back on track.

There are three systems that we use to provide access to content and those include: The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

In our digital collections there are a few things that we’ve said over time that we feel very strongly about but which we’ve never really measured.  First off we have said that there is value in co-locating all of our content in the same fairly uniform system instead of building visually and functionally distinct systems for different collections of items.  So instead of each new project or collection going into a new system, we’ve said there is not only cost savings, but real value in putting them all together in a single system.  We’ve said “there is an opportunity for users to not only find content from your collection, but they could find useful connections to other items in the overall digital library”.

Another thing we’ve said is that there is value in putting all different types of digital objects together into our digital systems.  We put the newspapers, photographs, maps, audio, video, and datasets together and we think there is value in that.  We’ve said that users will be able to find newspaper issues, photographs, and maps that might meet their need.  If we had a separate newspaper system, separate video or audio system some of this cross-type discovery would never take place.

Finally we’ve said that there is great value in locating collections from many institutions together in a system like The Portal to Texas History.  We thought (and still think) that users would be able to do a search and it will pull resources together from across institutions in Texas that have matching resources. Because of the geography of the state, you might be finding things that are physically located 10 or 12 hours away from each other at different institutions. In the Portal, these could be displayed together, something that would be challenging if they weren’t co-located in a system.

In our mind these aren’t completely crazy concepts but we do run into other institutions and practitioner that don’t always feel as strongly about this as we do.  The one thing that we’ve never done locally is look at the usage data of the systems and find out:

  • Do users discover and use items from different collections?
  • Do users discover and use items that are different types?
  • Do users discover and use items that are from different contributing partners?

This blog post is going to be the first in a short series that takes a  look at the usage data in the UNT Libraries Digital Collections in an attempt to try and answer some of these questions.

Hopefully that is enough background, now let’s get started:

How to answer the questions.

In order to get started we had to think a little bit about how we wanted to pull together data on this.  We have been generating item-based usage for the digital library collections for a while.  These get aggregated into collection and partner statistics that we make available in the different systems.  The problem with this data is that it just shows what items were used and how many times in a day they were used.  It doesn’t show what was used together.

We decided that we needed to go back to the log files from the digital collections and re-create user sessions to group item usage together.  After we have information about what items were used together we can sprinkle in some metadata about those items and start answering our questions.

With that as a plan we can move to the next step.

Preparing the Data

We decided to use all of the log files for 2017 from our digital collections servers.  This ends up being 1,379,439,042 lines of Apache access logs (geez, over 1.3 billion, or 3.7 million server requests a day).  The data came from two different servers that collectively host all of the application traffic for the three systems that make up the UNT Libraries’ Digital Collections.

We decided that we would define a session as all of the interactions that a single IP address has with the system in a 30 minute window.  If a user uses the system for more than 30 minutes, say 45 minutes, that would count as one thirty minute session and one fifteen minute session.

We started by writing a script that would do three things.  First it would ignore lines in the log file that were from robots and crawlers.  We have a pretty decent list of these bots so that was easy to remove.  Next we further reduced the data by only looking at digital object accesses.  Specifically lines that looked something like ‘/ark:/67531/metapth1000000/`. This pattern in our system denotes an item access and these are what we were interested in.  Finally we only were concerned with accesses that returned content so we only looked at lines that returned a 200 status code.

We filtered the log files down to three columns of data.  The first column was the timestamp for when the http access was made,  the second column was the has of the hashed IP address used to make the request, and the final column was the digital item path requested.  This resulted in a much smaller dataset to work with, from 1,379,439,042 down to 144,405,009 individual lines of data.

Here is what a snipped of data looks like

1500192934      dce4e45d9a90e4a031201b876a70ec0e  /ark:/67531/metadc11591/m2/1/high_res_d/Bulletin6869.pdf
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/43/high_res/
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/41/high_res/
1500192944      b63927e2b8817600aadb18d3c9ab1557  /ark:/67531/metadc33192/m2/1/high_res_d/dissertation.pdf
1500192945      accb4887d609f8ef307d81679369bfb0  /ark:/67531/metacrs10285/m1/1/high_res_d/RS20643_2006May24.pdf
1500192948      decabc91fc670162bad9b41042814080  /ark:/67531/metadc504184/m1/2/small_res/
1500192949      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/
1500192951      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/small_res/
1500192950      c8a320f38b3477a931fabd208f25c219  /ark:/67531/metadc1729/m1/9/med_res_d/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/med_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/3/small_res/
1500192953      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/2/small_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/4/small_res/
1500192955      67ef5c0798dd16cb688b94137b175f0b  /ark:/67531/metadc848614/m1/2/small_res/
1500192963      a19ce3e92cd3221e81b6c3084df2d4a6  /ark:/67531/metadc5270/m1/254/med_res/
1500192961      ea9ba7d064412a6d09ff708c6e95e201  /ark:/67531/metadc85867/m1/4/high_res/

You can see the three columns in the data there.

The next step was actually to sort all of this data by the timestamp in the first column.  You might notice that not all of the lines are in chronological order in the sample above.  By sorting on the timestamp, things will fall into order based on time.

The next step was to further reduce this data down into sessions.  We created a short script that we could feed the data into and it would keep track of the ip addresses it came across, note the objects that the ip hash used, and after a thirty minute period of time (based on the timestamp) it would start the aggregation again.

The result was a short JSON structure that looked like this.

  "arks": ["metapth643331", "metapth656112"],
  "ip_hash": "85ebfe3f0b71c9b41e03ead92906e390",
  "timestamp_end": 1483254738,
  "timestamp_start": 1483252967

This JSON has the ip hash, the starting and ending timestamp for that session, and finally the items that were used.  Each of these JSON structures were placed into a file, a line-oriented set of JSON “files” that would get used in the following steps.

This new line-oriented JSON file is 10,427,111 lines long, with one line representing a single user session for the UNT Libraries’ Digital Collections.  I think that’s pretty cool.

I think I’m going to wrap up this post but in the next post I will take a look at what these users sessions look like with a little bit of sorting, grouping, plotting, and graphing.

If you have questions or comments about this post,  please let me know via Twitter.