Monthly Archives: February 2018

User Session Analysis: Investigating Sessions

In the previous post in this series I laid out the work we planned to do with session data from the UNT Libraries’ Digital Collections.  For the background this post builds on, take a quick look at that post.

In this post we are going to look at the data for the 10,427,111 user sessions that we generated from the 2017 Apache access logs from the UNT Libraries Digital Collections.

Items Per Session

The first thing that we will take a look at in the dataset is information about how many different digital objects or items are viewed during a session.

Items Accessed | Sessions | Percentage of All Sessions
1 | 8,979,144 | 86.11%
2 | 809,892 | 7.77%
3 | 246,089 | 2.36%
4 | 114,748 | 1.10%
5 | 65,510 | 0.63%
6 | 41,693 | 0.40%
7 | 29,145 | 0.28%
8 | 22,123 | 0.21%
9 | 16,574 | 0.16%
10 | 15,024 | 0.14%
11 | 10,726 | 0.10%
12 | 9,087 | 0.09%
13 | 7,688 | 0.07%
14 | 6,266 | 0.06%
15 | 5,569 | 0.05%
16 | 4,618 | 0.04%
17 | 4,159 | 0.04%
18 | 3,540 | 0.03%
19 | 3,145 | 0.03%
20-29 | 17,917 | 0.17%
30-39 | 5,813 | 0.06%
40-49 | 2,736 | 0.03%
50-59 | 1,302 | 0.01%
60-69 | 634 | 0.01%
70-79 | 425 | 0.00%
80-89 | 380 | 0.00%
90-99 | 419 | 0.00%
100-199 | 2,026 | 0.02%
200-299 | 411 | 0.00%
300-399 | 105 | 0.00%
400-499 | 63 | 0.00%
500-599 | 24 | 0.00%
600-699 | 43 | 0.00%
700-799 | 28 | 0.00%
800-899 | 20 | 0.00%
900-999 | 6 | 0.00%
1000+ | 19 | 0.00%

I grouped the item uses per session to make the table a little easier to read.  With 86% of sessions consisting of a single item access, that leaves 14% of sessions with more than one item access.  That is still 1,447,967 sessions we can look at in the dataset, so not bad.
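The grouping used in the table above can be sketched in a few lines of Python.  This is a hypothetical reconstruction (the actual scripts aren't shown in this post): exact buckets for 1-19 items, decade buckets through 99, century buckets through 999, and a 1000+ catch-all.

```python
from collections import Counter

def bucket_label(n):
    """Group an items-per-session count the way the table does:
    exact counts 1-19, then 20-29 ... 90-99, then 100-199 ... 900-999,
    and finally 1000+."""
    if n < 20:
        return str(n)
    if n < 100:
        low = (n // 10) * 10
        return f"{low}-{low + 9}"
    if n < 1000:
        low = (n // 100) * 100
        return f"{low}-{low + 99}"
    return "1000+"

def items_per_session_table(session_item_counts):
    """Build (bucket, sessions, percentage) rows from a list that holds
    one items-used count per session."""
    total = len(session_item_counts)
    buckets = Counter(bucket_label(n) for n in session_item_counts)
    return [(label, count, 100.0 * count / total)
            for label, count in buckets.items()]
```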

You can also see that a few sessions have a very large number of items associated with them.  For example, there are 19 sessions that used over 1,000 items each.  I would guess that these are scripts or harvesters masquerading as browsers.

Here are some descriptive statistics for the items per session data.

N | Min | Median | Max | Mean | Stdev
10,427,111 | 1 | 1 | 1,828 | 1.53 | 4.735

For further analysis we will probably restrict our sessions to those that used fewer than 20 items.  While this might remove some legitimate sessions that used a large number of items, it gives us numbers we can feel a bit more confident about.  That leaves 1,415,596 sessions, or 98% of the sessions with more than one item used, in the dataset for further analysis.
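The descriptive statistics and the proposed cutoff can both be computed with the standard library.  A minimal sketch (the `summarize` helper and its `max_items` parameter are my own names, not from the original scripts):

```python
import statistics

def summarize(counts, max_items=20):
    """Descriptive statistics for an items-per-session list, plus the
    filtered subset: multi-item sessions below the max_items cutoff."""
    stats = {
        "n": len(counts),
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "stdev": statistics.stdev(counts),
    }
    kept = [c for c in counts if 1 < c < max_items]
    return stats, kept
```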

Duration of Sessions

The next thing we will look at is the duration of sessions in the dataset.  We limited a single session to all interactions by an IP address within a thirty-minute window, so sessions can last up to 1,800 seconds.

Minutes | Sessions | Percentage of Sessions
0 | 8,539,553 | 81.9%
1 | 417,601 | 4.0%
2 | 220,343 | 2.1%
3 | 146,100 | 1.4%
4 | 107,981 | 1.0%
5 | 87,037 | 0.8%
6 | 71,666 | 0.7%
7 | 60,965 | 0.6%
8 | 53,245 | 0.5%
9 | 47,090 | 0.5%
10 | 42,428 | 0.4%
11 | 38,363 | 0.4%
12 | 35,622 | 0.3%
13 | 33,110 | 0.3%
14 | 31,304 | 0.3%
15 | 29,564 | 0.3%
16 | 27,731 | 0.3%
17 | 26,901 | 0.3%
18 | 25,756 | 0.2%
19 | 24,961 | 0.2%
20 | 32,789 | 0.3%
21 | 24,904 | 0.2%
22 | 24,220 | 0.2%
23 | 23,925 | 0.2%
24 | 24,088 | 0.2%
25 | 24,996 | 0.2%
26 | 26,855 | 0.3%
27 | 30,177 | 0.3%
28 | 39,114 | 0.4%
29 | 108,722 | 1.0%

The table above groups sessions into one-minute buckets.  The biggest bucket by number of sessions is the 0-minute bucket, which holds sessions up to 59 seconds in length and accounts for 8,539,553, or 82%, of the sessions in the dataset.
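Bucketing by minute is just integer division on the session duration.  A small sketch, assuming each session is a dict carrying the `timestamp_start` and `timestamp_end` Unix timestamps described elsewhere on this page (the function names are mine):

```python
from collections import Counter

def duration_minutes(session):
    """Whole-minute bucket for a session: 0-59 seconds falls in
    bucket 0, 60-119 seconds in bucket 1, and so on."""
    return (session["timestamp_end"] - session["timestamp_start"]) // 60

def minute_histogram(sessions):
    """Map each minute bucket to (session count, percentage of all sessions)."""
    counts = Counter(duration_minutes(s) for s in sessions)
    total = len(sessions)
    return {m: (c, 100.0 * c / total) for m, c in sorted(counts.items())}
```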

Duration | Sessions | Percent of Sessions Under 1 Min
0 sec | 5,892,556 | 69%
1-9 sec | 1,476,112 | 17%
10-19 sec | 478,262 | 6%
20-29 sec | 257,916 | 3%
30-39 sec | 181,326 | 2%
40-49 sec | 140,492 | 2%
50-59 sec | 112,889 | 1%

You might be wondering about those sessions that lasted zero seconds.  There are 5,892,556 of them, which is 69% of the sessions that were under one minute.  These are almost always sessions that used an item through an embedded link, a PDF viewed directly from another site (Google, Twitter, a webpage), or a similar kind of access.

Next Steps

This post helped us get a better look at the data we are working with.  There is a bit of strangeness here and there, but that is pretty normal when you work with access logs.  The Web is a strange place, full of people, spiders, bots, and scripts.

Next up, we will actually dig into some of the research questions from the first post.  We know how we are going to limit the data to remove outliers in the number of items used, and we’ve given a bit of information about the large number of very short sessions.  So more to come.

If you have questions or comments about this post, please let me know via Twitter.

User Session Analysis: Connections Between Collections, Type, Institutions

I’ve been putting off some analysis that a few of us at the UNT Libraries have wanted to do with the log files of the UNT Libraries Digital Collections.  This post (and probably a short series to follow) is an effort to get back on track.

There are three systems that we use to provide access to content: The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

In our digital collections there are a few things that we’ve said over time that we feel very strongly about but have never really measured.  First, we have said there is value in co-locating all of our content in the same fairly uniform system instead of building visually and functionally distinct systems for different collections of items.  Instead of each new project or collection going into a new system, we’ve said there is not only cost savings but real value in putting them all together: “there is an opportunity for users to not only find content from your collection, but they could find useful connections to other items in the overall digital library”.

Another thing we’ve said is that there is value in putting all the different types of digital objects together in our systems.  We put newspapers, photographs, maps, audio, video, and datasets together, and we think there is value in that: users can find newspaper issues, photographs, and maps that might meet their need.  If we had a separate newspaper system and separate video or audio systems, some of this cross-type discovery would never take place.

Finally, we’ve said that there is great value in locating collections from many institutions together in a system like The Portal to Texas History.  We thought (and still think) that a single search can pull together matching resources from institutions across Texas.  Because of the geography of the state, you might find things that are physically located 10 or 12 hours away from each other at different institutions.  In the Portal, these can be displayed together, something that would be challenging if they weren’t co-located in one system.

In our minds these aren’t completely crazy concepts, but we do run into other institutions and practitioners who don’t always feel as strongly about this as we do.  The one thing that we’ve never done locally is look at the usage data of the systems and find out:

  • Do users discover and use items from different collections?
  • Do users discover and use items that are different types?
  • Do users discover and use items that are from different contributing partners?

This blog post is the first in a short series that takes a look at the usage data in the UNT Libraries Digital Collections in an attempt to answer some of these questions.

Hopefully that is enough background, now let’s get started:

How to Answer the Questions

To get started we had to think a little about how we wanted to pull this data together.  We have been generating item-based usage statistics for the digital library collections for a while.  These get aggregated into collection and partner statistics that we make available in the different systems.  The problem with this data is that it only shows which items were used and how many times per day; it doesn’t show what was used together.

We decided that we needed to go back to the log files from the digital collections and re-create user sessions to group item usage together.  Once we know which items were used together, we can sprinkle in some metadata about those items and start answering our questions.

With that as a plan we can move to the next step.

Preparing the Data

We decided to use all of the log files for 2017 from our digital collections servers.  This ends up being 1,379,439,042 lines of Apache access logs (geez, over 1.3 billion, or 3.7 million server requests a day).  The data came from two different servers that collectively host all of the application traffic for the three systems that make up the UNT Libraries’ Digital Collections.

We decided to define a session as all of the interactions that a single IP address has with the system within a 30-minute window.  If a user uses the system for more than 30 minutes, say 45 minutes, that counts as one thirty-minute session and one fifteen-minute session.

We started by writing a script that would do three things.  First, it ignores lines in the log file that come from robots and crawlers; we have a pretty decent list of these bots, so they were easy to remove.  Next, we further reduced the data by keeping only digital object accesses, specifically lines that look something like `/ark:/67531/metapth1000000/`.  This pattern in our system denotes an item access, and these are what we were interested in.  Finally, we were only concerned with accesses that returned content, so we kept only lines with a 200 status code.
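The three filters above could look something like the sketch below.  This is a hypothetical reconstruction, not the actual script: the function name, parameters, and the exact bot-matching strategy (here, a simple set of known user-agent strings) are my assumptions.

```python
import re

# Item accesses in the system look like /ark:/67531/<identifier>/...
ARK_PATTERN = re.compile(r"/ark:/67531/[a-z0-9]+")

def keep_line(request_path, status_code, user_agent, bot_agents):
    """Apply the three filters: drop known bots, keep only digital
    object (ark) accesses, and keep only 200 responses."""
    if user_agent in bot_agents:       # robots and crawlers
        return False
    if status_code != 200:             # only accesses that returned content
        return False
    return ARK_PATTERN.search(request_path) is not None
```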

We filtered the log files down to three columns of data: the timestamp of the HTTP access, the hash of the IP address that made the request, and the path of the digital item requested.  This resulted in a much smaller dataset, from 1,379,439,042 lines down to 144,405,009.
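Reducing a filtered line to those three columns might look like this.  The post doesn't say which hash function was used; MD5 is an assumption here (it does produce 32-character hex digests like the ones in the sample below).

```python
import hashlib

def reduce_line(timestamp, ip_address, request_path):
    """Reduce a filtered log line to three tab-separated columns:
    Unix timestamp, a hash of the IP address, and the item path.
    MD5 is assumed; the original post doesn't name the hash."""
    ip_hash = hashlib.md5(ip_address.encode("utf-8")).hexdigest()
    return f"{timestamp}\t{ip_hash}\t{request_path}"
```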

Here is what a snippet of the data looks like:

1500192934      dce4e45d9a90e4a031201b876a70ec0e  /ark:/67531/metadc11591/m2/1/high_res_d/Bulletin6869.pdf
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/43/high_res/
1500192940      fa057cf285725981939b622a4fe61f31  /ark:/67531/metadc98866/m1/41/high_res/
1500192944      b63927e2b8817600aadb18d3c9ab1557  /ark:/67531/metadc33192/m2/1/high_res_d/dissertation.pdf
1500192945      accb4887d609f8ef307d81679369bfb0  /ark:/67531/metacrs10285/m1/1/high_res_d/RS20643_2006May24.pdf
1500192948      decabc91fc670162bad9b41042814080  /ark:/67531/metadc504184/m1/2/small_res/
1500192949      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/
1500192951      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/small_res/
1500192950      c8a320f38b3477a931fabd208f25c219  /ark:/67531/metadc1729/m1/9/med_res_d/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/1/med_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/3/small_res/
1500192953      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/2/small_res/
1500192952      f7948b68f7b52fd15c808beee544c131  /ark:/67531/metadc52714/m1/4/small_res/
1500192955      67ef5c0798dd16cb688b94137b175f0b  /ark:/67531/metadc848614/m1/2/small_res/
1500192963      a19ce3e92cd3221e81b6c3084df2d4a6  /ark:/67531/metadc5270/m1/254/med_res/
1500192961      ea9ba7d064412a6d09ff708c6e95e201  /ark:/67531/metadc85867/m1/4/high_res/

You can see the three columns in the data there.

The next step was to sort all of this data by the timestamp in the first column.  You might notice that not all of the lines in the sample above are in chronological order; sorting on the timestamp puts them in order by time.

We then further reduced this data down into sessions.  We created a short script that we could feed the sorted data into; it keeps track of the IP hashes it comes across, notes the objects each hash used, and after a thirty-minute period (based on the timestamp) starts the aggregation again.
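A sessionizer along those lines could be sketched as follows.  This is my reconstruction, not the actual script: it assumes sorted input lines in the three-column format shown earlier, and it closes a session once 30 minutes have elapsed since the session started.

```python
import re

SESSION_WINDOW = 30 * 60  # thirty minutes, in seconds
ARK_ID = re.compile(r"/ark:/67531/([a-z0-9]+)")

def sessionize(lines):
    """Fold time-sorted 'timestamp  ip_hash  path' lines into per-IP
    sessions, yielding each session dict once its window closes."""
    open_sessions = {}  # ip_hash -> session dict currently being built
    for line in lines:
        ts_str, ip_hash, path = line.split()
        ts = int(ts_str)
        match = ARK_ID.search(path)
        if match is None:
            continue
        ark = match.group(1)
        session = open_sessions.get(ip_hash)
        # Start a new session on first sight of an IP hash, or once the
        # thirty-minute window since the session started has elapsed.
        if session is None or ts - session["timestamp_start"] >= SESSION_WINDOW:
            if session is not None:
                yield session
            session = {"ip_hash": ip_hash, "arks": [],
                       "timestamp_start": ts, "timestamp_end": ts}
            open_sessions[ip_hash] = session
        session["timestamp_end"] = ts
        if ark not in session["arks"]:
            session["arks"].append(ark)
    # Flush sessions still open when the input is exhausted.
    yield from open_sessions.values()
```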

The result was a short JSON structure that looks like this:

{
  "arks": ["metapth643331", "metapth656112"],
  "ip_hash": "85ebfe3f0b71c9b41e03ead92906e390",
  "timestamp_end": 1483254738,
  "timestamp_start": 1483252967
}

This JSON has the IP hash, the starting and ending timestamps for the session, and the items that were used.  Each of these JSON structures was written as a single line of a line-oriented JSON file that gets used in the following steps.

This line-oriented JSON file is 10,427,111 lines long, with each line representing a single user session in the UNT Libraries’ Digital Collections.  I think that’s pretty cool.
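Reading the file back for analysis is straightforward: parse one JSON object per line.  A small sketch (the helper names are mine, not from the original scripts):

```python
import json

def load_sessions(path):
    """Iterate over a line-oriented JSON session file, one dict per line."""
    with open(path) as handle:
        for line in handle:
            yield json.loads(line)

def multi_item_sessions(path):
    """Count sessions that used more than one item."""
    return sum(1 for s in load_sessions(path) if len(s["arks"]) > 1)
```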

I’m going to wrap up this post here, but in the next post I will take a look at what these user sessions look like with a little bit of sorting, grouping, plotting, and graphing.

If you have questions or comments about this post, please let me know via Twitter.