In the previous post in this series I laid out the work that we were going to do with session data from the UNT Libraries’ Digital Collections. For the background that this post builds on, take a quick look at that post.
In this post we are going to look at the data for the 10,427,111 user sessions that we generated from the 2017 Apache access logs for the UNT Libraries’ Digital Collections.
Items Per Session
The first thing that we will take a look at in the dataset is information about how many different digital objects or items are viewed during a session.
|Items Accessed||Sessions||Percentage of All Sessions|
I grouped the item uses per session to make the table a little easier to read. With 86% of sessions being single-item accesses, that leaves 14% of sessions with more than one item access. That is still 1,447,967 sessions that we can look at in the dataset, so not bad.
You can also see that there are a few sessions that have a very large number of items associated with them. For example, there are 19 sessions that used over 1,000 items. I would guess that these come from some sort of script or harvester that is masquerading as a browser.
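To make the counting concrete, here is a minimal sketch of how items per session could be tallied. It assumes the session data is available as (session id, item id) pairs; the variable names, identifiers, and sample rows are made up for illustration and are not taken from the real dataset.

```python
from collections import Counter

# Hypothetical input: one (session_id, item_id) pair per item request.
# These identifiers are invented for the example.
requests = [
    ("s1", "item-a"),
    ("s1", "item-a"),  # a repeat view of the same item is only counted once
    ("s2", "item-a"),
    ("s2", "item-b"),
    ("s3", "item-c"),
    ("s4", "item-d"),
    ("s4", "item-e"),
    ("s4", "item-f"),
]


def items_per_session(pairs):
    """Map each session id to the number of distinct items it used."""
    seen = {}
    for session_id, item_id in pairs:
        seen.setdefault(session_id, set()).add(item_id)
    return {session_id: len(items) for session_id, items in seen.items()}


def distribution(counts):
    """Return {items_accessed: (sessions, percent_of_all_sessions)}."""
    tally = Counter(counts.values())
    total = sum(tally.values())
    return {
        n: (sessions, 100 * sessions / total)
        for n, sessions in sorted(tally.items())
    }


counts = items_per_session(requests)
dist = distribution(counts)
for n, (sessions, pct) in dist.items():
    print(f"{n} item(s): {sessions} sessions ({pct:.0f}%)")
```

Counting distinct items rather than raw requests is a choice made for this sketch; if repeat views of the same item should count separately, the set could be swapped for a simple counter.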
Here are some descriptive statistics for the items per session data.
For further analysis we will probably restrict our sessions to those that have under 20 items used in a single session. While this might remove some legitimate sessions that used a large number of items, it will give us numbers that we can feel a bit more confident about. That will leave 1,415,596, or 98%, of the sessions with more than one item used still in the dataset for further analysis.
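As a rough sketch of that restriction, the filter below keeps sessions with more than one item and fewer than 20. The mapping of session ids to item counts is hypothetical, and reading "under 20" as a strict upper bound is my interpretation of the cutoff.

```python
# Hypothetical mapping of session id -> number of distinct items used.
item_counts = {"s1": 1, "s2": 2, "s3": 19, "s4": 20, "s5": 1500}

MAX_ITEMS = 20  # exclusive upper bound, read from "under 20 items"


def multi_item_sessions(counts):
    """Sessions that used more than one item."""
    return {sid: n for sid, n in counts.items() if n > 1}


def within_cap(counts, cap=MAX_ITEMS):
    """Multi-item sessions that stay under the item cap."""
    return {sid: n for sid, n in counts.items() if 1 < n < cap}


multi = multi_item_sessions(item_counts)
kept = within_cap(item_counts)
print(sorted(kept))
print(f"{100 * len(kept) / len(multi):.0f}% of multi-item sessions kept")

# The same ratio using the session counts reported in this post.
reported_share = round(100 * 1_415_596 / 1_447_967)
print(reported_share)
```

The last lines simply redo the arithmetic from the paragraph above: 1,415,596 out of 1,447,967 multi-item sessions rounds to 98%.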
Duration of Sessions
The next thing we will look at is the duration of sessions in the dataset. We limited a single session to all interactions by an IP address in a thirty-minute window, so that gives us the possibility of sessions up to 1,800 seconds long.
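One way to picture the thirty-minute window is the sketch below. It is an illustration under assumptions, not the code that built the dataset: it assumes each log hit has been reduced to an (IP, timestamp in seconds) pair, and that a session opens at an IP's first hit and takes in every later hit from that IP within 1,800 seconds of that first hit. The function names and sample hits are made up.

```python
WINDOW_SECONDS = 30 * 60  # thirty-minute window = 1,800 seconds


def sessionize(hits, window=WINDOW_SECONDS):
    """Group (ip, timestamp) hits into sessions.

    Assumption: the window is measured from the first hit of the session,
    so no session can last longer than `window` seconds.
    """
    sessions = []      # list of (ip, [timestamps])
    open_session = {}  # ip -> index of that IP's current session
    for ip, ts in sorted(hits, key=lambda hit: hit[1]):
        idx = open_session.get(ip)
        if idx is not None and ts - sessions[idx][1][0] <= window:
            sessions[idx][1].append(ts)
        else:
            # First hit from this IP, or the window has closed.
            open_session[ip] = len(sessions)
            sessions.append((ip, [ts]))
    return sessions


def duration(session):
    """Seconds between the first and last hit of a session."""
    timestamps = session[1]
    return timestamps[-1] - timestamps[0]


# Made-up hits using documentation-range IP addresses.
hits = [
    ("192.0.2.1", 0),
    ("192.0.2.1", 600),
    ("192.0.2.1", 1800),
    ("192.0.2.1", 1801),  # falls outside the window, starts a new session
    ("192.0.2.2", 100),
]

sessions = sessionize(hits)
durations = [duration(s) for s in sessions]
print(durations)
```

A different reading of the window, such as closing a session after thirty minutes of inactivity, would allow sessions longer than 1,800 seconds, which is why the fixed window from the first hit is used here.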
|Minutes||Sessions||Percentage of Sessions|
The table above groups sessions into one-minute buckets. The biggest bucket by number of sessions is the 0-minute bucket. It holds sessions that are up to 59 seconds in length and accounts for 8,539,553, or 82%, of the sessions in the dataset.
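The bucketing itself is just integer division by 60. Here is a small sketch, assuming session durations are whole seconds; the sample durations are invented for the example.

```python
from collections import Counter


def minute_bucket(duration_seconds):
    """Return the whole-minute bucket for a duration in seconds."""
    return duration_seconds // 60


# Hypothetical session durations in seconds.
sample_durations = [0, 0, 12, 59, 60, 119, 1800]

buckets = Counter(minute_bucket(d) for d in sample_durations)
for minute, sessions in sorted(buckets.items()):
    print(f"{minute} min: {sessions} sessions")

# Share of the 0-minute bucket using the counts reported in this post.
zero_minute_share = round(100 * 8_539_553 / 10_427_111)
print(zero_minute_share)
```

With this scheme a 59-second session lands in bucket 0 and a 60-second session in bucket 1, which matches the description of the 0-minute bucket above.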
|Duration||Sessions||Percent of Sessions Under 1 Min|
You might be wondering about those sessions that lasted only zero seconds. There are 5,892,556 of them, which is 69% of the sessions that were under one minute. These are almost always sessions that used items as part of an embedded link, a PDF view directly from another site (Google, Twitter, a webpage), or a similar kind of view.
This post helped us get a better look at the data that we are working with. There is a bit of strangeness here and there with the data but this is pretty normal for situations where you work with access logs. The Web is a strange place full of people, spiders, bots, and scripts.
Next up we will actually dig into some of the research questions we had in the first post. We know how we are going to limit our data a bit to get rid of some of the outliers in the number of items used, and we’ve looked briefly at the large number of very short sessions. So more to come.
If you have questions or comments about this post, please let me know via Twitter.