Monthly Archives: March 2015

Metadata Edit Events: Part 4 – Duration, buckets

This is the fourth in a series of posts related to metadata edit events collected by the UNT Libraries from its digital library system from January 1, 2014 until December 31, 2014.  The previous posts covered when, who, and what.

This post will start the discussion on the “how long” or duration of the dataset.

Libraries, archives, and museums have long discussed the cost of metadata creation and improvement projects.  Depending on the size and complexity of the project and the experience of the metadata creators, the costs associated with metadata generation, manipulation, and improvement can vary drastically.

The amount of time that a person takes to create or edit a specific metadata record is often used in calculating what a project will cost to complete.  At the UNT Libraries we have used $3.00 per descriptive record as our metadata cost for projects, and based on the level of metadata created, the workflows used, and the system we’ve developed for metadata creation, this number seems to do a good job of covering our metadata creation costs. It will be interesting to get a sense of how much time was spent editing metadata records over the past year and to map that time to collections, types, formats, and partners.  This will involve a bit of investigation of the dataset before we get to those numbers though.

A quick warning about the rest of the post:  I’m stepping out into deeper water with the analysis I’m going to be doing on our 94,222 edit events. From what I can tell from my research, there are many ways to go about some of this, and I’m not at all claiming that I have the best or even a good approach.  But it has been fun so far.

Duration

The reason we wanted to capture event data when we created our Metadata Edit Event Service was to get a better idea of how much time our users were spending on the task of editing metadata records.

This is accomplished by adding a log entry to the system with a timestamp, identifier, and username when a record is opened;  when the record is published back into the system, the original log time is subtracted from the publish time, which results in the number of seconds taken for the metadata event. (A side note:  this is also the basis for our record locking mechanism, so that two users don’t try to edit the same record at the same time.)
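In case it helps to see the idea in code, here is a minimal sketch of that open/publish duration calculation. The function names and the in-memory log are mine for illustration; this is not our actual implementation.

```python
from datetime import datetime

# Hypothetical in-memory "log" keyed by (record identifier, username);
# the real service stores these entries server-side.
open_log = {}

def record_opened(record_id, username, when=None):
    """Log the timestamp when a user opens a record for editing."""
    open_log[(record_id, username)] = when or datetime.now()

def record_published(record_id, username, when=None):
    """Return the edit duration in seconds when the record is published."""
    when = when or datetime.now()
    opened_at = open_log.pop((record_id, username))
    return int((when - opened_at).total_seconds())

# Example: a four-minute edit session.
record_opened("ark:/67531/metadc265646", "mphillips",
              when=datetime(2014, 1, 4, 22, 53, 0))
print(record_published("ark:/67531/metadc265646", "mphillips",
                       when=datetime(2014, 1, 4, 22, 57, 0)))  # 240 seconds
```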

There are of course a number of issues with this model that we noticed.  First, what if a user opens a record, forgets about it, goes to lunch, and then comes back and publishes the record?  What happens if they open a record and then close it? What happens to that previous log event, is it used the next time?  What happens if a user opens multiple records at once in different tabs?  If they aren’t using the other tabs immediately, they are adding time without really “editing” those records.  What if a user makes use of a browser automation tool like Selenium?  Won’t that skew the data?

The answer to many of these questions is “yep, that happens,” and how we deal with them in the data is something that I’m trying to figure out.  I’ll walk you through what I’m doing so far to see if it makes sense.

Looking at the Data

Hours

As a reminder,  there are 94,222 edit events in the dataset.  The first thing I wanted to take a look at is how they group into buckets based on hours.  I took the durations and divided them by 3600 with floor division, so I should get buckets of 0, 1, 2, 3, 4, …and so on.
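As a rough sketch of that bucketing (assuming durations is simply a list of the event durations in seconds):

```python
from collections import Counter

durations = [45, 310, 3750, 428400]  # example durations in seconds

# Floor-divide by 3600 to get the whole number of hours, then count
# how many events fall into each hour bucket.
hour_buckets = Counter(d // 3600 for d in durations)

for hours, count in sorted(hour_buckets.items()):
    print(hours, count)
```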

Below is a table of these values.

Hours Event Count
0 93,378
1 592
2 124
3 41
4 20
5 5
6 8
7 7
8 1
9 4
10 6
11 2
12 1
14 3
16 5
17 3
18 2
19 1
20 1
21 2
22 2
23 2
24 3
25 1
26 1
29 1
32 2
37 1
40 2
119 1

And then a pretty graph of that same data.

Edit Event durations grouped by hour

What is very obvious from this table and graph is that the vast majority of the edit events, 93,378 (99%), took under one hour to finish.  We already see some outliers, with 119 hours (almost an entire work week… that’s one tough record) at the top end of the event duration list.

While I’m not going to get into it with this post,  it would be interesting to see if there are any patterns to find in the 844 edit events that took an hour or longer to complete.  What percentage of a given user’s records took over an hour?  Do they come from similar collections, types, formats, or partners?  Something for later I guess.

Minutes

Next I wanted to look at the edit events that took less than an hour to complete:  where do they sit if I put them in buckets of 60 seconds?  Filtering out the events that took an hour or longer to complete leaves me with 93,378 events.  Below is the graph of these edit events.

Edit Event durations grouped by minute for events taking under one hour to complete.

You can see a dramatic curve for the edit events as the number of minutes goes up.

I was interested to see where the 80/20 split for this dataset would be and it appears to be right about six minutes.  There are 17,397 (19%) events occurring from 7-60 minutes and 75,981 (81%) events from 0-6 minutes in length.

Seconds

Diving into the dataset one more time, I wanted to look at the 35,935 events that happened in less than a minute.  For me, editing a record in under a minute takes a few different paths.  First, you could be editing a simple field like changing a language code or a resource type.  Second, you could be just “looking” at a record and, instead of closing the record, you hit “publish” again. You might also be switching a record from the hidden state to the unhidden state (or vice versa).  Finally, you might be using a browser automation tool to automate your edits.  Let’s see if we can spot any of these actions when we look at the data.

Edit Event durations for events taking under one minute to complete.

Just by looking at the data above, it is hard to say which of the kinds of events mentioned map to different parts of the curve.  I think when we start to look at individual users and collections some of this information might make a little more sense.

This is going to wrap up this post.  In the next post I’m hoping to define the cutoff that will designate “outliers” from the data that we want to use for calculating average times for metadata creation, and then see how that looks for our various users in the system.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 3 – What

This is the third post in a series related to metadata event data that we collected from January 1, 2014 to December 31, 2014 for the UNT Libraries Digital Collections.  We collected 94,222 metadata editing events during this time.

The first post was about the when of the events:  when did they occur, on what day of the week, and at what time of day.

The second post touched on the who of the events:  who were the main metadata editors, how were edits distributed among the different users, and how was the number of users distributed per month, day, and hour.

This post will look at the what of the events data.  What were the records that were touched,  what collections or partners did they belong to, and so on.

Of the total 94,222 edit events there were 68,758 unique metadata records edited.

By using the helpful st program we can quickly get the statistics for these 68,758 unique metadata records.  By choosing the “complete” stats we get the following data.

N min q1 median q3 max sum mean stddev stderr
68,758 1 1 1 1 45 94,222 1.37034 0.913541 0.0034839
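For anyone without st handy, here is a minimal Python sketch that computes comparable summary statistics from a list of per-record edit counts (the names are mine, and the quartile method may differ slightly from st’s):

```python
import statistics

def summarize(values):
    """Return st-style summary statistics for a list of numbers."""
    values = sorted(values)
    n = len(values)
    q1, median, q3 = statistics.quantiles(values, n=4)
    stddev = statistics.stdev(values)
    return {
        "N": n, "min": values[0], "q1": q1, "median": median, "q3": q3,
        "max": values[-1], "sum": sum(values),
        "mean": statistics.mean(values),
        "stddev": stddev, "stderr": stddev / n ** 0.5,
    }

# edits_per_record would be the list of edit counts for the 68,758 records.
print(summarize([1, 1, 1, 2, 3, 45]))
```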

With this we can see that there is a mean of 1.37 edits per record over the entire dataset with the maximum number of edits for a record being 45.

The total distribution of the number of edits-per-record is presented in the table below.

Number of Edits Instances
1 53,213
2 9,937
3 3,519
4 1,089
5 489
6 257
7 111
8 60
9 30
10 13
11 14
12 7
13 5
14 5
15 1
16 2
17 1
19 1
21 1
26 1
30 1
45 1

Of the 68,758 records edited,  53,213 (77%) of the records were edited only once, with two and three edits per record accounting for 9,937 (14%) and 3,519 (5%) respectively. From there things level off very quickly to under 1% of the records.

When indexing these edit events in Solr I also merged the events with additional metadata from the records.  By doing so we have a few more facets to take a look at, specifically how the edit events are distributed over partner, collection, resource type and format.
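As an illustration only (the core name and field names below are hypothetical, not necessarily what our index uses), a facet query over those fields might look something like this:

```python
import requests

# Hypothetical Solr core and field names for the merged edit-event index.
SOLR_URL = "http://localhost:8983/solr/edit_events/select"

params = {
    "q": "*:*",
    "rows": 0,                # we only want the facet counts
    "facet": "true",
    "facet.field": ["partner", "collection", "resource_type", "format"],
    "facet.limit": 10,        # top ten values per facet
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params)
facets = response.json()["facet_counts"]["facet_fields"]
print(facets["partner"])  # alternating list of value, count pairs
```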

Partner/Institution

There are 167 partner institutions represented in the edit event dataset.

The top ten partners by the number of edit events are presented in the table below.

Partner Code Partner Name Edit Count Unique Records Edited Unique Collections
UNTGD UNT Libraries Gov Docs Department 21,932 14,096 27
OKHS Oklahoma Historical Society 10,377 8,801 34
UNTA UNT Libraries Special Collections 9,481 6,027 25
UNT UNT Libraries 7,102 5,274 27
PCJB Private Collection of Jim Bell 5,504 5,322 1
HMRC Houston Metropolitan Research Center at Houston Public Library 5,396 2,125 5
HPUL Howard Payne University Library 4,531 4,518 4
UNTCVA UNT College of Visual Arts and Design 4,296 3,464 5
HSUL Hardin-Simmons University Library 2,765 2,593 6
HIGPL Higgins Public Library 1,935 1,130 3

In addition to the number of edit events,  I have added a column for the number of unique records for each of the institutions.  The same data is presented in the graph below.

Graph showing the edit event count and unique record count for each of the institutions with the most edit events

A larger difference between the Edit Count and the Unique Records Edited indicates more repeated edits of the same records by that partner.

The final column in the table above shows the number of different collections that were edited that belong to each specific partner.  Taking UNTGD as an example, there are 27 different collections that held records that were edited during the year.  These are listed in the table below.

Collection Code Collection Name Edit Events Records Edited
TLRA Texas Laws and Resolutions Archive 8,629 5,187
TXPT Texas Patents 7,394 4,636
TXSAOR Texas State Auditor’s Office: Reports 2,724 1,223
USCMC United States Census Map Collection 1,779 1,695
USTOPO USGS Topographic Map Collection 490 458
TRAIL Technical Report Archive and Image Library 287 279
CRSR Congressional Research Service Reports 271 270
FCCRD Federal Communications Commission Record 211 208
NACA National Advisory Committee for Aeronautics Collection 62 62
WWPC World War Poster Collection 49 49
WWI World War One Collection 41 41
USDAFB USDA Farmers’ Bulletins 21 19
ATOZ Government Documents A to Z Digitization Project 19 18
WWII World War Two Collection 19 19
ACIR Advisory Commission on Intergovernmental Relations 14 13
NMAP World War Two Newsmaps 12 12
TR Texas Register 12 8
TXPUB Texas State Publications 12 12
GAORT Government Accountability Office Reports 10 10
BRAC Defense Base Closure and Realignment Commission 4 4
OTA Office of Technology Assessment 4 4
GDCC CyberCemetery 2 2
FEDER Federal Communications Commission Record 1 1
GSLTX General and Special Laws of Texas 1 1
TXHRJ Texas House of Representatives Journals 1 1
TXSS Texas Soil Surveys 1 1
UNTGOV Government Documents General Collection 1 1

This set of data is a bit easier to see with a simple graph.  I’ve plotted the ratio of edit events to records for each collection as a simple line graph.
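The ratio itself is trivial to compute; below is a small sketch of how such a plot could be put together with matplotlib, using a few rows from the table above (the variable names are my own).

```python
import matplotlib.pyplot as plt

# (collection code, edit events, records edited) for a few collections,
# taken from the table above.
collections = [
    ("TLRA", 8629, 5187),
    ("TXPT", 7394, 4636),
    ("TXSAOR", 2724, 1223),
    ("USCMC", 1779, 1695),
]

codes = [code for code, _, _ in collections]
ratios = [edits / records for _, edits, records in collections]

plt.plot(codes, ratios, marker="o")
plt.ylabel("Edit events per record")
plt.title("Edits-to-record ratio by collection")
plt.show()
```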

UNT Government Documents Edits to Record Ratios for each collection.

You can look at the graph above and quickly see which of the collections have had a higher edit-to-record ratio, with the Texas State Auditor’s Office: Reports collection having the highest, at over two edits per record.  Many of the other collections are much closer to one edit per record.

Collections

The edit events occur in 266 different collections in the UNT Libraries’ Digital Collections.  As with the 167 partners above,  that is too many to fit into a table, so I’m just going to list the top ten in the table below.

Collection Code Collection Name Edit Events Unique Records
TLRA Texas Laws and Resolutions Archive 8,629 5,187
ABCM Abilene Library Consortium 8,481 8,060
TDNP Texas Digital Newspaper Program 7,618 6,305
TXPT Texas Patents 7,394 4,636
OKPCP Oklahoma Publishing Company Photography Collection 5,799 4,729
JBPC Jim Bell Texas Architecture Photograph Collection 5,504 5,322
TCO Texas Cultures Online 5,490 2,208
JJHP John J. Herrera Papers 5,194 1,996
UNTETD UNT Theses and Dissertations 4,981 3,704
UNTPC University Photography Collection 4,509 3,232

Again plotting the ratio of edit events to the number of unique records gives us the graph below.

Edit Events to Record Ratio grouped by Collection

You can quickly see the two collections that averaged over two edit events for each of the records edited during the last year,  meaning that if a record was edited,  it was most likely edited at least two times.  Other collections, like the Jim Bell Texas Architecture Photograph Collection or the Abilene Library Consortium collection, appear to have been edited only once per record on average,  so when the edit was complete, the record wasn’t revisited for additional editing.

Resource Type

The UNT Libraries makes use of a locally controlled vocabulary for its resource types.  You can view all of the available resource types here.

If you group the edit events and their associated unique records by resource type, you get the following table.

Resource Type Edit Events Unique Records
image_photo 31,702 24,384
text_newspaper 11,598 10,176
text_leg 8,633 5,191
text_patent 7,480 4,667
physical-object 5,591 4,921
text_etd 4,986 3,709
text 4,311 2,511
text_letter 4,276 2,136
image_map 3,542 3,160
text_report 3,375 1,822
image_artwork 1,217 1,042
text_article 1,060 758
video 931 461
sound 719 694
text_legal 687 341
text_journal 549 288
text_book 476 422
image_presentation 430 313
image_postcard 429 180
image_poster 427 321
text_paper 423 312
text_pamphlet 303 199
text_clipping 275 149
text_yearbook 91 66
dataset 54 19
image_score 49 37
collection 41 34
image 34 20
website 22 20
text_chapter 17 14
text_review 13 11
text_poem 3 1
specimen 1 1

By calculating the edit-event-to-record ratio and plotting that you get the following graph.

Edit Events to Record Ratio grouped by Resource Type.

In the graph above I presented the data in the same order as it appears in the table just above the chart.  You can see that the highest ratio is for our single text_poem record, which was edited three different times.  Other notably high ratios are for postcards and datasets, though there are several others that are at or close to a 2-to-1 ratio of edits to records.

Format

The final way we are going to look at the “what” data is by Format.  Again the UNT Libraries uses a controlled vocabulary for the format, which you can look at here.  I’ve once again faceted on the format field and presented the total number of edit events and the unique records for each of the five format types that we have in the system.

Format Edit Events Unique Records
text 48,580 32,770
image 43,477 34,436
video 931 461
audio 720 695
website 22 20

Converting the ratio of events-to-records into a bar graph results in the graph below.

Edit Events to Record Ratio grouped by Format

It looks like we edit video files more times per record than any of the other types with text and then image coming in behind.

Closing

There are almost endless combinations of collections, partners, resource types, and formats that can be put together, and it deserves some further analysis to see if there are patterns present in the data that we should pay attention to.  But that’s more for another day.

This is the third in a series of posts related to metadata edit events in the UNT Libraries’ Digital Collections.  Check back for the next installment.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 2 – Who

In the previous post I started to explore the metadata edit events dataset generated from 94,222 edit events from 2014 for the UNT Libraries’ Digital Collections.  I focused on some of the information about when these edits were performed.

This post focuses on the “who” of the dataset.

Altogether we had 193 unique users edit metadata in one of the systems that comprise the UNT Libraries’ Digital Collections.  This includes The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.

The top ten most frequent editors of metadata in the system are responsible for 57% of the overall edits (a quick sketch of how these counts can be derived follows the table below).

Username Edit Events
htarver 15,451
aseitsinger 10,105
twarner 4,655
mjohnston 4,143
atraxinger 3,905
cwilliams 3,490
sfisher 3,466
thuang 3,327
mphillips 2,669
sdillard 2,518
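Here is a minimal sketch of how the per-user counts and the top-ten share could be derived, assuming events is a list of dicts with a username key (my naming, not the event service’s):

```python
from collections import Counter

# Each event is assumed to carry the username of the editor.
events = [{"username": "htarver"}, {"username": "aseitsinger"},
          {"username": "htarver"}, {"username": "mphillips"}]

edits_per_user = Counter(e["username"] for e in events)
top_ten = edits_per_user.most_common(10)

top_ten_share = sum(count for _, count in top_ten) / len(events)
print(top_ten, round(top_ten_share * 100), "%")
```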

The overall distribution of edits per user looks like this.

Distribution of edits per user for the Edit Event Dataset

As you can see, it shows the primary users of the system and then very quickly tapers down to the “long tail” of users who have a lower number of edit events.

Here is a quick look at the total number of users active on given days of the week across the entire dataset.

Sun Mon Tue Wed Thu Fri Sat
40 95 122 122 123 97 39

There is a swell for Tue, Wed, and Thu in the table above.  It seems to be pretty consistent: you have either 39-40 users, 95-97 users, or 122-123 unique users on a given day of the week.
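For completeness, here is a sketch of how the unique-users-per-weekday counts could be produced from the raw events (again assuming a list of dicts, here with event_date and username keys of my own naming):

```python
from collections import defaultdict
from datetime import datetime

events = [
    {"event_date": "2014-01-04T22:57:00", "username": "mphillips"},
    {"event_date": "2014-01-06T09:15:00", "username": "htarver"},
]

# Collect the set of distinct usernames seen on each day of the week.
users_by_weekday = defaultdict(set)
for event in events:
    when = datetime.strptime(event["event_date"], "%Y-%m-%dT%H:%M:%S")
    users_by_weekday[when.strftime("%a")].add(event["username"])

for day in ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"):
    print(day, len(users_by_weekday[day]))
```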

Looking at how unique users were spread across the year, grouped into months,  we get the following table and graph.

Month Unique Users
January 54
February 73
March 64
April 61
May 44
June 40
July 48
August 50
September 50
October 84
November 49
December 36
Unique Editors Per Month

There were some spikes throughout the year,  most likely related to a metadata class in the UNT College of Information that uses the Edit system as part of its teaching.  This accounts for the October and February spikes in the number of unique users.  Other than that we are consistently over 40 unique users per month, with a small dip during the December holiday season when school is not in session.

In the previous post we had a heatmap with the number of edit events distributed over the hours of the day and the days of the week.  I’ve included that graph below.

94,222 edit events plotted to the time and day they were performed

I was curious to see how the number of unique editors mapped to this same type of graph,  so that is included below.

Unique editors distribution across day of the week and hour of the day.

User Status

Of the 193 unique metadata editors in the dataset, 135 (70%) of the users were classified as Non-UNT-Employee and  58 (30%) were classified as UNT-Employee. For the edit events themselves, 75,968 (81%) were completed by users classified with a status of UNT-Employee  and 18,254 (19%) by users classified with the status of Non-UNT-Employee.

User Rank

Rank Edit Events Percentage of Total Edits (n=94,222) Unique Users Percentage of Total Users (n=193)
Librarian 22,466 24% 16 8%
Staff 12,837 14% 13 7%
Student 41,800 44% 92 48%
Unknown 17,119 18% 72 37%

You can see that 44% of all of the edits in the dataset were completed by users who were students. Librarians and Staff members accounted for 38% of the edits.

This is the second in a series of posts related to metadata edit events in the UNT Libraries’ Digital Collections.  Check back for the next installment.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 1 – When

This is going to be another multi-post series as I wade through some of the data we have been collecting for the past year related to metadata editing and various events within a metadata record’s lifecycle.

Background

For the past few years the UNT Libraries has been collecting data about how long our metadata editors are spending editing records in our systems.  We’ve written on the overall change of metadata in our digital library and presented those findings at last year’s Dublin Core Metadata Initiative conference in Austin, Texas, with a paper called “How Descriptive Metadata Changes in the UNT Libraries’ Collection: A Case Study“. The goal of collecting data about metadata change is to get a better idea of how our metadata editors are interacting with our systems.

What is an edit event?

Our metadata system will create a log entry when a user opens a record to begin editing.  This log acts as the start of a timer for the given edit session of that specific record by a given user.  When the user publishes that metadata record back into the system, the log entry is queried and the amount of time that has passed is recorded, along with the metadata editor’s username,  the identifier for the record, and the state (hidden or unhidden) the item is in when it is saved.  This information is submitted to the Metadata Event Service and logged.

An edit event ends up looking like this once it has been created:

id event_date duration username record_id record status record status change record quality record quality change
73515 2014-01-04T22:57:00 24 mphillips ark:/67531/metadc265646 1 0 1 0
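To make that shape concrete, here is a rough sketch of the same event as a Python structure. The field names follow the table above, the comments are my reading of them, and how the event is actually serialized and submitted to the service is not shown here.

```python
from dataclasses import dataclass

@dataclass
class EditEvent:
    """One metadata edit event, mirroring the fields shown above."""
    id: int
    event_date: str            # timestamp of the publish action
    duration: int              # seconds between open and publish
    username: str
    record_id: str             # ARK identifier of the record
    record_status: int         # state (hidden or unhidden) when saved
    record_status_change: int  # whether that state changed with this edit
    record_quality: int
    record_quality_change: int

event = EditEvent(73515, "2014-01-04T22:57:00", 24, "mphillips",
                  "ark:/67531/metadc265646", 1, 0, 1, 0)
print(event)
```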

With this information we are able to create a number of views into the metadata editing workflow in our environment.  We can easily see the number of metadata edits on a given day, within the month, and for the entire period we’ve been collecting data.  We can view the total number of edits,  the number of unique records edited, and finally the number of hours that our users have spent editing records within a given period.

Below are a few screenshots from our Edit Event Service web-interface.

Homepage for the UNT Libraries Edit Event Service

Daily View for the UNT Libraries Edit Event Service

Monthly View for the UNT Libraries Edit Event Service

Yearly View for the UNT Libraries Edit Event Service

User Detail View for the UNT Libraries Edit Event Service

We are able to query a given day, month, year to view statistics as well as show the rankings and information for a specific user or digital object in the system.

Analyzing a year of data

We were interested in taking a deeper look at the metadata edit events, and that is what the following posts in this series will cover.  A year’s worth of metadata edit data was extracted from the event service.  This was paired with two other datasets:  descriptive metadata about the items edited, including contributing institution, collection, resource type, and format fields;  and a classification of each user in the dataset by status, either UNT-Employee or Non-UNT-Employee, and by rank, either Librarian, Staff, Student, or Unknown.  These datasets were merged to form a complete record for each metadata event in the Edit Events Dataset.  They were added to a Solr index that was used in analyzing this data.

A total of 94,222 edit events occurred from January 1, 2014 to December 31, 2014 and are the base dataset for the analysis presented here.

Month, Day, Hour

During 2014 we averaged 7,852 metadata edits per month.  The monthly totals are presented below.

January 10,133
February 5,082
March 5,960
April 5,543
May 6,622
June 5,136
July 8,099
August 10,508
September 10,989
October 12,840
November 7,712
December 5,598
Monthly Metadata Edit Events for the University of North Texas

Looking at the day of the week that metadata edits occurred shows the expected pattern of the majority of metadata editing activities taking place during the week with fewer happening on the weekend.  The breakdown by day of the week is presented in the table below.

Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2,765 17,506 19,580 16,876 20,838 14,416 2,241
Metadata Edit Events for the University of North Texas by weekday

The hour of the day that metadata is edited is interesting to take a look at.  For the most part the majority of editing is done during the workday,  with the afternoons being the time of day that most records are edited.  The full data is presented below.

Hour Edit Events
0:00 237
1:00 77
2:00 58
3:00 41
4:00 19
5:00 86
6:00 290
7:00 601
8:00 1,836
9:00 6,189
10:00 8,948
11:00 8,868
12:00 8,134
13:00 10,760
14:00 11,653
15:00 11,184
16:00 9,114
17:00 4,868
18:00 3,564
19:00 2,439
20:00 1,947
21:00 1,787
22:00 937
23:00 585

Presented as a graph you can easily see the swell of metadata editing in the afternoons.

Metadata Edit Events for the University of North Texas by hour of the day

If you combine the day of the week and hour of the day data into a single table you will get something like this.

94,222 edit events plotted to the time and day they were performed

In the image above,  green represents lower numbers of edits and red represents higher numbers of edits.  It shows that Thursday afternoons tend to be very busy, while Friday is much lighter compared to other days of the week.
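The table behind a heatmap like this is just a 7 by 24 grid of counts; here is a minimal sketch of building that grid, assuming a list of event dicts with an event_date key (my own naming):

```python
from datetime import datetime

events = [{"event_date": "2014-01-04T22:57:00"},
          {"event_date": "2014-01-06T14:15:00"}]

# grid[weekday][hour] -> number of edit events (Monday == 0)
grid = [[0] * 24 for _ in range(7)]
for event in events:
    when = datetime.strptime(event["event_date"], "%Y-%m-%dT%H:%M:%S")
    grid[when.weekday()][when.hour] += 1

for day, row in zip(("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), grid):
    print(day, row)
```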

That’s it for the first post in this series.  I have posts planned about Who is editing records,  What records they are editing, and finally How Much time we are spending on metadata editing.  Check back for future posts.

As always feel free to contact me via Twitter if you have questions or comments.

Item States in our Digital Repository

One of the things that I keep coming back to in our digital library system are the states that an object can be in and how that affects various aspects of our system.  Hopefully this post can explain some of them and how they are currently implemented locally.

Hidden vs Non-Hidden

Our main distinction once an item is in our system is if it is hidden or not.

Hidden means that the item is not viewable by any of our users and that it is only available in our internal Edit system, where a metadata record and basic access to the item exist. If a request for this item comes in through our public-facing digital library interfaces,  the user will receive a “404 Not Found” response from our system.

If a record is not hidden then it is viewable and discoverable in one of our digital library interfaces.  If an end user tries to access this item there may be limitations based on the level of access,  or any embargoes on the item that might be present.

In our metadata scheme, UNTL,  we notate whether an item is hidden in the following way.  If there is a value of <meta qualifier="hidden">True</meta> then the item is considered hidden.  If there is a value of <meta qualifier="hidden">False</meta> then the item is considered not hidden.  If there is no element with a qualifier of hidden then the default of False is applied in the system and the item is considered not hidden.
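As a rough illustration (assuming a stripped-down UNTL record with no XML namespace, which may not match the real serialization exactly), checking the hidden state might look like this:

```python
import xml.etree.ElementTree as ET

untl_record = """<metadata>
  <title qualifier="officialtitle">Example record</title>
  <meta qualifier="hidden">True</meta>
</metadata>"""

def is_hidden(untl_xml):
    """Return True only if a meta element with qualifier 'hidden' says so."""
    root = ET.fromstring(untl_xml)
    for meta in root.findall("meta"):
        if meta.get("qualifier") == "hidden":
            return (meta.text or "").strip().lower() == "true"
    return False  # no hidden element: default to not hidden

print(is_hidden(untl_record))  # True
```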

This works pretty well for basic situations and with the assumption that nobody will ever make a mistake.

But… People make mistakes.

Deleted Items

The first issue we ran into when we started to scale up our systems is that from time to time we would accidentally load the same resource into the system twice.  This happens for a variety of reasons.  User error on the part of the ingest technician (me) is the major cause.  Also, the same item will sometimes be sent through the digitization/processing queue more than once because of the amount of time it takes for some projects to complete.  There are other situations where the same item will be digitized again because the first instance was poorly scanned, and instead of updating the existing record it is added a second time.  For all of these situations we needed to have a way of suppressing these records.

Right now we add an element to the metadata record, <meta qualifier="recordStatus">deleted</meta>, which designates that this item has been suppressed in the system and that it should be effectively forgotten.  On the technical side this triggers a delete from the Solr index, which holds our metadata indexes, and the item is then gone.
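On the Solr side, that delete could be as simple as something like the following; the URL, core name, and use of the JSON update handler here are illustrative assumptions, not a description of our actual code.

```python
import requests

# Hypothetical Solr core holding the metadata index.
SOLR_UPDATE_URL = "http://localhost:8983/solr/metadata/update"

def purge_from_index(record_id):
    """Remove a suppressed record's document from the Solr index."""
    response = requests.post(
        SOLR_UPDATE_URL,
        json={"delete": {"id": record_id}},
        params={"commit": "true"},
    )
    response.raise_for_status()

purge_from_index("ark:/67531/metadc265646")
```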

When a user requests an item that is deleted she will currently receive a “404 Not Found”, though we have an open ticket to change this behavior to return a “410 Gone” status code for these items. Another limitation of our current process of just deleting these from our Solr index is that we are not able to mark them as “deleted” in our OAI-PMH repositories, which isn’t ideal. Finally, by purging these items completely from our system we have no way of knowing how many have been suppressed/deleted, and no easy way of making the items visible again.

These suppressed records are only deleted from the Solr index; their edit history and the records themselves remain.  In fact, if you know that an item used to be in a non-suppressed state and remember the ARK identifier, you can still access the full record,  remove the recordStatus flag, and un-suppress the item.  Assuming you remember the identifier.

What does hidden really mean?

So right now we have hidden and non-hidden, and deleted and non-deleted.  The deleted items are effectively forgotten about,  but what about those hidden items?  What do they mean?

Here are some of the reasons that we have hidden records vs non-hidden records.

Metadata Missing

We have a workflow for our system that allows us to ingest stub records which have minimal descriptive metadata in place for items so that they can be edited in our online editing environment by metadata editors around the library, university, and state.  These are loaded with minimal title information (usually just the institution’s unique identifier for the item), the partner and collection that the item belongs to, and any metadata that makes sense to set across a large set of records.  Once in the editing system these items will have metadata created for them over time and be made available to the end user.

Hard Embargoes

While our system has built-in functionality for embargoing an item,  this functionality will always make the descriptive metadata for the item available to the public.   In our UNT Scholarly Works Repository, we work to make the contact information for the creators of the item known so that you can “request a copy” of the item if you discover it while it is still under an embargo. Here is an example item that won’t become available until later this year.

Sometimes this is not the desired way of presenting embargoed items to the public.  For example, we work with a number of newspaper publishers around Texas who make their PDF print masters available to UNT for archiving and presentation via The Portal to Texas History.  They do so with the agreement that we will not make their items available until one, two, or three years after publication. Instead of presenting the end user with an item they aren’t able to access in the Portal,  we just keep these items hidden until they are ready to be made available. I have a feeling that this method will change in the near future because it is becoming a large metadata management problem.

Finally there are items that we are either digitizing or capturing which we do not have the ability to provide access to because of current copyright restrictions.  We have these items in a hidden state in the system until either an agreement can be reached with the rights holder, or until the item falls into the public domain.

Right now it is impossible for us to identify how many of these items are being held as “embargoed” through the use of the hidden item flag.

Copyright Challenge, or Personally Identifiable Information

We have another small set of items (less than a dozen… I think) that are hidden because there is an active copyright challenge we are working through for the item, or because the item contained personally identifiable information.  Our first step in these situations is to mark the item as hidden until the situation can be resolved.  If the situation has been successfully resolved and access restored to the item, it is marked as un-hidden.

Others?

I’m sure there are other reasons that an item can be hidden within a system,  and I would be interested in hearing the reasons within your collections, especially if they are different from the ones listed above.  I’m blissfully unaware of any controlled vocabularies for these kinds of states that a record might be in within digital library systems, so if there is prior work in this area I’d love to hear about it.

As always feel free to contact me via Twitter if you have questions or comments.

DPLA Metadata Analysis: Part 4 – Normalized Subjects

This is yet another post in the DPLA Metadata Analysis series that already has three parts; here are links to parts one, two, and three.

This post looks at the effect of basic normalization of subjects on the various metrics mentioned in the previous posts.

Background

One of the things that happens in library land is that subject headings are often constructed by connecting various broader pieces into a single subject string that becomes more specific.  For example the heading “Children–Texas.” is constructed from two different pieces,  “Children”, and “Texas”.  If we had a record that was about children in Oklahoma it could be represented as “Children–Oklahoma.”.

The analysis I did earlier took the subjects exactly as they occurred in the dataset and used those for the analysis.  I was asked what would happen if we normalized the subjects before we did the analysis on them,  effectively turning the unique string of “Children–Texas.” into the two subject pieces “Children” and “Texas”, and then applied the previous analysis to the new data. The specific normalization involves stripping trailing periods and then splitting on double hyphens.

Note:  Because this conversion has the ability to introduce quite a bit of duplication into the number of subjects within a record, I am making the normalized subjects unique before adding them to the index.  I also apply this same method to the un-normalized subjects.  In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because there were 347 values that appeared in the subject list more than once.  Because of this, the numbers in the resulting tables will be slightly different from those in the first three posts when it comes to average subjects and total subjects;  each of these values should go down.
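Here is roughly what that normalization looks like in code; this is a sketch of the approach described above, not the exact script I used.

```python
def normalize_subjects(subjects):
    """Strip trailing periods, split on double hyphens, and de-duplicate."""
    normalized = []
    for subject in subjects:
        for piece in subject.rstrip(".").split("--"):
            piece = piece.strip()
            if piece and piece not in normalized:
                normalized.append(piece)
    return normalized

print(normalize_subjects(["Children--Texas.", "Children--Oklahoma."]))
# ['Children', 'Texas', 'Oklahoma']
```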

Predictions

My predictions before the analysis are that we will see an increase in the number of unique subjects,  a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.

Results

With the normalization of subjects,  the number of unique subject headings dropped from 1,871,884 to 1,162,491,  a reduction of 38%.

In addition to the reduction of the total number of unique subject headings by 38% as stated above,  the distribution of subjects across the Hubs changed significantly, in one case an increase of 443%.  The table below displays these numbers before and after normalization as well as the percentage change.

# of Hubs with Subject # of Subjects # of Normalized Subjects % Change
1 1,717,512 1,055,561 -39%
2 114,047 60,981 -47%
3 21,126 20,172 -5%
4 8,013 9,483 18%
5 3,905 5,130 31%
6 2,187 3,094 41%
7 1,330 2,024 52%
8 970 1,481 53%
9 689 1,080 57%
10 494 765 55%
11 405 571 41%
12 302 453 50%
13 245 413 69%
14 199 340 71%
15 152 261 72%
16 117 205 75%
17 63 152 141%
18 62 130 110%
19 32 77 141%
20 20 55 175%
21 7 38 443%
22 7 23 229%
23 0 2 N/A

The two subjects that are shared across 23 of the Hubs once normalized are “Education” and “United States”.
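For reference, here is a quick sketch of how the “number of Hubs per subject” distribution could be computed, assuming a mapping from each Hub to its set of normalized subjects (my own structure, purely for illustration):

```python
from collections import Counter

# hub name -> set of normalized subject strings found in that hub's records
subjects_by_hub = {
    "HathiTrust": {"Education", "United States", "Children"},
    "The Portal to Texas History": {"Education", "United States", "Texas"},
    "ARTstor": {"United States", "Art"},
}

# For every subject, count how many Hubs it appears in.
hubs_per_subject = Counter()
for subjects in subjects_by_hub.values():
    hubs_per_subject.update(subjects)

# Distribution: how many subjects appear in exactly N Hubs.
distribution = Counter(hubs_per_subject.values())
for n_hubs, n_subjects in sorted(distribution.items()):
    print(n_hubs, n_subjects)
```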

The high level stats for all 8,012,390 records are available in the following table.

 Records Total Subject Strings Count Total Normalized Subject String Count Average Subjects Per Record Average Normalized Subjects Per Record Percent Change
8,012,390 23,860,080 28,644,188 2.98 3.57 20.05%

You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.

Results by Hub

The table below presents data for each hub in the DPLA.  The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and finally the percent of change that is represented.

Hub Records Total Subject String Count Total Normalized Subject String Count Average Subjects Per Record Average Normalized Subjects Per Record Percent Change
ARTstor 56,342 194,883 202,220 3.46 3.59 3.76
Biodiversity Heritage Library 138,288 453,843 452,007 3.28 3.27 -0.40
David Rumsey 48,132 22,976 22,976 0.48 0.48 0
Digital Commonwealth 124,804 295,778 336,935 2.37 2.7 13.91
Digital Library of Georgia 259,640 1,151,351 1,783,884 4.43 6.87 54.94
Harvard Library 10,568 26,641 36,511 2.52 3.45 37.05
HathiTrust 1,915,159 2,608,567 4,154,244 1.36 2.17 59.25
Internet Archive 208,953 363,634 412,640 1.74 1.97 13.48
J. Paul Getty Trust 92,681 32,949 43,590 0.36 0.47 32.30
Kentucky Digital Library 127,755 26,008 27,561 0.2 0.22 5.97
Minnesota Digital Library 40,533 202,456 211,539 4.99 5.22 4.49
Missouri Hub 41,557 97,111 117,933 2.34 2.84 21.44
Mountain West Digital Library 867,538 2,636,219 3,552,268 3.04 4.09 34.75
National Archives and Records Administration 700,952 231,513 231,513 0.33 0.33 0
North Carolina Digital Heritage Center 260,709 866,697 1,207,488 3.32 4.63 39.32
Smithsonian Institution 897,196 5,689,135 5,686,107 6.34 6.34 -0.05
South Carolina Digital Library 76,001 231,267 355,504 3.04 4.68 53.72
The New York Public Library 1,169,576 1,995,817 2,515,252 1.71 2.15 26.03
The Portal to Texas History 477,639 5,255,588 5,410,963 11 11.33 2.96
United States Government Printing Office (GPO) 148,715 456,363 768,830 3.07 5.17 68.47
University of Illinois at Urbana-Champaign 18,103 67,954 85,263 3.75 4.71 25.47
University of Southern California. Libraries 301,325 859,868 905,465 2.85 3 5.30
University of Virginia Library 30,188 93,378 123,405 3.09 4.09 32.16

The number of unique subjects before and after subject normalization is presented in the table below.  The percent of change is also included in the final column.

Hub Unique Subjects Unique Normalized Subjects % Change Unique
ARTstor 9,560 9,546 -0.15
Biodiversity Heritage Library 22,004 22,005 0
David Rumsey 123 123 0
Digital Commonwealth 41,704 39,557 -5.15
Digital Library of Georgia 132,160 88,200 -33.26
Harvard Library 9,257 6,210 -32.92
HathiTrust 685,733 272,340 -60.28
Internet Archive 56,911 49,117 -13.70
J. Paul Getty Trust 2,777 2,560 -7.81
Kentucky Digital Library 1,972 1,831 -7.15
Minnesota Digital Library 24,472 24,325 -0.60
Missouri Hub 6,893 6,757 -1.97
Mountain West Digital Library 227,755 172,663 -24.19
National Archives and Records Administration 7,086 7,086 0
North Carolina Digital Heritage Center 99,258 79,353 -20.05
Smithsonian Institution 348,302 346,096 -0.63
South Carolina Digital Library 23,842 17,516 -26.53
The New York Public Library 69,210 36,709 -46.96
The Portal to Texas History 104,566 97,441 -6.81
United States Government Printing Office (GPO) 174,067 48,537 -72.12
University of Illinois at Urbana-Champaign 6,183 5,724 -7.42
University of Southern California. Libraries 65,958 64,021 -2.94
University of Virginia Library 3,736 3,664 -1.93

The number and percentage of subjects and normalized subjects that are unique and also unique to a given hub is presented in the table below.

Hub Subjects Unique to Hub Normalized Subject Unique to Hub % Subjects Unique to Hub % Normalized Subjects Unique to Hub % Change
ARTstor 4,941 4,806 52 50 -4
Biodiversity Heritage Library 9,136 6,929 42 31 -26
David Rumsey 30 28 24 23 -4
Digital Commonwealth 31,094 27,712 75 70 -7
Digital Library of Georgia 114,689 67,768 87 77 -11
Harvard Library 7,204 3,238 78 52 -33
HathiTrust 570,292 200,652 83 74 -11
Internet Archive 28,978 23,387 51 48 -6
J. Paul Getty Trust 1,852 1,337 67 52 -22
Kentucky Digital Library 1,337 1,111 68 61 -10
Minnesota Digital Library 17,545 17,145 72 70 -3
Missouri Hub 4,338 3,783 63 56 -11
Mountain West Digital Library 192,501 134,870 85 78 -8
National Archives and Records Administration 3,589 3,399 51 48 -6
North Carolina Digital Heritage Center 84,203 62,406 85 79 -7
Smithsonian Institution 325,878 322,945 94 93 -1
South Carolina Digital Library 18,110 9,767 76 56 -26
The New York Public Library 52,002 18,075 75 49 -35
The Portal to Texas History 87,076 78,153 83 80 -4
United States Government Printing Office (GPO) 105,389 15,702 61 32 -48
University of Illinois at Urbana-Champaign 3,076 2,322 50 41 -18
University of Southern California. Libraries 51,822 48,889 79 76 -4
University of Virginia Library 2,425 1,134 65 31 -52

Conclusion

Overall there was an increase (20%) in the total occurrences of subject strings in the dataset when subject normalization was applied. The total number of unique subjects decreased significantly (38%) after subject normalization.  It is easy to identify Hubs that are heavy users of LCSH subject headings because the percent change in the number of unique subjects before and after normalization is quite high;  examples of this include the HathiTrust and the Government Printing Office. For many of the Hubs,  normalization of subjects significantly reduced the number and percentage of subjects that were unique to that Hub.

I hope you found this post interesting.  If you want to chat about the topic, hit me up on Twitter.