Monthly Archives: January 2015

UNT Libraries’ Digital Collections: 2014 Review – Items Added

One thing that tends to be hard in the digital library world is understanding how a given program is doing in relation to other programs throughout the country. This information can help justify funds spent locally on digital library initiatives. The same information can be used within a department to understand whether workflows are on par with others throughout the country or region.

Most often the numbers that are reported are those required by membership groups such as ARL, ACRL, and others who have a token question or two about digital library statistics, but most people involved with those numbers know that they are often... unclear at best.

Some of the dimensions that are available to look at include traffic to the digital library system: visitors, page views, time on site, and referral traffic. Locally we use Google Analytics for this data at the repository level. How a digital library’s items get used is another metric that is helpful in knowing the impact of these resources. This can be measured in a wide range of ways, and there are initiatives such as COUNTER that provide some guidance for this sort of work, but it feels more focused on “Electronic Resources” and doesn’t really handle the range of cases we run into in digital library/repository land. The University of Florida Digital Collections makes usage data for each item in the collection easily obtainable, and many modern DSpace instances also have great reporting on usage of items. I’ve talked a little about how UNT Libraries calculates “uses” for our digital library collections here and here. The final area that is often reported on is the collection growth of the repository, either in the number of items added, number of bytes (or GB, TB) added, or number of files added in a given year.

I think walking through some of these metrics in a series of posts will be helpful for me to articulate some of the opportunities that are available if the digital libraries/repository community openly shared more of this data. There are of course organizations such as Hathi Trust, the Digital Public Library of America, and others who make growth data available front and center, but for most of our repositories it is pretty hidden.

The data that I’m showing in this post is from the UNT Libraries Digital Collections, which contains three separate digital library interfaces: The Portal to Texas History, the UNT Digital Library, and The Gateway to Oklahoma History. All three of these interfaces are powered by the same repository infrastructure on the backend and are made searchable by a unified Solr index. The datasets here are from that Solr instance directly.
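For readers who want to pull similar numbers from their own Solr index, a per-month count can be requested with a range facet. The sketch below only builds the query parameters. The field name date_added is a hypothetical stand-in (the post does not name the actual field), while the facet.range parameters are standard Solr.

```python
from urllib.parse import urlencode


def monthly_facet_params(year, field="date_added"):
    """Build Solr range-facet parameters that count documents per month.

    `field` is a hypothetical date field name; substitute your own.
    """
    return {
        "q": "*:*",
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.range": field,
        "facet.range.start": f"{year}-01-01T00:00:00Z",
        "facet.range.end": f"{year + 1}-01-01T00:00:00Z",
        "facet.range.gap": "+1MONTH",
    }


params = monthly_facet_params(2014)
query_string = urlencode(params)
print(query_string)
```

The resulting query string can be appended to a core's select URL; each facet bucket then corresponds to one row of a month-by-month table.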

Items added per month

From Jan 1 to Dec 31, 2014 the UNT Libraries Digital Collections added 417,645 unique digital resources to its holdings. The breakdown of the monthly additions looks like this:

Month	Items Added
January	32,074
February	9,220
March	7,758
April	11,161
May	11,475
June	32,549
July	18,503
August	67,769
September	83,916
October	25,537
November	73,404
December	44,279

A better way to look at this might be a simple chart.

UNT Libraries Digital Collections: Growth by Month in 2014

Or looked at a different way.

The average number of items added to the system in 2014 by month is 34,803.
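As a quick check, the yearly total and the monthly average can be recomputed from the table above. This is only a sketch using the figures as published; the exact mean comes out to 34,803.75.

```python
# Monthly additions for 2014, copied from the table above.
monthly_items = {
    "January": 32074, "February": 9220, "March": 7758, "April": 11161,
    "May": 11475, "June": 32549, "July": 18503, "August": 67769,
    "September": 83916, "October": 25537, "November": 73404,
    "December": 44279,
}

total = sum(monthly_items.values())
average = total / len(monthly_items)
busiest = max(monthly_items, key=monthly_items.get)

print(total)    # 417645
print(average)  # 34803.75
print(busiest)  # September
```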

Wait, What is an Item, Object, Resource?

A little side trip is needed so that we are on the same page. For us a “digital object” or “digital item” or “digital resource” is the intellectual unit to which a descriptive metadata record is assigned. This may be a scan of a photographic negative, front and back scans of a physical photographic print, a book, letter, pamphlet, map, or issue of a newspaper. In most instances there are multiple files/images/pages per item in our system, but here we are just talking about those larger units and not the files that make up the items themselves.

Items added per day

In looking at the daily data for the year, there were 215 days on which new content was processed and added to the collection, with no processing being done on 150 days. The average number of items added per day during the year was 1,144 items. If we think about a ten-hour work day (roughly when the library is open for normal folks) that’s 114 items per hour, or 1.9 new items created per minute during the work week last year.
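The per-day, per-hour, and per-minute figures follow from simple division, sketched below. Note that the 1,144 average divides by all 365 calendar days; dividing by only the 215 days with ingest activity gives a noticeably higher rate.

```python
total_items = 417_645
days_in_year = 365
days_with_ingest = 215
hours_per_work_day = 10

per_calendar_day = total_items / days_in_year    # about 1,144
per_ingest_day = total_items / days_with_ingest  # about 1,943
per_hour = per_calendar_day / hours_per_work_day
per_minute = per_hour / 60

print(round(per_calendar_day), round(per_ingest_day))
print(int(per_hour), round(per_minute, 1))
```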

Items by Type

I thought it might be interesting to see how the 417,645 items were distributed among the various resource types that we categorize records into. Here is that table.

Resource Type	Items
image_photo	197,133
text_newspaper	109,456
image_map	66,637
text_report	12,569
text	9,517
text_patent	7,052
text_etd	4,449
physical-object	3,573
text_leg	1,660
text_book	1,171
text_journal	1,063
video	804
text_article	494
image_postcard	366
collection	347
text_pamphlet	346
text_letter	235
text_legal	216
text_yearbook	180
image_presentation	96
image_artwork	44
text_clipping	44
dataset	36
image_poster	30
image	26
text_paper	23
image_score	22
sound	17
website	13
text_review	12
text_chapter	8
text_prose	5
text_poem	1

As you can see the majority of all of the items added were in the category of image_photo (Photographs) or text_newspaper (Newspapers) with those two types accounting for 73% of the new additions to the system.
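The 73% figure can be reproduced from the top rows of the table. A small sketch, using only the three largest types:

```python
total_items = 417_645
items_by_type = {
    "image_photo": 197_133,
    "text_newspaper": 109_456,
    "image_map": 66_637,
}

top_two = items_by_type["image_photo"] + items_by_type["text_newspaper"]
top_two_share = top_two / total_items
top_three_share = sum(items_by_type.values()) / total_items

print(f"{top_two_share:.1%}")    # 73.4%
print(f"{top_three_share:.1%}")  # 89.4%
```

Adding maps to photographs and newspapers accounts for roughly nine out of ten items added during the year.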

Closing

As I mentioned at the beginning of this post, I think knowing the metrics of other digital library programs is helpful for local initiatives in a number of ways. The UNT Libraries had a very successful year for adding new content. Over the past few years we’ve been able to double the number of items each year; I don’t think that’s a rate of growth that we can keep up, but it is always fun to try. How do repository systems at your institution look in relation to this? Sharing that data more broadly would be helpful to the digital library community overall, and I encourage others to take some time and make this data available.

If you have any specific questions for me let me know on twitter.

What do we put in our BagIt bag-info.txt files?

The UNT Libraries makes heavy use of the BagIt packaging format throughout our digital repository infrastructure. I’m of the opinion that BagIt has contributed more toward moving digital preservation forward in the last ten years than any other single technology, service, or specification. The UNT Libraries uses BagIt for our Submission Information Packages (SIP), our Archival Information Packages (AIP), our Dissemination Information Packages (DIP), and our local Access Content Package (ACP).

For those that don’t know BagIt, it is a set of conventions for packaging content into a directory structure in a consistent and repeatable way. There are a number of other descriptions of BagIt that do a very good job of describing the conventions and some of the more specific bits of the specification.

There are a number of great tools for creating, modifying and validating BagIt bags, and my favorite for a long time has been bagit-python from the Library of Congress. (To be honest I am usually using Ed Summers’ fork, which I grab from here.)

The BagIt specification has a metadata file that is stored in the root of a bag; this metadata file is called bag-info.txt. The BagIt specification defines a number of fields for this file, which are stored as key-value pairs in the following format.

key: value

I thought it might be helpful for those new to using BagIt bags to see what kinds of information we are putting into these bag-info.txt files, and also explain some of the unique fields that we are adding to the file for managing items in our system. Below is a typical bag-info.txt file from one of our AIPs in the Coda Repository.

Bag-Size: 28.32M
Bagging-Date: 2015-01-23
CODA-Ingest-Batch-Identifier: f2dbfd7e-9dc5-43fd-975a-8a47e665e09f
CODA-Ingest-Timestamp: 2015-01-22T21:43:33-0600
Contact-Email: mark.phillips@unt.edu
Contact-Name: Mark Phillips
Contact-Phone: 940-369-7809
External-Description: Collection of photographs held by the University of North
 Texas Archives that were taken by Junebug Clark or other family
 members. Master files are tiff images.
External-Identifier: ark:/67531/metadc488207
Internal-Sender-Identifier: UNTA_AR0749-002-0016-0017
Organization-Address: P. O. Box 305190, Denton, TX 76203-5190
Payload-Oxum: 29666559.4
Source-Organization: University of North Texas Libraries

In the example above, several of the fields are boiler plate, and others are machine generated.

Field	How we create the Value
Bag-Size	Machine
Bagging-Date	Machine
CODA-Ingest-Batch-Identifier	Machine
CODA-Ingest-Timestamp	Machine
Contact-Email	Boiler-Plate
Contact-Name	Boiler-Plate
Contact-Phone	Boiler-Plate
External-Description	Changes per “collection”
External-Identifier	Machine
Internal-Sender-Identifier	Machine
Organization-Address	Boiler-Plate
Payload-Oxum	Machine
Source-Organization	Boiler-Plate
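For those new to the format, the sketch below shows how a bag-info.txt file like the one above can be read by hand: each field is a key: value line, and a line that starts with whitespace continues the previous value. Tools such as bagit-python read this file for you; the function names here are made up for illustration. The Payload-Oxum value is the total number of payload bytes and the number of payload files, separated by a period.

```python
def parse_bag_info(text):
    """Parse bag-info.txt content into a dict of field name -> value.

    Lines that start with whitespace continue the previous field's value.
    Repeated field names are not handled here (the last one wins).
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and current is not None:
            fields[current] += " " + line.strip()
        else:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


def parse_payload_oxum(oxum):
    """Split a Payload-Oxum value into (total_bytes, file_count)."""
    octets, _, streams = oxum.partition(".")
    return int(octets), int(streams)


sample = """\
External-Description: Collection of photographs held by the University of North
 Texas Archives that were taken by Junebug Clark or other family
 members. Master files are tiff images.
External-Identifier: ark:/67531/metadc488207
Payload-Oxum: 29666559.4
"""

info = parse_bag_info(sample)
print(info["External-Identifier"])
print(parse_payload_oxum(info["Payload-Oxum"]))
```

For the example bag this reports 29,666,559 bytes spread over 4 payload files, which is a cheap sanity check to run before a full checksum validation.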

You can tell from looking at the example bag-info.txt file above that some of the fields are very self explanatory. I’m going to run over a few of the fields that either are non-standard, or that we’ve made explicit decisions on as we were implementing BagIt.

CODA-Ingest-Batch-Identifier is a UUID for each batch of content added to our Coda Repository. This helps us identify other items that may have been added during a specific run of our ingest process, which is helpful for troubleshooting.

CODA-Ingest-Timestamp is the timestamp when the AIP was added to the Coda Repository.

External-Description will change for each collection that gets processed. It has just enough information about the collection to help jog someone’s memory about where this item came from and why it was created.

External-Identifier is the ARK identifier assigned to the item on ingest into one of the Aubrey systems where we access the items or manage the descriptive metadata.

Internal-Sender-Identifier is the locally important (often not unique) identifier for the item as it is being digitized or collected. It often takes the shape of an accession number from our University Special Collections, or the folder name of an issue of newspaper.

We currently have 1,070,180 BagIt bags in our Coda Repository and they have been instrumental in us being able to scale our digital library infrastructure and verify that each item is just the same as when we added it to our collection.

If you have any specific questions for me let me know on twitter.

Using the Extended Date Time Format (EDTF)

The UNT Libraries has adopted the use of the Extended Date Time Format (EDTF) from the Library of Congress for our various digital library systems.

This allows us to use a machine readable notation to represent some of the date values that we run across in digital libraries (or libraries, archives and museums in general).

Things like

circa 1922
July 2014
Winter 2000
3rd of July (maybe?) 1922

Which get represented in EDTF as

1922~
2014-07
2000-24
1922-07?-03

I’ve been interested in working with date values in digital libraries for a while because they are one of the major ways that we try and slice our collections in an attempt to do interesting things with discovery, display and exploring.

The work that really got me thinking about working with date values in metadata was Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers and the related Specifications for Metadata Processing Tools by Roy Tennant back in 2004, when he was at the California Digital Library working on a metadata aggregation and parsing project. Out of that project came an interesting Date Parsing Utility.

In the Bitter Harvest paper referenced above, the following list of dates was given by Roy as an example of the kinds of date strings that are pretty normal to see in metadata in digital libraries.

1991-10-01
ca. 1920.
(ca). 1920)
2001.06.08 by CAD
Unknown
ca. June 19, 1901.
(ca). June 19, 1901)
[2001 or 2002.]
1853.
c1875.
c1908 November 19
c1905
1929 June 6
[between 1904 and 1908]
[ca. 1967]
1918 ?
[1919 ?]
191-?
1870 December, c1871
1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929

We were seeing the same kinds of date values in our records and didn’t really have a sane way of dealing with them so we decided to become early adopters of the EDTF because it gave us a specification to follow to encode most of these kinds of date values.

In 2013 Hannah Tarver presented a paper at the Dublin Core Metadata Initiative conference in Lisbon titled Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library that discussed some analysis we did at UNT over the 460,000 metadata records that were in the UNT Libraries Digital Collections at the time. Poke around in that paper, it is kind of interesting (full disclosure, I’m a co-author so very biased).

Tools for working with EDTF

There are a few tools listed on the EDTF website for working with the EDTF date values. At UNT we developed one of these tools to use in our system and I wanted to talk a little about what it does and how it could be used by others.

We wrote a Python-based EDTF validator which will verify whether a given string is in fact a valid EDTF string or is incorrectly formatted. The source for this validator is available in the UNT Libraries’ GitHub account as ExtendedDateTimeFormat.

In order to use this tool in our local systems we developed a simple JSON Web service that wraps the validation tool in an HTTP based API for validating these dates. You can interact with this service and read the documentation at the site Extended Date Time Format Levels 0, 1 and 2 Validation Service.

For the People Users

We use this service in our metadata editing tool every day. When a user enters or adjusts a date in a date field, the interface calls this Web service, which in turn returns True or False to indicate whether the submitted date string is valid. We present this information back to the user so that they know they need to adjust the formatting of the date string they supplied.

Invalid date notification in UNT Libraries Metadata Editor

The image above shows what the notification of an invalid date in the user interface looks like. If the date string is a valid EDTF then it will not produce any notification.

Because the EDTF is pretty foreign to most folks when they first get started using it we try and offer examples of how to notate common date concepts in our Descriptive Metadata Input Guidelines. You can view the date specific examples at this link.

For the Computer Users

We also make use of the ExtendedDateTimeFormat Python module directly when we index each record into our Solr index. We have a field “valid_dates” which is a boolean field with True set as the default value. When we index a record that has dates, each of the dates is validated against the EDTF validator. If any of them fail validation the field is stored with a False value instead of True. This allows us to quickly view the number of records which have invalid dates of some sort and work to remove those invalid dates from the database. Once identified they can pretty easily be corrected.
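The indexing rule described above can be sketched in a few lines. The validator used here is a deliberately tiny stand-in that only knows a handful of patterns; it is not the real EDTF grammar and not the ExtendedDateTimeFormat module, and the function names are made up for illustration.

```python
import re

# Stand-in validator: accepts only YYYY, YYYY-MM and YYYY-MM-DD, each with an
# optional trailing "?" or "~". This is NOT the full EDTF grammar.
_SIMPLE_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?[?~]?$")


def simple_is_valid(date_string):
    return bool(_SIMPLE_DATE.match(date_string))


def valid_dates_flag(dates, is_valid=simple_is_valid):
    """Return the value for a boolean `valid_dates` index field.

    The flag is True by default (including for records without dates) and
    becomes False as soon as one date fails validation.
    """
    return all(is_valid(d) for d in dates)


print(valid_dates_flag(["1922~", "2014-07"]))      # True
print(valid_dates_flag(["2014-07", "ca. 1920."]))  # False
print(valid_dates_flag([]))                        # True
```

Passing the validator in as a parameter keeps the indexing rule separate from the grammar, so a real EDTF validator can be swapped in without touching the rest.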

For the Public Users

Finally we wanted to be able to turn some of these date strings into a more human readable format in our end user interfaces. EDTF is first a machine readable format which isn’t always obvious to people users. For example most people don’t really know what 2012-21 means and most actually struggle with 1999-04. For many of the easy to convert date strings we try and provide an equivalent for the user such as “Spring 2012” or “April 1999” instead. For the more complex strings we provide a link to a guide on how to interpret the EDTF used in our records.
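A minimal sketch of that kind of conversion is shown below. It only handles the year-month and year-season forms mentioned above and returns anything else unchanged. The function name and the exact display labels are assumptions, not what our interfaces necessarily use.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# EDTF season codes used in place of a month number.
SEASONS = {"21": "Spring", "22": "Summer", "23": "Autumn", "24": "Winter"}


def humanize_edtf(value):
    """Return a friendlier label for simple EDTF strings, else the input."""
    match = re.fullmatch(r"(\d{4})-(\d{2})", value)
    if not match:
        return value
    year, part = match.groups()
    if part in SEASONS:
        return f"{SEASONS[part]} {year}"
    if 1 <= int(part) <= 12:
        return f"{MONTHS[int(part) - 1]} {year}"
    return value


print(humanize_edtf("2012-21"))  # Spring 2012
print(humanize_edtf("1999-04"))  # April 1999
print(humanize_edtf("1922~"))    # 1922~
```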

Over the past few years we’ve found that the EDTF has worked well for a large majority of dates that we run into in our system. There are some concepts that are a little challenging to understand at first but in no time users are able to start using the EDTF to its fullest.

The next challenge that we don’t exactly have an answer to is exactly how some of these date strings should be indexed so that they can be faceted and sorted along with other more common ISO standard dates.

If you have any specific questions for me let me know on twitter.

Digital Preservation System Interfaces: UNT Libraries Coda Repository

I mentioned to a colleague that I would be happy to do a short writeup of some of the interfaces that we have for our digital preservation system. This post is trying to move forward that conversation a bit.

System 1, System 2

At UNT we manage our digital objects in a consistent and unified way. What this means in practice is that there is one way to do everything: items are digitized, collected, or created, then staged for ingest into the repository, and everything moves into the system in the same way. We have two software stacks that we use for managing our digital items, Aubrey and Coda.

Aubrey is our front-end interface which provides end user access to resources, search, browsing, and display. For managers it provides a framework for defining collections and partners, and most importantly it has a framework for creating and managing metadata for the digital objects. Nearly all (99.9%) of the daily interaction with the UNT Libraries Digital Collections is through Aubrey via one of its front-end user interfaces: The Portal to Texas History, the UNT Digital Library, or The Gateway to Oklahoma History.

Aubrey manages the presentation versions of a digital object, locally we refer to this package of files as an Access Content Package, or ACP. The other system in this pair is a system we call Coda. Coda is responsible for managing the Archival Information Packages (AIP) in our infrastructure. Coda was designed to manage a collection of BagIt Bags, help with the replication of these bags and allow curators and managers to access the master digital objects if needed.

What does it look like though?

The conversation I had with a colleague was around user interfaces to the preservation archive, how much or how little we are providing and our general thinking about that system’s user interfaces. Typically these interfaces are “back-end” and usually are never seen by a larger audience because of layers of authentication and restriction. I wanted to take a few screenshots and talk about some of the interactions that users have with these systems.

Main Views

The primary views for the system include a dashboard view which gives you an overview of the happenings within the Coda Repository.

UNT Libraries’ Coda Dashboard

From this page you can navigate to lists for the various sub-areas within the repository. If you want to view a list of all of the Bags in the system you are able to get there by clicking on the Bags tile.

Bag List View – UNT Libraries’ Coda Repository

The storage nodes that are currently registered with the system are available via the Nodes button. This view is especially helpful in gauging the available storage resources and deciding which storage node to write new objects to. Typically we use one storage node until it is completely filled and then move onto another storage node.
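The fill-one-node-then-move-on approach can be sketched as a simple first-fit choice over an ordered list of nodes. Everything here is hypothetical: the tuple layout, the node names, and the capacities are made up for illustration and do not reflect how Coda stores node information.

```python
def pick_storage_node(nodes, needed_bytes):
    """Return the name of the first node with room for `needed_bytes`.

    `nodes` is an ordered list of (name, capacity_bytes, used_bytes) tuples.
    Earlier nodes are preferred so that one node fills up before the next
    one is used. Returns None when no node has enough free space.
    """
    for name, capacity, used in nodes:
        if capacity - used >= needed_bytes:
            return name
    return None


# Hypothetical nodes with made-up capacities (in bytes).
nodes = [
    ("node-01", 10**12, 10**12),       # completely full
    ("node-02", 10**12, 10**12 - 50),  # 50 bytes free
    ("node-03", 10**12, 0),            # empty
]

print(pick_storage_node(nodes, 29_666_559))  # node-03
print(pick_storage_node(nodes, 10))          # node-02
print(pick_storage_node(nodes, 10**13))      # None
```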

Nodes List View – UNT Libraries’ Coda Repository

For events in the Coda system, including ingest, replication, migration, and fixity check, we create and store a PREMIS Event. These are aggregated using the PREMIS Event Service.

PREMIS Event List View – UNT Libraries’ Coda Repository

The primary Coda instance is considered the Coda instance of record and additional Coda instances will poll the primary for new items to replicate. They do this using ResourceSync to broadcast available resources and their constituent files. Because the primary Coda system does not have queued items this list is empty.

Replication Queue List View – UNT Libraries’ Coda Repository

To manage information about what piece of software is responsible for an event on an object we have a simple interface to list PREMIS Agents that are known to the system.

PREMIS Agents List View – UNT Libraries’ Coda Repository

Secondary Views

With the primary views out of the way the next level that we have screens for are the detail views. There are detail views for most of the previous screens once you’ve clicked on a link.

Below is the detail view of a Bag in the Coda system. You will see the parsed bag-info.txt fields as well as PREMIS Events that are associated with this resource. The buttons at the top will get you to a list of URLs that, when downloaded, will re-constitute a given Bag of content, and to the Atom Feed for the object.

Bag Detail View – UNT Libraries’ Coda Repository

Here is a URLs list. If you download all of these files and keep the hierarchy of the folders, you can validate the Bag and have a validated version of the item plus additional metadata. This is effectively the Dissemination Information Package for the system.
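Keeping the folder hierarchy while downloading comes down to mapping each URL to a path relative to the bag root. The sketch below shows one way to do that; the URLs are hypothetical (the real Coda URL layout is not shown in this post) and the function name is made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def relative_bag_paths(urls):
    """Map each file URL to a path relative to the bag's root directory.

    Assumes every URL belongs to the same bag and that at least one file
    (such as bagit.txt) sits directly in the bag root, so the deepest
    directory shared by all URLs is the bag root.
    """
    paths = [unquote(urlsplit(url).path) for url in urls]
    root = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    return {url: posixpath.relpath(p, root) for url, p in zip(urls, paths)}


# Hypothetical URLs for a single bag.
base = "https://coda.example.org/bag/coda0001"
urls = [
    base + "/bagit.txt",
    base + "/bag-info.txt",
    base + "/manifest-md5.txt",
    base + "/data/01_tif/0001.tif",
]

targets = relative_bag_paths(urls)
for path in targets.values():
    print(path)
```

Writing each download to its relative path under a fresh directory yields a folder that a BagIt tool can validate.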

Coda URLs List – UNT Libraries’ Coda Repository

An Atom Feed is created for each document as well which can be used by the AtomPub interface for the system. Or just to look at and bask in the glory of angle brackets.

Atom Feed for Bag – UNT Libraries’ Coda Repository

Below is the detail view of a PREMIS Event in the repository. You can view the Atom Feed for this document or navigate to the Bag in the system that is associated with this event.

PREMIS Event Detail View – UNT Libraries’ Coda Repository

Below is the detail view of a storage node in the system. These nodes are updated to reflect the current storage statistics for the storage nodes in the system.

Node Detail View – UNT Libraries’ Coda Repository

The detail view of a PREMIS Agent is not too exciting but is included for completeness.

Agent Detail View – UNT Libraries’ Coda Repository

Interacting with Coda

When there is a request for the master/archival/preservation files for a given resource, we find the local identifier for the resource, put that into the Coda repository search box, and do a quick search.

Dashboard with Search – UNT Libraries’ Coda Repository

You will end up with search results for one or more Bags in the repository. If there is more than one for that identifier select the one you want (based on the date, size, or number of files) and go grab the files.

Search Result – UNT Libraries’ Coda Repository

Statistics

The following screens show some of the statistics views for the system. They include the Bags added per month and over time, number of files added per month and over time, and finally the number of bytes added per month and over time.

Stats: Monthly Bags Added – UNT Libraries’ Coda Repository

Stats: Running Bags Added Total – UNT Libraries’ Coda Repository

Stats: Monthly Files Added – UNT Libraries’ Coda Repository

Stats: Running Total of Files Added – UNT Libraries’ Coda Repository

Stats: Monthly Size Added – UNT Libraries’ Coda Repository

Stats: Running Total Sizes – UNT Libraries’ Coda Repository

What’s missing

There are a few things missing from this system that one might notice. First of all is the process of authentication to the system. At this time the system is restricted to a small list of IPs in the library that have access to the system. We are toying around with how we want to handle this access as we begin to have more and more users of the system and direct IP based authentication becomes a bit unwieldy.

Secondly there is a full set of AtomPub interfaces for each of the Bag, Node, PREMIS Event, PREMIS Agent, and Queue sections. This is how new items are added to the system. But that is a little bit out of scope for this post.

If you have any specific questions for me let me know on twitter.

Metadata Quality, Completeness, and Minimally Viable Records

The quality of metadata records for digital library objects is a subject that comes up pretty often at work. We haven’t stumbled upon any solid answers to overall questions about measuring, improving, or evaluating metadata but we have given a few things a try. Here are a few examples of one of these components that we have found useful.

UNTL Metadata

The UNT Libraries’ Digital Collections consists of The Portal to Texas History, the UNT Digital Library, and The Gateway to Oklahoma History. At the time of writing this post we have 1,049,483 metadata records in our metadata store. Our metadata model uses the primary fifteen Dublin Core Metadata Elements, which we describe as “locally qualified”: for example, a Title with a qualifier of “Main Title” or “Added Title”, and a Subject that is a namedPerson, LCSH, or MeSH. In addition to those fifteen elements we have added a few other qualified fields such as Citation, Degree, Partner, Collection, Primary Source, Note, and Meta. Together these make up a metadata format we locally call UNTL. This is all well documented by our Digital Projects Unit on our Metadata Guidelines pages. All of the controlled vocabularies we use as qualifiers to our metadata elements are available in our Controlled Vocabularies App.

We typically serialize our metadata format as an XML record on disk, and each item in our system exposes the raw UNTL metadata record in addition to other formats. Here is an example record for an item in The Portal. To simplify the reading, writing and processing of metadata records in our system we have a Python module called pyuntl that we use for all things UNTL metadata related.

Completeness

The group of folks in the Digital Libraries Division who are interested in metadata quality has been talking about ways to measure quality in our systems for some time. As the conversation isn’t new, we have quite a bit of literature on the subject to review. We noticed that when librarians begin to talk about “quality of metadata” we tend to get a bit bristly, saying “well what really is quality” and “but not in all situations” and so on. We wanted to come up with a metric that we could use as a demonstration of the value of defining some of the concepts of quality and moving them into a system in an actionable way. We decided that a notion of completeness could be a good way of moving forward because, once the required fields for a record are defined, it is easy to assess in a neutral fashion whether a metadata record has the required fields or not.

For our completeness metric we identified the following fields as needing to be present in a metadata record in order for us to consider it a “minimally viable record” in our system.

Title
Description
Subject/Keywords
Language
Collection
Partner
Resource Type
Format
Meta Information for Record

The idea was that even if one did not know much of anything about an object, we would be able to describe it at the surface level: assign it a title, language value, and subject/keyword, and give it a resource type and format. The meta information about the item and the Partner (institution) and Collection elements are important for keeping track of where items come from and who owns them, and for making the system work properly. We also assigned a weight to each of these fields, since some of the elements carry more weight than others. Here is that breakdown.

Title = 10
Description = 1
Subject/Keywords = 1
Language = 1
Collection = 1
Partner = 10
Resource Type = 5
Format = 1
Meta Information for Record = 20

The pyuntl library has functionality built into it that calculates a completeness score from 0.0 to 1.0 based on these weights, with a record with a score of 1.0 being “complete” or at least “minimally viable” and records with a score lower than 1.0 being deficient in some way.
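A minimal sketch of how such a weighted score could be computed is shown below. It is not the pyuntl implementation: the field keys and the function name are made up for illustration, and the weights are simply the ones listed above, which add up to 50. The scores reported later in this post look like fractions of 59 (for example 58/59 is about 0.9830508), so the weights in the deployed code may differ slightly from this list.

```python
# Weights as listed in this post; the dictionary keys are illustrative only.
WEIGHTS = {
    "title": 10,
    "description": 1,
    "subject": 1,
    "language": 1,
    "collection": 1,
    "partner": 10,
    "resource_type": 5,
    "format": 1,
    "meta": 20,
}


def completeness(record, weights=WEIGHTS):
    """Return the weighted share (0.0 to 1.0) of required fields present.

    `record` maps field names to values; empty values count as missing.
    """
    total = sum(weights.values())
    present = sum(w for field, w in weights.items() if record.get(field))
    return present / total


full = {field: "x" for field in WEIGHTS}
no_description = {k: v for k, v in full.items() if k != "description"}

print(completeness(full))            # 1.0
print(completeness(no_description))  # 0.98
```

Because the weights differ so much, a record missing only a description still scores close to 1.0, while one missing its meta information drops sharply.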

In addition to this calculated metric we try and provide metadata creators with visual cues indicating that a “required” field is missing or partial. The image below is an example of what is shown to a metadata editor as they are creating metadata records.

Metadata Editing Interface for the UNT Libraries

Detail of fields sidecar for uncompleted metadata record

Metadata editing interface for partially completed record in the UNT Libraries Digital Collections

Detail of fields sidecar for partially completed metadata record.

Our hope is that by providing this information to metadata creators, they will know when they have created at least a minimally viable metadata record, or if they pull up a record to edit that is not “complete” that they can quickly assess and fix the problem.

When we are indexing our items into our search system we calculate and store this metric so that we can take a look at our metadata records as a whole and see how we are doing. While we currently only use this metric behind the scenes, we are hoping to move it more front and center in the metadata editing interface. As it stands today here is the breakdown of records in the system.

Completeness Score	Number of Records
1.0	1,007,503
0.9830508	17,307
0.9661017	12,142
0.9491525	12,526
0.9322034	1
0.7966102	3
0.779661	1

As can be seen there is still quite a bit of cleanup to be done on the records in the system in order to make the whole dataset “minimally viable”, but it is now pretty easy to identify which records are missing which fields and edit them accordingly. This allows us to focus directly on metadata editing tasks which can be directly measured as “improvements” to the system as a whole.

How are other institutions addressing this problem? Are we missing something? Hit me up on twitter if this is an area of interest.