
Date values in the UNT Libraries’ Digital Collections

This past week I was clearing out a bunch of software feature request tickets to prepare for a feature push for our digital library system.  We are getting ready to do a redesign of The Portal to Texas History and the UNT Digital Library interfaces.

Buried deep in our ticketing system were some tickets made during the past five years that included notes about future implementations that we could create for the system.  One of these notes caught my eye because it had the phrase “since date data is so poor in the system”.  At first I had dismissed this phrase and ticket altogether because our ideas related to the feature request had changed, but later that phrase stuck with me a bit.

I began to wonder, “What is the quality of our date data in our digital library?” and, more specifically, “What does the date resolution look like across the UNT Libraries’ Digital Collections?”

Getting the Data

The first thing to do was to grab all of the date data for each record in the system.  At the time of writing there were 1,310,415 items in the UNT Libraries Digital Collections.  I decided the easiest way to grab the date information for these records was to pull it from our Solr index.

I constructed a Solr query that would return the value of our dc_date field, the ARK identifier we use to uniquely identify each item in the repository, and finally which of the systems (Portal, Digital Library, or Gateway) a record belongs to.

I pulled these down as JSON files with 10,000 records per request; 132 requests later, I was in business.
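If you're curious what that pull looks like in practice, here is a minimal sketch of the paging loop (the Solr URL and the ark/system field names here are placeholders, not our actual schema):

    import json
    import requests

    # Placeholder Solr endpoint and field names; adjust for the index you are hitting.
    SOLR_URL = "http://localhost:8983/solr/select"
    FIELDS = "ark,dc_date,system"  # assumed names for the ark, date, and system fields
    ROWS = 10000

    start = 0
    while True:
        params = {"q": "*:*", "fl": FIELDS, "wt": "json", "rows": ROWS, "start": start}
        docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
        if not docs:
            break
        # Save each 10,000-record page to its own JSON file.
        with open("solr_page_%03d.json" % (start // ROWS), "w") as out:
            json.dump(docs, out)
        start += ROWS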

I wrote a short little Python script that takes those Solr responses and converts them into a tab-separated format that looks like this:

ark:/67531/metapth2355  1844-01-01  PTH
ark:/67531/metapth2356  1845-01-01  PTH
ark:/67531/metapth2357  1845-01-01  PTH
ark:/67531/metapth2358  1844-01-01  PTH
ark:/67531/metapth2359  1844-01-01  PTH
ark:/67531/metapth2360  1844  PTH
ark:/67531/metapth2361  1845-01-01  PTH
ark:/67531/metapth2362  1883-01-01  PTH
ark:/67531/metapth2363  1844  PTH
ark:/67531/metapth2365  1845  PTH
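The conversion script doesn't need to be much more than something like this (again a sketch, assuming the page files and field names from the pull above):

    import glob
    import json

    with open("dates.tsv", "w") as out:
        for page in sorted(glob.glob("solr_page_*.json")):
            with open(page) as f:
                docs = json.load(f)
            for doc in docs:
                # Missing values come through as empty strings and get handled later.
                row = [str(doc.get(field, "")) for field in ("ark", "dc_date", "system")]
                out.write("\t".join(row) + "\n")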

Next I wrote another Python script that classifies a date into the following categories:

  • Day
  • Month
  • Year
  • Other-EDTF
  • Unknown
  • None

Day, Month, and Year are the three units I’m really curious about; I identified these with simple regular expressions for yyyy-mm-dd, yyyy-mm, and yyyy respectively.  For records with date strings that weren’t day, month, or year, I checked whether the string was a valid Extended Date Time Format (EDTF) string.  If it was valid EDTF I marked it as Other-EDTF; if it wasn’t valid EDTF and wasn’t a day, month, or year I marked it as Unknown.  Finally, if there wasn’t a date present for a metadata record at all, it was marked as “None”.
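The heart of that script is just a few patterns and a fallback chain, roughly like the sketch below (the EDTF check is a placeholder stub here; a real run could swap in an EDTF validation library such as the UNT Libraries' edtf-validate):

    import re
    import sys

    DAY_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    MONTH_RE = re.compile(r"^\d{4}-\d{2}$")
    YEAR_RE = re.compile(r"^\d{4}$")

    def is_valid_edtf(value):
        """Placeholder for a real EDTF validator."""
        return False

    def classify(date_string):
        if not date_string:
            return "None"
        if DAY_RE.match(date_string):
            return "Day"
        if MONTH_RE.match(date_string):
            return "Month"
        if YEAR_RE.match(date_string):
            return "Year"
        if is_valid_edtf(date_string):
            return "Other-EDTF"
        return "Unknown"

    if __name__ == "__main__":
        # Reads the ark/date/system TSV on stdin, writes the ark and its category.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            ark = fields[0]
            date_value = fields[1] if len(fields) > 1 else ""
            print("%s\t%s" % (ark, classify(date_value)))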

One thing to note about the way I’m doing the categories: I am probably missing quite a few values that have a day, month, or year somewhere in the string by not parsing the EDTF and Unknown strings a little more liberally for days, months, and years.  This is true, but for what I’m trying to accomplish here, I think we can let that slide.

What does the data look like?

The first thing for me to do was to see how many of the records had date strings compared to the number of records that did not have date strings present.

Date values vs none

Looking at the numbers shows 1,222,750 (93%) of records have date strings and 87,665 (7%) are missing them.  Just with those numbers, I think we negate the statement that “date data is poor in the system”.  But maybe just the presence of dates isn’t what the ticket author meant, so we investigate further.

The next thing I did was to see how many of the dates overall were able to be classified as a day, month, or year.  The reasoning for looking at these values is that you can imagine building user interfaces that make use of date values to let users refine their searching activities or browse a collection by date.

Identified Resolution vs Not

This chart shows that the overwhelming majority of objects in our digital library, 1,202,625 (92%), had date values that were either day, month, or year, and only 107,790 (8%) were classified as “Other”.  Now this, I think, does blow away the statement about poor date data quality.

The last thing to look at is how each of the categories stacks up against the others.  Once again, a pie chart.

UNT Digital Libraries Date Resolution Distribution

Here is a table view of the same data.

Date Classification    Instances    Percentage
Day                      967,257         73.8%
Month                     43,952          3.4%
Year                     191,416         14.6%
Other-EDTF                15,866          1.2%
Unknown                    4,259          0.3%
None                      87,665          6.7%

So looking at this data it is clear that the majority of our digital objects have resolution at the “day” level, with 967,257 records, or 73.8% of all records, in the format yyyy-mm-dd.  Year resolution is the second-highest occurrence with 191,416 records, or 14.6%.  Finally, month resolution came in with 43,952 records, or 3.4%.  There were 15,866 records with valid EDTF values, 4,259 with other date values, and finally 87,665 records that did not contain a date at all.
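For what it's worth, producing the counts behind that table from the classifier output only takes a few more lines, something like this (assuming the two-column ark/category output from the sketch above):

    import sys
    from collections import Counter

    counts = Counter()
    for line in sys.stdin:
        ark, category = line.rstrip("\n").split("\t")
        counts[category] += 1

    total = sum(counts.values())
    for category, count in counts.most_common():
        print("%-12s %9d %5.1f%%" % (category, count, 100.0 * count / total))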

Conclusion

I think that I can safely say that we do in fact have a large amount of date data in our digital libraries.  This date data can be parsed easily into day, month and year buckets for use in discovery interfaces, and by doing very basic work with the date strings we are able to account for 92% of all records in the system.

I’d be interested to see where other digital libraries stand on date data, and whether we are similar or different in this regard.  I might hit up my colleagues at the University of Florida because the University of Florida Digital Collections is of similar scale with similar content.  If you would like to work on comparing your digital libraries’ date data, let me know.

Hope you enjoyed my musings here; if you have thoughts or suggestions, or if I missed something, please let me know via Twitter.

Writing the UNT Libraries Digital Collections to tape.

When we created our digital library infrastructure a few years ago, one of the design goals of the system was that we would create write once digital objects for the Archival Information Packages (AIPs) that we store in our Coda repository.

Currently we store two copies of each of these AIPs, one locally in the Willis Library server room and another copy in the UNT System Data Center at the UNT Discovery Park research campus that is five miles north of the main campus.

Over the past year we have been working on a self-audit using the TRAC Criteria and Checklist as part of our goal of demonstrating that the UNT Libraries Digital Collections is a Trusted Digital Repository.  In addition to this TRAC work we’ve also used the NDSA Levels of Preservation to help frame where we are with digital preservation infrastructure, and where we would like to be in the future.

One of the things that I was thinking about recently is what it would take for us to get to Level 3 of the NDSA Levels of Preservation for “Storage and Geographic Location”:

“At least one copy in a geographic location with a different disaster threat”

In thinking about this I was curious what the lowest cost would be for me to get this third copy of my data created, and moved someplace that was outside of our local disaster threat area.

First, some metrics

The UNT Libraries’ Digital Collections has grown considerably over the past five years that we’ve had our current infrastructure.

Growth of the UNT Libraries' Digital Collections

As of this post, we have 1,371,808 bags of data containing 157,952,829 files in our repository, taking up 290.4 TB of storage for each copy we keep.

As you can see by the image above, the growth curve has changed a bit starting in 2014 and is a bit steeper than it had been previously.  From what I can tell it is going to continue at this rate for a while.

So I need to figure out what it would cost to store 290 TB of data in order to get my third copy.

Some options.

There are several options to choose from for where I could store my third copy of data.  I could store my data with a service like Chronopolis, MetaArchive, DPN, or DuraSpace, to name a few.  These all have different cost models and different services, and for what I’m interested in accomplishing with this post and my current musing, these solutions are overkill for what I want.

I could use either a cloud based service like Amazon Glacier, or even work with one of the large high performance computing facilities like TACC at the University of Texas to store a copy of all of my data.  This is another option but again not something I’m interested in musing about in this post.

So what is left?  Well, I could spin up another rack of storage, put our Coda repository software on top of it, and start replicating my third copy, but the problem is getting it into a rack that is several hundred miles away.  UNT doesn’t have any facilities outside of the DFW area, so that is out of the question.

So finally I’m left thinking about tape infrastructure, and specifically about getting an LTO-6 setup to spool a copy of all of our data to and then sending those tapes off to a storage facility, possibly something like the TSLAC Records Management Services for Government Agencies.

Spooling to Tape

So in this little experiment I was interested in finding out how many LTO-6 tapes it would take to store the UNT Libraries Digital Collections.  I pulled a set of data from Coda that contained the 1,371,808 bags of data and the size of each of those bags in bytes.

The uncompressed capacity of LTO-6 tape is 2.5 TB so some quick math says that it will take 116 tapes to write all of my data.  This is probably low because that would assume that I am able to completely fill each of the tapes with exactly 2.5 TB of data.

I figured that there were at least three ways for me to approach distributing digital objects to tape; they are the following:

  • Write items in the order that they were accessioned
  • Write items in order from smallest to largest
  • Fill each tape to the highest capacity before moving to the next

I wrote three small Python scripts that simulated all three of these options to find the number of tapes needed as well as the overall storage efficiency of each method.  I decided I would only fill a tape with 2.4 TB of data to give myself plenty of wiggle room.  Here are the results:

Method                    Number of Tapes    Efficiency
Smallest to Largest                   136        96.91%
In order of accession                 136        96.91%
Fill a tape completely                132        99.85%
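The first two methods are simple enough to simulate with a short sequential-packing function; here is a rough sketch using the same 2.4 TB fill limit (the bin-packing logic for the third method is left out, and this particular efficiency definition is my own assumption):

    TAPE_CAPACITY = 2.4 * 10**12  # bytes per tape, leaving wiggle room under the 2.5 TB maximum

    def simulate_sequential(bag_sizes):
        """Write bags to tape in the given order, starting a new tape
        whenever the next bag will not fit on the current one."""
        tapes = [0]
        for size in bag_sizes:
            if tapes[-1] + size > TAPE_CAPACITY:
                tapes.append(0)
            tapes[-1] += size
        # One possible definition of efficiency: fraction of the allotted capacity actually used.
        efficiency = float(sum(tapes)) / (len(tapes) * TAPE_CAPACITY)
        return len(tapes), efficiency

    # Pass bag sizes in accession order for the second method,
    # or sorted(bag_sizes) for the smallest-to-largest method.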

In my thinking, the simplest way of writing objects to tape would be to order the objects by their accession date, write files to a tape until it is full, and when it is full, start writing to another tape.

If we assume that a tape costs $34, the overhead of this less efficient but simplest way of writing is only four extra tapes, or about $136, which to me is completely worth it.  This way, in the future I could just continue to write tapes as new content gets ingested, picking up where I left off.

So from what I can figure from poking around on Dell.com and various tape retailers, I’m going to be out roughly $10,000 for my initial tape infrastructure, which would include a tape autoloader and a server to stage files onto from our Coda repository.  I would have another cost of $4,352 to get the 136 LTO-6 tapes needed to accommodate my current 290 TB of data in Coda.  If I assume a five-year replacement rate for this technology (so that I can spread the initial costs out over five years), that leaves me with a cost of just about $50 per TB; divided over the five-year lifetime of the technology, that’s about $10 per TB per year.

If you like GB prices better, that works out to roughly $0.05 per GB, or about $0.01 per GB per year.
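For anyone who wants to check my envelope math, the whole calculation fits in a few lines (these are the rough figures from above, estimates rather than quotes):

    infrastructure_cost = 10000.0   # autoloader plus staging server, rough estimate
    tape_cost = 4352.0              # 136 LTO-6 tapes
    total_tb = 290.0
    lifetime_years = 5

    total_cost = infrastructure_cost + tape_cost        # $14,352.00
    per_tb = total_cost / total_tb                      # ~$49.49 per TB over the lifetime
    per_tb_per_year = per_tb / lifetime_years           # ~$9.90 per TB per year
    per_gb = per_tb / 1000.0                            # ~$0.05 per GB over the lifetime
    per_gb_per_year = per_tb_per_year / 1000.0          # ~$0.01 per GB per year
    per_year = total_cost / lifetime_years              # ~$2,870.40 per year

    print(per_tb, per_tb_per_year, per_gb, per_gb_per_year, per_year)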

If I were going to use Amazon Glacier (calculations are from an unofficial Amazon Glacier calculator and assume a whole bunch of things that I’ll gloss over related to data transfer), I come up with a cost of $35,283.33 per year instead of my roughly calculated $2,870.40 per year.  (I realize these cost comparisons aren’t for the same service and Glacier includes extra redundancy, but you get the point, I think.)

There is going to be another cost associated with this: the off-site storage of 136 LTO-6 tapes.  As of right now I don’t have any idea of those costs, but I assume it could be handled anywhere from very cheaply, perhaps as part of an MOU with another academic library for little or no cost, to something more expensive like a contract with a commercial service.  I’m interested to see if UNT would be able to take advantage of the services offered by TSLAC and their Records Management Services.

So what’s next?

I’ve had fun musing about this sort of thing for the past day or so.  I have zero experience with tape infrastructure, and from what I can tell it can get as cool and feature-rich as you are willing to pay for.  I like the idea of keeping it simple, so if I can work directly with a tape autoloader using command-line tools like tar and mt, I think that is what I would prefer.

Hope you enjoyed my musings here; if you have thoughts or suggestions, or if I missed something, please let me know via Twitter.