This is the first in a series of posts about some of the infrastructure that sits behind the UNT Libraries’ digital library initiatives. To the public, that infrastructure looks like this: http://texashistory.unt.edu/ and http://digital.library.unt.edu/
The digital object model we use at the UNT Libraries is modeled after digital objects we’ve had contact with over the past five years. We also established use cases for several kinds of objects that our previous data model did not support. All of this informed our thinking when it came to modeling and kept us on task.
In the UNTL model, a digital object is described by a single descriptive metadata file. The model doesn’t require a particular metadata format; however, all of our systems have been built around the format we call UNTL, which is a locally qualified Dublin Core metadata model with XML, Python and JSON representations.
Each digital object has one or more manifestations. A manifestation can be thought of as a “format” or type. If pressed, it isn’t too hard to map this over to the FRBR idea of manifestation, at least in my mind.
Each manifestation contains one or more fileSets. A fileSet is a grouping of files that are related in some logical way.
Let’s jump into an example to see how this works.
Let’s say we have a book we want to digitize. We decide to scan each of the pages in the book, and we want to OCR each page so we can search the full text in some undisclosed system. Already we can see that we have two files that are related to each other: the master tiff image and the OCR text file.
Here is a short list of what those files could look like.
    000100tp.tif
    000100tp.txt
    00020000.tif
    00020000.txt
    00030001.tif
    00030001.txt
    ...
You can see how things pair up in the example. Let’s call each of these groups a fileSet.
    fileSet_1
        000100tp.tif
        000100tp.txt
    fileSet_2
        00020000.tif
        00020000.txt
    fileSet_3
        00030001.tif
        00030001.txt
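The pairing above can be sketched in a few lines of Python. This is only an illustration that assumes files belonging together share everything before the extension; the function name `group_into_filesets` is made up for this post and is not part of any UNT system.

```python
from collections import defaultdict
from pathlib import PurePath


def group_into_filesets(filenames):
    """Group filenames that share a stem (the name before the extension)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[PurePath(name).stem].append(name)
    # Order the groups by stem, and the files inside each group by name.
    return [sorted(groups[stem]) for stem in sorted(groups)]


files = [
    "000100tp.tif", "000100tp.txt",
    "00020000.tif", "00020000.txt",
    "00030001.tif", "00030001.txt",
]
file_sets = group_into_filesets(files)
for number, members in enumerate(file_sets, start=1):
    print(f"fileSet_{number}", members)
```

Grouping on the filename is a convenience here, not a rule of the model: a fileSet is any set of files that are related in some logical way.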
Each of the fileSets in our example belongs to one manifestation: basically, a manifestation that is a bunch of tiff files with OCR.
    manifestation_1
        fileSet_1
        fileSet_2
        fileSet_3
We choose to give the object as a whole a descriptive metadata record which describes the content in the digital object.
    descriptive_metadata
    manifestation_1
        fileSet_1
        fileSet_2
        fileSet_3
For good measure we want to give the digital object a globally unique identifier so that we never get it mixed up with anything else.
    ark:/67531/metapth10232
        descriptive_metadata
        manifestation_1
            fileSet_1
            fileSet_2
            fileSet_3
Well that’s fine and all but it really isn’t that interesting.
Nope not really at all.
So let’s get a little more into this.
What if we wanted to add a pdf representation to this digital object, but we wanted it to be a single pdf combining all of the pages instead of one pdf per page? What does that look like?
    000100tp.tif
    000100tp.txt
    00020000.tif
    00020000.txt
    00030001.tif
    00030001.txt
    book.pdf
See how those don’t group nicely anymore? Well we would just break that pdf out into its own manifestation.
    manifestation_1
        fileSet_1
            000100tp.tif
            000100tp.txt
        fileSet_2
            00020000.tif
            00020000.txt
        fileSet_3
            00030001.tif
            00030001.txt
    manifestation_2
        fileSet_1
            book.pdf
So now, seen from a little higher up, the digital object looks like this:
    ark:/67531/metapth10232
        descriptive_metadata
        manifestation_1
            fileSet_1
            fileSet_2
            fileSet_3
        manifestation_2
            fileSet_1
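For readers who think in code, here is one way the hierarchy so far could be written down as Python dataclasses. It is a sketch of the shape of the model, not our actual implementation: the class names, the field names and the metadata filename `metadata.untl.xml` are all placeholders invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    files: List[str]


@dataclass
class Manifestation:
    file_sets: List[FileSet] = field(default_factory=list)


@dataclass
class DigitalObject:
    identifier: str
    descriptive_metadata: str  # hypothetical: a path to the metadata file
    manifestations: List[Manifestation] = field(default_factory=list)


book = DigitalObject(
    identifier="ark:/67531/metapth10232",
    descriptive_metadata="metadata.untl.xml",  # placeholder filename
    manifestations=[
        # manifestation_1: one fileSet per scanned page (tiff plus OCR text)
        Manifestation([
            FileSet(["000100tp.tif", "000100tp.txt"]),
            FileSet(["00020000.tif", "00020000.txt"]),
            FileSet(["00030001.tif", "00030001.txt"]),
        ]),
        # manifestation_2: the whole book as a single pdf
        Manifestation([FileSet(["book.pdf"])]),
    ],
)

for number, manifestation in enumerate(book.manifestations, start=1):
    print(f"manifestation_{number}: {len(manifestation.file_sets)} fileSet(s)")
```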
One final example. At UNT we are the digital archive for the Texas Register, which is deposited with us by the Texas Secretary of State’s Office each week. It comes to us in a few different formats, which they create and we want to keep track of.
Each issue of the register comes to us as a single pdf file, a collection of html files and possibly many txt files. Additionally, we convert the pdf into tiff images so that it works well in our delivery system. We want to package all of this content together in a single digital object, so an object could look like this.
    ark:/67531/metapth88989
        descriptive_metadata
        manifestation_1 (tiff version)
            fileSet_1
            fileSet_2
            fileSet_3
            ...
            fileSet_n
        manifestation_2 (pdf version)
            fileSet_1
        manifestation_3 (html version)
            fileSet_1
            fileSet_2
            fileSet_3
            ...
            fileSet_n
        manifestation_4 (txt version)
            fileSet_1
            fileSet_2
            fileSet_3
            fileSet_4
You can see that this puts things in their nice neat places. Let’s go back to our first example of the scanned book and say we want to do some more things with it. Say, for example, we want to put it online in a delivery system, and we want to provide square, thumbnail, medium, and large jpeg versions of each page. In addition to the OCR txt file, we decide to save the raw output of the OCR engine in the ALTO format for future processing. We might also decide we want a lighter-weight file holding bounding box information for each word on the page (highlighting search terms is cool, right?).
So what does our filelist look like now for that book?
    fileSet_1
        000100tp.tif
        000100tp.alto.xml
        000100tp.bboxes.xml
        000100tp.txt
        000100tp.medium.jpg
        000100tp.jpg
        000100tp.thumbnail.jpg
        000100tp.square.jpg
    fileSet_2
        00020000.tif
        00020000.alto.xml
        00020000.bboxes.xml
        00020000.txt
        00020000.medium.jpg
        00020000.jpg
        00020000.thumbnail.jpg
        00020000.square.jpg
    fileSet_3
        00030001.tif
        00030001.alto.xml
        00030001.bboxes.xml
        00030001.txt
        00030001.medium.jpg
        00030001.jpg
        00030001.thumbnail.jpg
        00030001.square.jpg
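With this many files per page, it helps to be able to ask "which file in this fileSet is the master?" The sketch below assumes the naming pattern in the listing above, where everything after the first dot says what the file is for. The `USE_BY_SUFFIX` table and its labels are my own guesses for illustration (including the guess that the plain `.jpg` is the large jpeg), not a fixed vocabulary.

```python
from collections import defaultdict

# Hypothetical suffix-to-use table, inferred from the example filenames.
USE_BY_SUFFIX = {
    "tif": "master image",
    "alto.xml": "ALTO OCR output",
    "bboxes.xml": "word bounding boxes",
    "txt": "OCR text",
    "jpg": "large jpeg",  # assumption: the unqualified jpg is the large one
    "medium.jpg": "medium jpeg",
    "thumbnail.jpg": "thumbnail jpeg",
    "square.jpg": "square jpeg",
}


def split_name(filename):
    """Split at the first dot: '000100tp.medium.jpg' -> ('000100tp', 'medium.jpg')."""
    stem, _, suffix = filename.partition(".")
    return stem, suffix


def uses_by_fileset(filenames):
    """Map each stem to a {use: filename} dictionary."""
    file_sets = defaultdict(dict)
    for name in filenames:
        stem, suffix = split_name(name)
        file_sets[stem][USE_BY_SUFFIX.get(suffix, "unknown")] = name
    return dict(file_sets)


pages = uses_by_fileset([
    "000100tp.tif", "000100tp.alto.xml", "000100tp.bboxes.xml",
    "000100tp.txt", "000100tp.medium.jpg", "000100tp.jpg",
    "000100tp.thumbnail.jpg", "000100tp.square.jpg",
])
print(pages["000100tp"]["master image"])
```

Splitting at the first dot only works because these stems contain no dots. A name with dots in the stem would need a different rule, which is one more reason to record the use explicitly rather than infer it from the filename.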
This gets pretty big pretty quickly, doesn’t it?
If you think about it, many fileSets could use a few additional bits of information as well. For example, each of the pages in a book has a page number, an audio recording has a “track name”, an image might have an annotation, and the like. We also want a way to define a sequence of fileSets, just in case you run into this horrible situation.
    1999.001.001_back.tif
    1999.001.001_front.tif
Ah, natural sort, and what do most systems end up doing? Most will put the back image first, followed by the front. But that’s really a naming problem…
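The front/back problem is easy to reproduce. In this small sketch the `sequence` dictionary is a stand-in for whatever explicit ordering you choose to record; it is not taken from a real object.

```python
names = ["1999.001.001_front.tif", "1999.001.001_back.tif"]

# Sorting on the filename alone puts "back" ahead of "front".
by_name = sorted(names)
print(by_name)

# An explicit, recorded sequence does not depend on how files were named.
sequence = {
    "1999.001.001_front.tif": 1,
    "1999.001.001_back.tif": 2,
}
by_sequence = sorted(names, key=sequence.__getitem__)
print(by_sequence)
```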
What if we thought of those as properties of a fileSet?
    fileSet_1.sequence = 1
    fileSet_1.pageNumber = "Title Page"
    fileSet_1.label = None

    fileSet_2.sequence = 2
    fileSet_2.pageNumber = None
    fileSet_2.label = None

    fileSet_3.sequence = 3
    fileSet_3.pageNumber = 1
    fileSet_3.label = None
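Those properties translate almost directly into a small Python class. The sketch below is one possible shape, with two liberties taken: the attribute is spelled `page_number` to follow Python naming habits, and page numbers are stored as strings so that "Title Page" and "1" can live in the same field. The `page_listing` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileSet:
    name: str
    sequence: int
    page_number: Optional[str] = None
    label: Optional[str] = None


# Deliberately out of order, to show that sequence drives the listing.
file_sets = [
    FileSet("fileSet_3", sequence=3, page_number="1"),
    FileSet("fileSet_1", sequence=1, page_number="Title Page"),
    FileSet("fileSet_2", sequence=2),
]


def page_listing(file_sets):
    """Return (sequence, display page number) pairs in reading order."""
    ordered = sorted(file_sets, key=lambda fs: fs.sequence)
    return [(fs.sequence, fs.page_number or "[unnumbered]") for fs in ordered]


for sequence, page in page_listing(file_sets):
    print(sequence, page)
```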
So now that we know a little about the fileSets in the book, we also want to know something about each of the files that make up a fileSet. Let’s just take one of those fileSets as an example.
I’m interested in knowing the following things about each file in our digital object:
- filename
- size in bytes
- mimetype
- checksum
- checksum type (md5, sha1)
- unique identifier (guid, uuid, ark)
- use (is this an OCR file? bounding box file? master tif? large jpeg?)
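Most of the items in this list can be gathered with the Python standard library. The function below is a sketch rather than our production code: `describe_file` and its dictionary keys are invented names, the mimetype is guessed from the file extension only, and a random UUID stands in for whatever identifier scheme you actually use.

```python
import hashlib
import mimetypes
import os
import tempfile
import uuid


def describe_file(path, use, checksum_type="md5"):
    """Collect basic file-level facts; `use` is supplied by the caller."""
    digest = hashlib.new(checksum_type)
    with open(path, "rb") as handle:
        # Read in chunks so large master images do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    mimetype, _encoding = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "size": os.path.getsize(path),
        "mimetype": mimetype,  # guessed from the extension, may be None
        "checksum": digest.hexdigest(),
        "checksum_type": checksum_type,
        "identifier": f"urn:uuid:{uuid.uuid4()}",  # stand-in identifier
        "use": use,
    }


# Demonstrate on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "000100tp.txt")
    with open(sample_path, "wb") as handle:
        handle.write(b"UNT Libraries")
    record = describe_file(sample_path, use="ocr")

print(record)
```

Passing `checksum_type="sha1"` switches the digest, since `hashlib.new` accepts either name.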
There are also a few other things that I am interested in at the file level: additional information that we use in the curation of the digital object. For us they look like this:
- PREMIS file level metadata
- JHOVE output stream
- File output stream
Those of you who know the various tools in the digital library tool-chest are probably saying “hey, that kind of sounds like METS”, and in fact it does sound like METS. The way we went about designing our digital object wasn’t “how would we do this in x technology or y system”; it was thinking about the types of content we wanted to handle and how we wanted to manage that content. We’ve decided to serialize our digital object model using METS, and it works pretty well for that.
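To give a feel for why METS is a comfortable fit, here is a rough sketch that writes a skeleton METS document with Python's `xml.etree.ElementTree`. The element and attribute names (`fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`, `ORDER`, `USE`) come from the METS schema, but the mapping itself is an assumption made for illustration: one `fileGrp` per manifestation and one `div` per fileSet. It is not a description of the actual UNTL METS profile, and the `build_mets` function and `TYPE` values are invented.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def tag(namespace, name):
    """Build a namespace-qualified name in ElementTree's {uri}name form."""
    return f"{{{namespace}}}{name}"


def build_mets(manifestations):
    """manifestations: list of fileSets; fileSet: list of (filename, mimetype, use)."""
    root = ET.Element(tag(METS, "mets"))
    file_sec = ET.SubElement(root, tag(METS, "fileSec"))
    struct_map = ET.SubElement(root, tag(METS, "structMap"))
    object_div = ET.SubElement(struct_map, tag(METS, "div"))
    for m_index, file_sets in enumerate(manifestations, start=1):
        # Assumed mapping: one fileGrp and one div per manifestation.
        group = ET.SubElement(
            file_sec, tag(METS, "fileGrp"), {"ID": f"manifestation_{m_index}"}
        )
        m_div = ET.SubElement(object_div, tag(METS, "div"), {"TYPE": "manifestation"})
        for fs_index, files in enumerate(file_sets, start=1):
            # Assumed mapping: one div per fileSet, ORDER carries the sequence.
            fs_div = ET.SubElement(
                m_div, tag(METS, "div"), {"TYPE": "fileSet", "ORDER": str(fs_index)}
            )
            for f_index, (filename, mimetype, use) in enumerate(files, start=1):
                file_id = f"file_{m_index}_{fs_index}_{f_index}"
                file_el = ET.SubElement(
                    group,
                    tag(METS, "file"),
                    {"ID": file_id, "MIMETYPE": mimetype, "USE": use},
                )
                ET.SubElement(
                    file_el,
                    tag(METS, "FLocat"),
                    {"LOCTYPE": "URL", tag(XLINK, "href"): filename},
                )
                # The structMap points back at the file by its ID.
                ET.SubElement(fs_div, tag(METS, "fptr"), {"FILEID": file_id})
    return root


manifestations = [
    [
        [("000100tp.tif", "image/tiff", "master"), ("000100tp.txt", "text/plain", "ocr")],
        [("00020000.tif", "image/tiff", "master"), ("00020000.txt", "text/plain", "ocr")],
    ],
    [
        [("book.pdf", "application/pdf", "pdf")],
    ],
]
mets = build_mets(manifestations)
print(ET.tostring(mets, encoding="unicode"))
```

The `file` element also has `SIZE`, `CHECKSUM` and `CHECKSUMTYPE` attributes, which is where the rest of the file-level facts listed earlier would naturally go.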