Monthly Archives: February 2010

Digital Object Model

This is the first in a series of posts about some of the infrastructure that sits behind the UNT Libraries’ digital library initiatives. To the public, that infrastructure looks like this: http://texashistory.unt.edu/ and http://digital.library.unt.edu/

The digital object model we use at the UNT Libraries is based on the digital objects we’ve had contact with over the past five years. We also established use cases for several kinds of objects that our previous data model did not support. All of this informed our thinking when it came to modeling and kept us on task.

In the UNTL model, a digital object is described by a single descriptive metadata file. The model doesn’t require a particular metadata format; however, all of our systems have been built around the format we call UNTL, which is a locally qualified Dublin Core metadata model with XML, Python and JSON representations.

Each digital object has one or more manifestations. A manifestation can be thought of as a “format” or type. If pressed, it isn’t too hard to map this over to the FRBR idea of manifestation, at least in my mind.

Each manifestation contains one or more fileSets. A fileSet is a grouping of files that are related in some logical way.

Let’s jump into an example to see how this works.

Let’s say we have a book we want to digitize. We decide to scan each of the pages in the book, and we want to OCR each page so we can search the full text in some undisclosed system. Already we can see that we have two files that have a relation to each other: the master tiff image and the OCR text file.

Here is a short list of what those files could look like.

 
 000100tp.tif
 000100tp.txt
 00020000.tif
 00020000.txt
 00030001.tif
 00030001.txt
 ...

You can see how things pair up in the example. Let’s call each of these groups a fileSet.

 fileSet_1
    000100tp.tif
    000100tp.txt
 
 fileSet_2
    00020000.tif
    00020000.txt

 fileSet_3
    00030001.tif
    00030001.txt
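
The pairing above can be done mechanically. Here is a minimal sketch in Python, assuming that files belonging together share everything before the first dot in their name. The function name group_into_filesets is made up for this post and is not part of any UNT tool.

```python
def group_into_filesets(filenames):
    """Group filenames that share a stem (the text before the first dot)."""
    groups = {}
    for name in sorted(filenames):
        stem = name.split(".", 1)[0]
        groups.setdefault(stem, []).append(name)
    # Number the groups in sorted-stem order: fileSet_1, fileSet_2, ...
    return {
        "fileSet_%d" % number: members
        for number, members in enumerate(groups.values(), start=1)
    }


files = [
    "000100tp.tif", "000100tp.txt",
    "00020000.tif", "00020000.txt",
    "00030001.tif", "00030001.txt",
]
filesets = group_into_filesets(files)
for label, members in filesets.items():
    print(label, members)
```

Grouping on the stem is only a naming convention; it relies on dictionaries keeping insertion order (Python 3.7 and later).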

Each of the fileSets in our example belongs to one manifestation, basically a manifestation that is a bunch of tiff files with OCR.

 
  manifestation_1
   fileSet_1
   fileSet_2
   fileSet_3

We choose to give the object as a whole a descriptive metadata record which describes the content in the digital object.

 descriptive_metadata
 manifestation_1
   fileSet_1
   fileSet_2
   fileSet_3

For good measure we want to give the digital object a globally unique identifier so that we never get it mixed up with anything else.

 
ark:/67531/metapth10232
 descriptive_metadata
 manifestation_1
   fileSet_1
   fileSet_2
   fileSet_3
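
For readers who think in code, the outline above could be sketched as a few nested Python classes. This is only an illustration of the shape of the model; the class and attribute names are assumptions chosen for this example, not the names used in our systems.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSet:
    # Files that are related in some logical way, e.g. one page.
    files: List[str] = field(default_factory=list)


@dataclass
class Manifestation:
    # One "format" or type of the object, made of one or more fileSets.
    file_sets: List[FileSet] = field(default_factory=list)


@dataclass
class DigitalObject:
    identifier: str            # globally unique, e.g. an ARK
    descriptive_metadata: str  # placeholder for the single metadata file
    manifestations: List[Manifestation] = field(default_factory=list)


book = DigitalObject(
    identifier="ark:/67531/metapth10232",
    descriptive_metadata="descriptive_metadata",
    manifestations=[
        Manifestation(file_sets=[
            FileSet(["000100tp.tif", "000100tp.txt"]),
            FileSet(["00020000.tif", "00020000.txt"]),
            FileSet(["00030001.tif", "00030001.txt"]),
        ]),
    ],
)
print(len(book.manifestations), len(book.manifestations[0].file_sets))
```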

Well that’s fine and all but it really isn’t that interesting.

Nope not really at all.

So let’s get a little more into this.

What if we wanted to add a pdf representation to this digital object, but as a single pdf combining all of the pages instead of one pdf per page? What does that look like?

 000100tp.tif
 000100tp.txt
 00020000.tif
 00020000.txt
 00030001.tif
 00030001.txt
 book.pdf

See how those don’t group nicely anymore? Well we would just break that pdf out into its own manifestation.

 manifestation_1
   fileSet_1
     000100tp.tif
     000100tp.txt
   fileSet_2
     00020000.tif
     00020000.txt
   fileSet_3
     00030001.tif
     00030001.txt
 manifestation_2
   fileSet_1
     book.pdf

So now the digital object from a little higher up looks like this:

ark:/67531/metapth10232
 descriptive_metadata
 manifestation_1
   fileSet_1
   fileSet_2
   fileSet_3
 manifestation_2
   fileSet_1
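
One detail worth noticing in that outline is that fileSet numbering restarts inside each manifestation. A rough sketch that produces the same outline from nested lists follows; the outline function and the list-of-lists layout are assumptions made for this example only.

```python
def outline(identifier, manifestations):
    """Return the indented outline used in this post as a list of lines."""
    lines = [identifier, " descriptive_metadata"]
    for m_number, file_sets in enumerate(manifestations, start=1):
        lines.append(" manifestation_%d" % m_number)
        # fileSet numbering restarts inside every manifestation.
        for f_number, _files in enumerate(file_sets, start=1):
            lines.append("   fileSet_%d" % f_number)
    return lines


book = [
    # manifestation_1: one fileSet per scanned page
    [["000100tp.tif", "000100tp.txt"],
     ["00020000.tif", "00020000.txt"],
     ["00030001.tif", "00030001.txt"]],
    # manifestation_2: the single combined pdf
    [["book.pdf"]],
]
lines = outline("ark:/67531/metapth10232", book)
print("\n".join(lines))
```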

One final example: at UNT we are the digital archive for the Texas Register, which is deposited with us by the Texas Secretary of State’s Office each week. It comes to us in a few different formats, which they create and we want to keep track of.

Each issue of the register comes to us as a single pdf file, a collection of html files and possibly many txt files. Additionally, we convert the pdf into tiff images so that it works well in our delivery system. We want to package all of this content together in a single digital object, so an object could look like this.

ark:/67531/metapth88989
 descriptive_metadata
 manifestation_1 (tiff version)
  fileSet_1
  fileSet_2
  fileSet_3
  ...
  fileSet_n
 manifestation_2 (pdf version)
  fileSet_1
 manifestation_3 (html version)
  fileSet_1
  fileSet_2
  fileSet_3
  ...
  fileSet_n
 manifestation_4 (txt version)
  fileSet_1
  fileSet_2
  fileSet_3
  fileSet_4

You can see that this puts things in their nice neat places. Let’s go back to our first example of the scanned book and say we want to do some more things with it. For example, to put it online in a delivery system we want to provide a square, thumbnail, medium, and large size jpeg version of each page. In addition to the OCR txt file, we decide to save the raw output of the OCR engine in the ALTO format for processing in the future. We might also want a lighter-weight file holding bounding box information for each word on the page (highlighting search terms is cool, right?).

So what does our filelist look like now for that book?

fileSet_1
 000100tp.tif
 000100tp.alto.xml
 000100tp.bboxes.xml
 000100tp.txt
 000100tp.medium.jpg
 000100tp.jpg
 000100tp.thumbnail.jpg
 000100tp.square.jpg

fileSet_2
 00020000.tif
 00020000.alto.xml
 00020000.bboxes.xml
 00020000.txt
 00020000.medium.jpg
 00020000.jpg
 00020000.thumbnail.jpg
 00020000.square.jpg

fileSet_3
 00030001.tif
 00030001.alto.xml
 00030001.bboxes.xml
 00030001.txt
 00030001.medium.jpg
 00030001.jpg
 00030001.thumbnail.jpg
 00030001.square.jpg

This gets pretty big pretty quickly, doesn’t it?

If you think about it, there are many fileSets that could use a few additional bits of information as well. For example, each of the pages in a book has a page number, an audio recording has a “track name”, an image might have an annotation, and the like. We also want a way to define a sequence of fileSets, just in case you run into this horrible situation.

 1999.001.001_back.tif
 1999.001.001_front.tif

Ah, sorting by name, and what do most systems end up doing? Most will put the back image first, followed by the front, because “back” sorts before “front”. But that’s really a naming problem…

What if we thought of those as properties of a fileSet?

 fileSet_1.sequence = 1
 fileSet_1.pageNumber = "Title Page"
 fileSet_1.label = None
 
 fileSet_2.sequence = 2
 fileSet_2.pageNumber = None
 fileSet_2.label = None

 fileSet_3.sequence = 3
 fileSet_3.pageNumber = 1
 fileSet_3.label = None
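
To see why an explicit sequence helps, here is a small sketch that sorts the front/back pair from above both ways. The FileSet class, its master attribute, and the "Front" and "Back" labels are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileSet:
    master: str                        # hypothetical: name of the master file
    sequence: int                      # explicit position within the manifestation
    page_number: Optional[str] = None
    label: Optional[str] = None


two_sided_item = [
    FileSet(master="1999.001.001_back.tif", sequence=2, label="Back"),
    FileSet(master="1999.001.001_front.tif", sequence=1, label="Front"),
]

# Sorting on the filename puts the back first; sorting on sequence does not.
by_name = [fs.master for fs in sorted(two_sided_item, key=lambda fs: fs.master)]
by_sequence = [fs.master for fs in sorted(two_sided_item, key=lambda fs: fs.sequence)]
print(by_name)
print(by_sequence)
```

With the sequence stored as a property, the filenames no longer have to carry the ordering.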

So now that we know a little information about the fileSets in the book, we also want to know some information about each of the files that make up a fileSet. Let’s just take one of those fileSets as an example.

I’m interested in knowing the following things about each file in our digital object:

  • filename
  • size in bytes
  • mimetype
  • checksum
  • checksum type (md5, sha1)
  • unique identifier (guid, uuid, ark)
  • use (is this an OCR file? bounding box file? master tif? large jpeg?)
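
Most of the items in that list can be gathered with the Python standard library. The following is a sketch under a few assumptions: the suffix-to-use mapping (including treating a plain .jpg as the large jpeg), the use labels, and the choice of a UUID as the identifier are all made up for illustration and do not describe our production code.

```python
import hashlib
import mimetypes
import os
import tempfile
import uuid

# Hypothetical mapping from filename suffix to a "use" label.
USE_BY_SUFFIX = {
    ".tif": "master",
    ".txt": "ocr",
    ".alto.xml": "alto",
    ".bboxes.xml": "bounding-boxes",
    ".medium.jpg": "medium",
    ".thumbnail.jpg": "thumbnail",
    ".square.jpg": "square",
    ".jpg": "large",
}


def guess_use(filename):
    # Check longer suffixes first so ".medium.jpg" wins over ".jpg".
    for suffix in sorted(USE_BY_SUFFIX, key=len, reverse=True):
        if filename.endswith(suffix):
            return USE_BY_SUFFIX[suffix]
    return None


def describe_file(path):
    """Collect the file-level facts listed above for one file on disk."""
    with open(path, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    name = os.path.basename(path)
    return {
        "filename": name,
        "size": os.path.getsize(path),
        "mimetype": mimetypes.guess_type(name)[0],
        "checksum": digest,
        "checksum_type": "md5",
        "identifier": "urn:uuid:%s" % uuid.uuid4(),
        "use": guess_use(name),
    }


# Try it on a throwaway file standing in for one page of OCR text.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "000100tp.txt")
    with open(path, "wb") as handle:
        handle.write(b"Title Page\n")
    info = describe_file(path)
print(info["filename"], info["size"], info["mimetype"], info["use"])
```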

There are also a few other things that I am interested in at the file level: additional information that we use in the curation of the digital object. For us they look like this:

  • PREMIS file level metadata
  • JHOVE output stream
  • File output stream

Those of you who know the various tools in the digital library tool-chest are probably saying “hey, that kind of sounds like METS”, and in fact it does. We didn’t go about designing our digital object by asking “how would we do this in x technology or y system”; we thought about the types of content we wanted to handle and how we wanted to manage that content. We’ve decided to serialize our digital object model using METS, and it works pretty well for that.
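
To give a feel for how the pieces might line up with METS, here is a rough sketch that builds a tiny METS-like document for the scanned book with Python’s xml.etree.ElementTree. It is an assumption-laden illustration, not our actual METS profile: the TYPE values, the USE labels and the file IDs are invented, the descriptive metadata section is left out, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return "{%s}%s" % (METS, tag)


mets = ET.Element(q("mets"), {"OBJID": "ark:/67531/metapth10232"})
file_sec = ET.SubElement(mets, q("fileSec"))
file_grp = ET.SubElement(file_sec, q("fileGrp"))
struct_map = ET.SubElement(mets, q("structMap"))
manifestation = ET.SubElement(struct_map, q("div"), {"TYPE": "manifestation"})

# (stem, page number) for each fileSet, in sequence order.
pages = [("000100tp", "Title Page"), ("00020000", None), ("00030001", "1")]
for sequence, (stem, page_number) in enumerate(pages, start=1):
    attrs = {"TYPE": "fileSet", "ORDER": str(sequence)}
    if page_number is not None:
        attrs["ORDERLABEL"] = page_number
    file_set = ET.SubElement(manifestation, q("div"), attrs)
    for suffix, use in ((".tif", "master"), (".txt", "ocr")):
        file_id = "file_%d_%s" % (sequence, use)
        entry = ET.SubElement(file_grp, q("file"), {"ID": file_id, "USE": use})
        ET.SubElement(entry, q("FLocat"), {
            "LOCTYPE": "URL",
            "{%s}href" % XLINK: stem + suffix,
        })
        # The fileSet div points back at the file through its ID.
        ET.SubElement(file_set, q("fptr"), {"FILEID": file_id})

xml_text = ET.tostring(mets, encoding="unicode")
print(xml_text)
```

In this sketch a manifestation and a fileSet both become div elements in the structMap, and the sequence and page number ride along as ORDER and ORDERLABEL.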