Packaging content into Submission Information Packages

In the previous post I described the digital object model in use at the UNT Libraries for our digital content. This post is going to describe the process of getting content into our model.

First off, and I say this because it gets asked all the time: we do not by any means create the METS records manually, far from it in fact. Let's start walking through the things we have set up.

Let's talk about a collection of images for this example.

Say you digitized 100 images for a project, and they all came to you with identifiers that were in some way important to the owners of the collection. Here is how the files look on disk when you are finished with them.

1999.001.001
  1999.001.001_01.tif
  1999.001.001_02.tif
  metadata.xml
1999.001.002
  1999.001.002_01.tif
  1999.001.002_02.tif
  metadata.xml
1999.002.001
  1999.002.001_01.tif
  1999.002.001_02.tif
  metadata.xml

What you have here are three of the scanned photographs. Each photograph gets a folder named after the accession number the collection owners provided. Within each folder you will see two tif files, one for the front and one for the back of the photo: _01 is the front and _02 is the back. Finally, you will see a metadata.xml file, which is the metadata record created when the image was scanned. It rides along with the digital object rather than living in some spreadsheet somewhere that has to be matched up later.
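
As a rough illustration (and not part of our actual tooling), a few lines of Python could walk this layout and confirm that each accession folder has its front scan, its back scan, and its metadata record:

    import os

    def check_photo_folder(path):
        # Confirm a raw photo folder has front/back scans and a metadata record.
        # Assumes the _01 (front) / _02 (back) naming described above; this
        # helper is illustrative only, not part of the UNT tooling.
        accession = os.path.basename(path.rstrip("/"))
        expected = {
            "%s_01.tif" % accession,   # front of the photograph
            "%s_02.tif" % accession,   # back of the photograph
            "metadata.xml",            # record created when the image was scanned
        }
        return sorted(expected - set(os.listdir(path)))

    for folder in ["1999.001.001", "1999.001.002", "1999.002.001"]:
        if os.path.isdir(folder):
            missing = check_photo_folder(folder)
            if missing:
                print(folder, "is missing:", ", ".join(missing))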

The first step is to normalize the input content a little so that we can move it into an ingest system. This normalization is actually the creation of a Submission Information Package (SIP) for our system. Here is what a normalized folder structure looks like.

1999.001.001
  01_tif
    1999.001.001_01.tif
    1999.001.001_02.tif
  metadata.xml
1999.001.002
  01_tif
    1999.001.002_01.tif
    1999.001.002_02.tif
  metadata.xml
1999.002.001
  01_tif
    1999.002.001_01.tif
    1999.002.001_02.tif
  metadata.xml

All we did was add another level of hierarchy to the structure. The 01_tif folder you see is a combination of two things: the 01 part defines the manifestation order, and the _tif part is the label for the manifestation.
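
If you wanted to script that step yourself, a minimal sketch might look like the following; it assumes the single-manifestation photo layout above and is not our actual normalization code:

    import os
    import shutil

    def normalize_photo_folder(path, order=1, label="tif"):
        # Move the scans in a raw photo folder into an NN_label manifestation
        # folder; metadata.xml stays at the top level of the object folder.
        # A sketch of the step described above, not our actual script.
        manifestation = os.path.join(path, "%02d_%s" % (order, label))
        os.makedirs(manifestation, exist_ok=True)
        for name in os.listdir(path):
            if name.lower().endswith(".tif"):
                shutil.move(os.path.join(path, name),
                            os.path.join(manifestation, name))

    for folder in ["1999.001.001", "1999.001.002", "1999.002.001"]:
        if os.path.isdir(folder):
            normalize_photo_folder(folder)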

Here is an example structure with multiple manifestations from the last post.

0102is
  01_tif
    00001000tp.tif
    0000200000.tif
    0000300001.tif
    0000400002.tif
  02_pdf
    0102is.pdf
  03_html
    index.html
    page1.html
    page2.html
  04_txt
    0102is-introduction.txt
    0102is-table_of_contents.txt
  metadata.xml

In this example we have four manifestations: tif, pdf, html, and txt. The order is defined by the initial sequence number in the folder names.
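
Because the folder name carries both pieces of information, the order and label can be pulled back apart with a simple split. Just to illustrate the convention:

    def parse_manifestation(folder_name):
        # Split an NN_label folder name into its order and label,
        # e.g. "02_pdf" -> (2, "pdf").
        order, label = folder_name.split("_", 1)
        return int(order), label

    for folder in ["01_tif", "02_pdf", "03_html", "04_txt"]:
        print(parse_manifestation(folder))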

What this input normalization does is take the arbitrary organization that exists and put the bare minimum of control on it so that we can start to do some automated processing.

There are two other pieces of information that should be passed along about the SIPs that we create.

We decided about a year and a half ago that we would use the BagIt model for storing our content, and we actually create bags in this SIP-making stage. Here is what a complete SIP actually looks like for one of those photographs.

1999.001.001
  0=untl-sip-0.1
  bag-info.txt
  bagit.txt
  coda_directives.py
  data/
    01_tif
      1999.001.001_01.tif
      1999.001.001_02.tif
    metadata.xml
  manifest-md5.txt

Think of the data/ directory as the digital object folder: the folders inside of it correspond to manifestations of the digital object, the files get grouped into fileSets, and there is that metadata.xml file still hanging out in the object as well.
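
If you want to experiment with the same idea, the Library of Congress bagit-python library will turn a normalized object folder into a bag like the one above. This is only a sketch, not our sipmaker.py; the 0=untl-sip-0.1 tag file and coda_directives.py are added by our own tooling rather than by bagit:

    import bagit  # Library of Congress bagit-python

    # Bag a normalized object folder in place: the existing contents move
    # under data/ and the manifest and tag files are written next to them.
    # The bag-info value below is a placeholder, not our production metadata.
    bag = bagit.make_bag(
        "1999.001.001",
        bag_info={"Source-Organization": "Example Library"},
        checksums=["md5"],   # matches the manifest-md5.txt shown above
    )
    bag.validate()           # verify the payload checksums after creation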

There is one other thing that has crept into the example and that is the coda_directives.py file.

This is actually a way to control the processing of objects later in the chain. It is a Python data structure with a set of instructions per manifestation, which are uncommented to affect the processing. Some of the values are:

  • does this manifestation use a magic number? (a way of representing page order and page number in the filename)
  • does this manifestation follow the UNT method of creating fileSets? (basically grouping on the first part of the filename)
  • what do you want the manifestation label to be?
  • what size do you want the large web-size derivative to be? (typically 1200 or 1500 px across)
  • do you want to create tiles for this manifestation? (for a zooming interface)

It is easy to add new directives as they are needed.
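
I won't reproduce a real directives file here, but conceptually it boils down to per-manifestation settings along these lines (the key names and values below are invented for illustration; the real file is a set of directives you uncomment):

    # Invented illustration of the kind of per-manifestation settings a
    # coda_directives.py file carries; these are not the real directive names.
    DIRECTIVES = {
        "01_tif": {
            "uses_magic_number": True,   # page order/number encoded in the filenames
            "unt_filesets": True,        # group fileSets on the first part of the filename
            "label": "Full-size images",
            "web_size": 1200,            # width in pixels for the large web derivative
            "make_tiles": False,         # no zooming interface for this object
        },
    }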

One last thing. There is a tool we created that automates the process of normalizing things. The tool is called sipmaker.py and it has two options: one assigns a base info file to use as the bag-info.txt file that gets added as part of the bag, and the other is a suffix flag where you can specify the name of the manifestation you are processing. If you have already sorted your input directory into folders you don’t have to specify the suffix, but if there is just a bunch of files living in the root of your object folder, the suffix will package them all into a 01_{suffix} folder.

This makes it easy to package many objects in a batch process for submission into our ingest system.
