Monthly Archives: March 2010

Some examples of hackable identifiers in the UNT Digital Library

The UNT Digital Library uses the ARK identifier scheme as part of its persistent identifier strategy. Below are some of the ways that a user can interact with, or hack, a URL.

An object is identified with an ARK added to the http://digital.library.unt.edu domain

ark:/67531/metadc11772

becomes

http://digital.library.unt.edu/ark:/67531/metadc11772/

First thing that can be done is to ask for basic metadata about the object at this ARK

http://digital.library.unt.edu/ark:/67531/metadc11772/?

Commitment statement about this ARK

http://digital.library.unt.edu/ark:/67531/metadc11772/??

All objects have a thumbnail image associated with them

http://digital.library.unt.edu/ark:/67531/metadc11772/thumbnail/

All objects have a citation page with them

http://digital.library.unt.edu/ark:/67531/metadc11772/citation/

Metadata in a few formats is available at a metadata splash page

http://digital.library.unt.edu/ark:/67531/metadc11772/metadata/

and the individual formats are available as well

http://digital.library.unt.edu/ark:/67531/metadc11772/metadata.dc.xml

Our data model has the idea of an object, which is made up of manifestations, which in turn are made up of fileSets. Here is how those look in the url

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/ – manifestation which is a series of image files
http://digital.library.unt.edu/ark:/67531/metadc11772/m2/ – manifestation which is single pdf

Each manifestation is made up of fileSets

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/1/ – first fileSet in the first manifestation
http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/ – second fileSet in the first manifestation
http://digital.library.unt.edu/ark:/67531/metadc11772/m2/1/ – first fileSet in the second manifestation

We have a view that allows you to see all of the manifestations in an object as well.

http://digital.library.unt.edu/ark:/67531/metadc11772/m/

FileSets can have additional views to them, for example:

an OCR view

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/ocr/

small resolution view

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/small_res/

medium resolution view

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/med_res/

high resolution view

http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/high_res/

you can force a browser to download one of the views

http://digital.library.unt.edu/ark:/67531/metadc11772/m2/1/high_res_d
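
The slash-terminated patterns above can be sketched as a small URL builder. This is only an illustration of the pattern, not code from our system: the function name ark_url and its parameters are made up for this example, and it does not cover the URLs that lack a trailing slash (metadata.dc.xml and high_res_d).

```python
BASE = "http://digital.library.unt.edu"


def ark_url(ark, *parts, suffix=""):
    """Join an ARK and optional path parts into a slash-terminated URL.

    `suffix` is appended verbatim, e.g. "?" or "??" for the ARK inflections.
    """
    url = f"{BASE}/{ark}/"
    if parts:
        url += "/".join(str(part) for part in parts) + "/"
    return url + suffix


ark = "ark:/67531/metadc11772"
print(ark_url(ark))                    # the object itself
print(ark_url(ark, suffix="?"))        # basic metadata
print(ark_url(ark, "m1", 2, "ocr"))    # OCR view of a fileSet
```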

Some objects have different functionality than others. For example, video fileSets can pseudo-stream to Flash players, and image fileSets can be resized, rotated and have their aspect ratio adjusted in the application before they are sent to the browser.

When we need a new bit of functionality it is mapped to the logical place within our current identifier structure.

In the end, though, we consider everything past the name in the ARK to be optional to support. We fully expect to resolve these URLs to very similar functionality as our system moves forward, but if something changes we will redirect the URLs back to the base ARK.
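
As a rough illustration of "redirect back to the base ARK", here is a minimal sketch that trims a URL down to the ARK name. The regular expression and the function name base_ark_url are assumptions made for this example; the real redirect logic lives in the application and may look quite different.

```python
import re

# "ark:/" + name assigning authority number + "/" + name.
# Everything after the name is the optional part described above.
ARK_RE = re.compile(r"ark:/[^/]+/[^/?#]+")


def base_ark_url(url):
    """Return the URL cut back to the base ARK, or None if no ARK is found."""
    match = ARK_RE.search(url)
    if match is None:
        return None
    return url[: match.end()] + "/"


print(base_ark_url("http://digital.library.unt.edu/ark:/67531/metadc11772/m1/2/ocr/"))
```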

Aubrey: Our content delivery/metadata management system.

Aubrey is what we decided to call the system responsible for a variety of tasks which are all public facing. This system handles both content delivery as well as allowing collection owners to edit metadata and object records as needed.

You can see the content delivery side of Aubrey in the following two systems:

The creation of Aubrey was funded by a generous grant from the Institute of Museum and Library Services, as part of a national leadership grant to design and develop a rapid development framework for interface development in digital libraries. The grant was written because all too often in digital libraries there is a disconnect between the user interfaces and features that users request and what we are able to provide them, given the digital library platform we are trying to use.

There are several reasons we chose to roll our own content delivery system, and as time goes on I keep adding to this list. Some are more important than others, but together they added up to us creating our own content delivery and metadata management system.

Here are some of the reasons we decided to do this. Please don’t take any of these as an attack on any systems, most likely they can do the things that we are trying to do but I wasn’t smart enough to figure out how to make them bend in the way needed…

  • We wanted a system built on a modern Web framework that takes care of most of the fiddly bits of Web application building.
  • We wanted to work in Python, we use Python, we love Python, we get stuff done in Python.
  • We wanted a system that was itself loosely coupled so we could swap things out as technologies changed.
  • We wanted one system to serve all digital content, audio, video, books, newspapers, data sets, photographs, stuff…
  • We wanted a system that could scale horizontally in respect to handling more requests and adding more content.
  • We wanted redundancy in as many areas as possible so if one server goes down the whole application still runs.
  • We wanted to let our UI people do the user interface, i.e., we wanted to use a templating language that our UI people wanted to work with.
  • We had a specific idea of how we wanted to deliver content to users and how those ideas mapped to urls, so a framework that promoted beautiful urls was ideal.
  • We wanted a huge community solving most of the problems with a big Web application (most of those problems aren’t library problems)
  • We wanted to treat the application as a replaceable set of services that sits on top of our own digital object model. We really like the idea of “permanent objects, disposable systems”.
  • We wanted a system that would allow for metadata more complex than simple key-value pairs.
  • We wanted to manage the editing interface for the 120+ partner institutions that use our infrastructure.

In the end we arrived at a system that makes use of these components

  • Python
  • Django
  • Ubuntu
  • Perlbal
  • Memcached
  • OpenLayers
  • JQuery
  • Solr
  • AtomPub

The digital objects this collection of technologies sits atop use the following standards and specifications:

  • METS
  • Dublin Core – (Locally Qualified)
  • PairTree
  • Namaste
  • BagIt

Aubrey is the name of a town north of Denton, and it served us as a name that is as good as any other.

So we had support, funding, people, ideas of what we wanted and a set of technologies we were all excited to work with. This turned out to be quite fun.

Aubrey was built over a two-year period and consisted of three major releases. During each release the underlying objects never changed, and in fact the plan is to never change the digital objects that sit on disk; there are just too many for us to reasonably touch, and most likely the changes aren’t worth that overhead. We built Aubrey around an abstraction of what a digital object is to us in the system. That internal abstraction can change easily because it is only ever stored in cache; that’s where stuff changes.

So anyway, Aubrey meet people, people meet Aubrey.

I’ll flesh out the under-the-hood stuff in following posts. If I can figure out how to use some of this new software to draw pictures, you might even get graphics.

How we use BagIt to ingest and store digital objects.

At the UNT Libraries we use BagIt as a key component of our digital strategy. It is a simple specification which doesn’t get in your way but offers a few niceties which come in quite handy. Let me walk you through three different bags that we have in our system: the SIP, the AIP and the ACP (Access Content Package).

In the last post I described the layout of our SIPs; here it is again to refresh your memory.

sip
|-- 0=UNTL_SIP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- 01_tif
|   |   |-- 1999.001.001_01.tif
|   |   `-- 1999.001.001_02.tif
|   `-- metadata.xml
`-- manifest-md5.txt

The ingest process starts by submitting the SIP to one of the different dropboxes we have set up for the ingest process. The dropboxes differ only in how the names for objects are assigned, with one dropbox per namespace (we currently name under the metapth, metadc and metarkv namespaces).

Dropboxes set up the steps involved in the ingest process of our digital content. They are responsible for creating the AIPs and ACPs for our system. The first step is the creation of an AIP. We designed a packaging script called AIPmaker.py which does the heavy lifting in this process. It takes the input SIP, opens up the coda_directives.py file, finds what’s in the data directory and starts processing. For each file in the data directory there is a PREMIS stream, JHOVE stream and File stream saved either as a file or included in the METS record. This is what the resulting AIP looks like.

aip
|-- 0=UNTL_AIP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- data
|   |   `-- 01_tif
|   |       |-- 1999.001.001_01.tif
|   |       `-- 1999.001.001_02.tif
|   |-- metadata
|   |   |-- 17711fdb-2e25-4566-bf3f-daa172a12190.jhove.xml
|   |   `-- 9639acae-397a-4c90-8851-52b6f04c4d8d.jhove.xml
|   |-- metadata.xml
|   `-- metapth1234.aip.mets.xml
`-- manifest-md5.txt

AIPmaker.py has a management script called makeAIP.py which is responsible for a few important things. This script does a series of checks on the data before it starts processing content. Keep in mind that the data was moved to this ingest system from another system, and network hops require validation of bags in my book. Here are the steps that the makeAIP.py script takes.

  1. Check that the proper utilities are installed on the ingest server.
  2. Check that the number server is online (the number server provides object names).
  3. Check the Namaste tag in the bag to make sure it is a SIP bag.
  4. Fully validate the bag.
  5. Get a new object name from the number server.
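
Checks 3 and 4 in the list above can be sketched roughly as follows. This is not the code in makeAIP.py: the function names are invented for the example, the Namaste check only looks at the name of the "0=" tag file, and the manifest check only compares manifest-md5.txt against the files under data/, which is less than a full BagIt validation.

```python
import hashlib
from pathlib import Path


def namaste_type(bag_dir):
    """Return the type from the bag's '0=...' tag file name, or None."""
    for entry in Path(bag_dir).iterdir():
        if entry.is_file() and entry.name.startswith("0="):
            return entry.name[2:]
    return None


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(bag_dir):
    """Compare manifest-md5.txt with the payload; return a list of problems."""
    bag_dir = Path(bag_dir)
    problems = []
    listed = set()
    for line in (bag_dir / "manifest-md5.txt").read_text().splitlines():
        if not line.strip():
            continue
        checksum, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        listed.add(rel_path)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif md5_of(target) != checksum.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    # Payload files that the manifest does not mention are also a problem.
    for path in sorted((bag_dir / "data").rglob("*")):
        rel_path = path.relative_to(bag_dir).as_posix()
        if path.is_file() and rel_path not in listed:
            problems.append(f"not in manifest: {rel_path}")
    return problems
```

A bag would pass these two checks when namaste_type(bag) equals the expected value (for example "UNTL_SIP_1.0") and validate_manifest(bag) returns an empty list.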

The dropbox is organized like this:

dropbox/
|-- 0.Staging
|-- 1.ToAIP
|-- 2.ToAIP-Error
|-- 3.ToACP
|-- 4.ToACP-Error
|-- 5.ToArchive
|-- 6.ToAubrey
|-- 7.ToAubreySorted
|-- 8.ToAubreySorted-Error
|-- dropbox_config.py
|-- makeACP.py
|-- makeACPSort.py
|-- makeAIP.py
`-- moveToCODA-002.sh

Content is loaded into the 1.ToAIP folder. If makeAIP fails on check 1 or 2 in the list above, the whole process stops. If both pass, makeAIP starts n AIPmaker processes to take advantage of the available processors on the ingest server. If an object fails check 3, 4 or 5, the process moves it to the 2.ToAIP-Error folder and continues with the rest. When makeAIP finishes running, it tells the operator how many SIPs were processed, how many succeeded and how many failed.
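
The batch behavior described here (work in parallel, set failures aside, keep going, report counts) can be sketched like this. It is a simplified stand-in rather than makeAIP.py itself: run_batch and process_one are hypothetical names, and a thread pool is used to keep the example short where the real script starts separate AIPmaker processes.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_batch(dropbox, process_one, workers=3):
    """Process every SIP folder in 1.ToAIP and sort the results.

    process_one(path) is a hypothetical callable that raises on failure.
    Returns a tuple of (processed, succeeded, failed).
    """
    dropbox = Path(dropbox)
    sips = sorted(p for p in (dropbox / "1.ToAIP").iterdir() if p.is_dir())

    def attempt(sip):
        try:
            process_one(sip)
            return sip, True
        except Exception:
            return sip, False

    succeeded = 0
    failed = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for sip, ok in pool.map(attempt, sips):
            if ok:
                succeeded += 1
                target = dropbox / "3.ToACP" / sip.name
            else:
                failed += 1
                target = dropbox / "2.ToAIP-Error" / sip.name
            # One bad object never stops the rest of the batch.
            shutil.move(str(sip), str(target))
    return len(sips), succeeded, failed
```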

Objects that are successfully processed are now AIPs, and they get moved to the 3.ToACP folder. A very similar process called makeACP.py is then run, which does the following steps:

  1. Check that the proper utilities are installed on the ingest server.
  2. Check the Namaste tag in the bag to make sure it is an AIP bag.
  3. Check the Oxum for the bag as a sanity check; the content didn’t move volumes, so this is sufficient.
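
For readers who haven’t met the Oxum before: BagIt lets you record a Payload-Oxum value in bag-info.txt, written as the total number of bytes in the payload, a period, and the number of payload files. A minimal sketch of that check, with function names made up for this example:

```python
from pathlib import Path


def payload_oxum(bag_dir):
    """Compute 'octet_count.stream_count' for everything under data/."""
    files = [p for p in (Path(bag_dir) / "data").rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return f"{total_bytes}.{len(files)}"


def declared_oxum(bag_dir):
    """Read the Payload-Oxum value from bag-info.txt, or None if absent."""
    for line in (Path(bag_dir) / "bag-info.txt").read_text().splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "payload-oxum":
            return value.strip()
    return None


def oxum_matches(bag_dir):
    """True when the declared Oxum is present and matches the payload."""
    declared = declared_oxum(bag_dir)
    return declared is not None and declared == payload_oxum(bag_dir)
```

This only needs file sizes and counts, so it is far cheaper than recomputing every checksum, which is why it works as a quick sanity check.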

The ACP is the package that goes to the public delivery system. Instead of large tiff images, wav files or low-compression video, we want to convert content into Web-size versions; this is done in a standardized way in the ACPmaker. The created ACP looks like this:

acp
|-- 0=UNTL_ACP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- metapth1234.mets.xml
|   |-- metapth1234.untl.xml
|   `-- web
|       `-- 01_tif
|           |-- 1999.001.001_01.jpg
|           |-- 1999.001.001_01.medium.jpg
|           |-- 1999.001.001_01.square.jpg
|           |-- 1999.001.001_01.thumbnail.jpg
|           |-- 1999.001.001_02.jpg
|           |-- 1999.001.001_02.medium.jpg
|           |-- 1999.001.001_02.square.jpg
|           `-- 1999.001.001_02.thumbnail.jpg
`-- manifest-md5.txt

If the ACP is successfully created, it moves to the 6.ToAubrey folder and the AIP that was being processed moves to the 5.ToArchive folder.

The last two steps are to move the ACP to the delivery system which we call Aubrey, and then the AIP to the archival system which we call CODA. I’ll post on those at a later point.

Finally, I’ve gotten a few questions about how much hand holding there is with this ingest process, so I thought I’d try to shed a little light on that. I typically try to run batches in sets of over 100. This is an arbitrary number, but it is typically large enough that you can start a process, go get something else done, and come back ready to start the next process. Below is an idea of what I run and how it looks on the command line. The times below are for about 100 issues of newspapers, which are roughly 180-240 MB in size per issue.

	
	#Create sip locally ~1 min to create sips
	local-desktop: python sipmaker.py -info-file=template.txt -suffix=tif 1999.001.001/
	
	#Move to ingest servers with rsync at ~35MB per sec
	local-desktop: bash move-data-to-dc_dropbox.sh
	
	#Create AIP with 3 worker processes ~15 min
	ingest-server: python makeAIP.py
	
	#Create ACP with 5 worker processes ~45 min 
	ingest-server: python makeACP.py
	
	#Sort ACPs into pairtree ~20 sec
	ingest-server: python makeACPSort.py
	
	#Move to coda system with rsync at ~35MB per sec
	ingest-server: bash move-data-to-coda-002.sh
	
	#Move to aubrey system with rsync at ~35MB per sec
	ingest-server: bash move-data-to-aubrey.sh

I didn’t mention the makeACPSort.py script yet. All this script does is take an ACP and put it into a PairTree structure that is used by Aubrey for organizing files on its filesystem.
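
The basic idea of PairTree is to split an identifier into two-character chunks and use each chunk as a directory level. A minimal sketch of that mapping is below. It assumes plain alphanumeric identifiers like ours, so it skips the character-cleaning rules in the PairTree specification, and the exact layout Aubrey uses underneath the last directory is not shown here.

```python
def pairtree_path(identifier):
    """Split an identifier into two-character directory segments."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs)


print(pairtree_path("metapth1234"))  # me/ta/pt/h1/23/4
```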

In another post I’ll talk about what happens in the AIPmaking step, the ACP making step and then what happens when things get to either coda or aubrey.

This post starts to mention the use of some specifications that are being developed by the California Digital Library, which fall into their “micro services” area. Here is a list of the mentioned specifications for quick reference.

Packaging content into Submission Information Packages

In the previous post I described the digital object model in use at the UNT Libraries for our digital content. This post is going to describe the process of getting content into our model.

First off, and I say this because it gets asked all the time, we do not by any means do the METS creation manually; far from it, in fact. Let’s start walking through the things we have set up.

Let’s talk about a collection of images for this example.

Say you digitized 100 images for a project, they all came to you with identifiers that were in some way important to the owners of the collection. Here is how the files look on disk when you are finished with them.

1999.001.001
  1999.001.001_01.tif
  1999.001.001_02.tif
  metadata.xml
1999.001.002
  1999.001.002_01.tif
  1999.001.002_02.tif
  metadata.xml
1999.002.001
  1999.002.001_01.tif
  1999.002.001_02.tif
  metadata.xml

What you have is three of the scanned photographs. You name a folder after the accession number the collection owners provided. Within each folder you will see two tif files, one for the front and one for the back of the photo. _01 is front and _02 is the back. Finally you will see a metadata.xml file which is the metadata record that was created when you scanned the image. It rides along with the digital object and not in some spreadsheet somewhere where it has to get matched up later.

The first step we have is to normalize input content a little so that we can move it into an ingest system. This normalization is actually the creation of a Submission Information Package (SIP) for our system. Here is what a normalized folder structure would look like.

1999.001.001
  01_tif
    1999.001.001_01.tif
    1999.001.001_02.tif
  metadata.xml
1999.001.002
  01_tif
    1999.001.002_01.tif
    1999.001.002_02.tif
  metadata.xml
1999.002.001
  01_tif
    1999.002.001_01.tif
    1999.002.001_02.tif
  metadata.xml

All we did was to add another level of hierarchy to the structure. This 01_tif you see is a combination of two things, first is the 01 part which defines the manifestation order and the second part _tif is the label for the manifestation.

Here is an example structure with multiple manifestations from the last post.

0102is
  01_tif
    00001000tp.tif
    0000200000.tif
    0000300001.tif
    0000400002.tif
  02_pdf
    0102is.pdf
  03_html
    index.html
    page1.html
    page2.html
  04_txt
   0102is-introduction.txt
   0102is-table_of_contents.txt
  metadata.xml

In this example we have four manifestations, tif, pdf, html and txt, the order is defined by the initial sequence in the folder names.
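
Reading the order and label back out of those folder names is simple. Here is a small sketch, with function names invented for this example:

```python
def parse_manifestation(folder_name):
    """Split a folder name such as '01_tif' into (order, label)."""
    order, separator, label = folder_name.partition("_")
    if not separator or not order.isdigit() or not label:
        raise ValueError(f"not a manifestation folder: {folder_name!r}")
    return int(order), label


def ordered_labels(folder_names):
    """Return the manifestation labels sorted by their numeric prefix."""
    return [label for _, label in sorted(parse_manifestation(n) for n in folder_names)]


print(ordered_labels(["03_html", "01_tif", "04_txt", "02_pdf"]))
```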

What this input normalization does is to take the arbitrary organization that exists and put the bare minimum of control on it so that we can start to do some automated processing.

There are two other pieces of information that should be passed along about the SIPs that we create.

We decided about a year and a half ago that we would use the BagIt model for storing our content, and we actually create bags in this SIP-making stage. Here is what a complete SIP actually looks like for one of those photographs.

1999.001.001
  0=untl-sip-0.1
  bag-info.txt
  bagit.txt
  coda_directives.py
  data/
    01_tif
      1999.001.001_01.tif
      1999.001.001_02.tif
    metadata.xml
  manifest-md5.txt

Think of the data/ directory as the digital object folder. The folders inside of it correspond to manifestations in the digital object, the files get grouped into fileSets, and there is that metadata.xml file still hanging out in the object as well.

There is one other thing that has crept into the example and that is the coda_directives.py file.

This is a way to control the processing of objects later in the chain. It is a Python data structure with a set of instructions per manifestation, which are uncommented to affect the processing. Some of the values are:

  • does this manifestation use magicknumber ? (a way of representing page order and page number in the filename)
  • does this manifestation follow the UNT method of creating fileSets? (basically grouping on the first part of the filename)
  • what do you want the manifestation label to be?
  • what size do you want the large Web-size to be? (1200 or 1500 px across typically)
  • do you want to create tiles for this manifestation? (for a zooming interface)

It is easy to add new directives as they are needed.
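
To give a feel for the idea, here is a hypothetical directives structure. Every key name and default value below is an assumption made up for this illustration; the real coda_directives.py is not reproduced in this post and may be organized differently.

```python
# Hypothetical example only: key names and defaults are invented.
# Lines are uncommented to switch a directive on for a manifestation.
directives = {
    "01_tif": {
        "label": "tif",
        # "magicknumber": True,   # page order/number encoded in filenames
        # "unt_filesets": True,   # group files on the first part of the name
        "web_size": 1200,         # width of the large Web-size image, in px
        # "tiles": True,          # create tiles for a zooming interface
    },
}

# Assumed fallback values for anything left commented out.
DEFAULTS = {
    "label": None,
    "magicknumber": False,
    "unt_filesets": False,
    "web_size": 1500,
    "tiles": False,
}


def directives_for(manifestation):
    """Merge a manifestation's directives over the assumed defaults."""
    merged = dict(DEFAULTS)
    merged.update(directives.get(manifestation, {}))
    return merged


print(directives_for("01_tif"))
```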

One last thing. There is a tool we created that automates the process of normalizing things. The tool is called sipmaker.py and has two options: an info-file option, which assigns a base file to use as the bag-info.txt that gets added as part of the bag, and a suffix flag, where you can specify the name of the manifestation you are processing. If you have already sorted your input directory into folders you don’t have to specify the suffix, but if there are just a bunch of files living in the root of your object folder, the suffix will package them all into a 01_{suffix} folder.
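
The suffix behavior can be sketched as below. This is an illustration rather than sipmaker.py itself: the function name is invented, it assumes metadata.xml stays at the top of the object folder as in the examples above, and it leaves out the bag creation step entirely.

```python
import shutil
from pathlib import Path


def normalize_object(object_dir, suffix):
    """Move loose files in an object folder into a 01_{suffix} folder.

    Returns the new folder, or None if the object already has subfolders
    (i.e. it was sorted into manifestation folders by hand).
    """
    object_dir = Path(object_dir)
    if any(p.is_dir() for p in object_dir.iterdir()):
        return None
    target = object_dir / f"01_{suffix}"
    target.mkdir()
    for path in sorted(object_dir.iterdir()):
        # Assumption: the metadata record stays beside the manifestations.
        if path.is_file() and path.name != "metadata.xml":
            shutil.move(str(path), str(target / path.name))
    return target
```

Calling normalize_object on the 1999.001.001 folder with the suffix tif would produce the 01_tif layout shown earlier.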

This makes it easy to package many objects in a batch process for submission into our ingest system.