How we use BagIt to ingest and store digital objects.

At the UNT Libraries we use BagIt as a key component of our digital strategy. It is a simple specification that doesn’t get in your way but offers a few niceties which come in quite handy. Let me walk you through the three different bags that we have in our system: the SIP, the AIP, and the ACP (Access Content Package).

In the last post I described the layout of our SIPs; here it is again to refresh your memory.

sip
|-- 0=UNTL_SIP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- 01_tif
|   |   |-- 1999.001.001_01.tif
|   |   `-- 1999.001.001_02.tif
|   `-- metadata.xml
`-- manifest-md5.txt

The ingest process starts by submitting the SIP to one of the dropboxes we have set up for ingest. The dropboxes differ only in how names for objects are assigned: there is one dropbox per namespace (we currently name under the metapth, metadc, and metarkv namespaces).

Dropboxes structure the steps involved in the ingest of our digital content, and they are responsible for creating the AIPs and ACPs for our system. The first step is the creation of an AIP; we designed a packaging script called AIPmaker.py that does the heavy lifting in this process. It takes the input SIP, opens the coda_directives.py file, finds what’s in the data directory, and starts processing. For each file in the data directory a PREMIS stream, a JHOVE stream, and a File stream are saved either as separate files or included in the METS record (a rough sketch of this per-file step follows the directory listing below). This is what the resulting AIP looks like.

aip
|-- 0=UNTL_AIP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- data
|   |   `-- 01_tif
|   |       |-- 1999.001.001_01.tif
|   |       `-- 1999.001.001_02.tif
|   |-- metadata
|   |   |-- 17711fdb-2e25-4566-bf3f-daa172a12190.jhove.xml
|   |   `-- 9639acae-397a-4c90-8851-52b6f04c4d8d.jhove.xml
|   |-- metadata.xml
|   `-- metapth1234.aip.mets.xml
`-- manifest-md5.txt
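
As a rough illustration of the per-file step described above, here is a minimal sketch that walks a SIP payload and captures a JHOVE characterization stream for each file under a UUID-based name, mirroring the metadata directory in the listing. The jhove invocation and the helper name are assumptions; the real AIPmaker.py also produces the PREMIS and File streams and writes the METS record, which are omitted here.

# Hypothetical sketch only; not the production AIPmaker.py.
import os
import subprocess
import uuid

def characterize_payload(sip_data_dir, metadata_dir):
    """Run JHOVE over every payload file and save the XML report."""
    os.makedirs(metadata_dir, exist_ok=True)
    for root, _dirs, files in os.walk(sip_data_dir):
        for name in files:
            path = os.path.join(root, name)
            # "jhove -h xml <file>" asks JHOVE for an XML report
            report = subprocess.run(
                ["jhove", "-h", "xml", path],
                check=True, capture_output=True
            ).stdout
            out_path = os.path.join(metadata_dir,
                                    "%s.jhove.xml" % uuid.uuid4())
            with open(out_path, "wb") as out:
                out.write(report)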

AIPmaker.py has a management script called makeAIP.py, which is responsible for a few important things. This script runs a series of checks on the data before it starts processing content. Keep in mind that the data was moved to this ingest system from another system, and network hops require validation of bags in my book. Here are the steps that the makeAIP.py script takes (a rough sketch of these checks follows the list).

  1. Check that the proper utilities are installed on the ingest server.
  2. Check that the number server is online (the number server provides object names).
  3. Check the Namaste tag in the bag to make sure it is a SIP bag.
  4. Fully validate the bag.
  5. Get a new object name from the number server.
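
To make the order of operations concrete, here is a minimal sketch of what those pre-flight checks could look like. This is not the production makeAIP.py: the utility list, the number-server URL, and the helper names are all assumptions, and the full validation step is shown with the bagit-python library.

# Hypothetical sketch of the pre-flight checks; names and URLs are placeholders.
import os
import shutil
import urllib.request

import bagit  # Library of Congress bagit-python, assumed to be installed

NUMBER_SERVER = "http://numberserver.example/"      # placeholder URL
REQUIRED_UTILITIES = ["jhove", "convert"]           # illustrative list

def check_utilities():
    """1. Make sure the tools the AIP process shells out to are installed."""
    missing = [u for u in REQUIRED_UTILITIES if shutil.which(u) is None]
    if missing:
        raise RuntimeError("missing utilities: %s" % ", ".join(missing))

def check_number_server():
    """2. The number server hands out object names; stop if it is down."""
    urllib.request.urlopen(NUMBER_SERVER, timeout=5)

def check_namaste(sip_path, expected="0=UNTL_SIP_1.0"):
    """3. The Namaste tag tells us this bag really is a SIP."""
    if not os.path.exists(os.path.join(sip_path, expected)):
        raise ValueError("%s is not a SIP bag" % sip_path)

def validate_bag(sip_path):
    """4. Full validation re-checksums every payload file."""
    bagit.Bag(sip_path).validate()

def get_new_name(namespace="metapth"):
    """5. Ask the number server for the next identifier in a namespace."""
    with urllib.request.urlopen(NUMBER_SERVER + namespace, timeout=5) as resp:
        return resp.read().decode().strip()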

The dropbox is organized like this:

dropbox/
|-- 0.Staging
|-- 1.ToAIP
|-- 2.ToAIP-Error
|-- 3.ToACP
|-- 4.ToACP-Error
|-- 5.ToArchive
|-- 6.ToAubrey
|-- 7.ToAubreySorted
|-- 8.ToAubreySorted-Error
|-- dropbox_config.py
|-- makeACP.py
|-- makeACPSort.py
|-- makeAIP.py
`-- moveToCODA-002.sh

Content is loaded into the 1.ToAIP folder. If makeAIP fails on check 1 or 2 in the list above, the whole process stops; if those pass, makeAIP starts n AIPmaker processes to take advantage of the available processors on the ingest server. If there is an error in check 3, 4, or 5, the process moves the object to the 2.ToAIP-Error folder and continues. When makeAIP is finished running it lets the operator know how many SIPs were processed, how many were successful, and how many failed.
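
Here is one way that fan-out could look, sketched with Python’s multiprocessing module. The folder names match the dropbox layout above, but build_aip is a placeholder for the real AIPmaker step and the summary line is only illustrative.

# Hypothetical sketch of the worker fan-out; build_aip is a stand-in.
import os
import shutil
from multiprocessing import Pool

TO_AIP = "1.ToAIP"
TO_AIP_ERROR = "2.ToAIP-Error"

def build_aip(sip_path):
    """Placeholder for the real AIPmaker.py call."""
    pass

def process_one(sip_path):
    """Build one AIP; on failure, move the SIP to the error folder."""
    try:
        build_aip(sip_path)
        return True
    except Exception:
        shutil.move(sip_path, TO_AIP_ERROR)
        return False

def run(workers=3):
    """Process every SIP in 1.ToAIP with a small pool of workers."""
    sips = [os.path.join(TO_AIP, name) for name in os.listdir(TO_AIP)]
    with Pool(processes=workers) as pool:
        results = pool.map(process_one, sips)
    print("%d SIPs processed: %d succeeded, %d failed"
          % (len(results), sum(results), len(results) - sum(results)))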

Objects that are successfully processed are now AIPs, and they get moved to the 3.ToACP folder. A very similar process, called makeACP.py, is then run and performs the following steps:

  1. Check that the proper utilities are installed on the ingest server.
  2. Check the Namaste tag in the bag to make sure it is an AIP bag.
  3. Check the Oxum for the bag as a sanity check (sketched below); the content didn’t move between volumes, so this is sufficient.
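
The Payload-Oxum tag in bag-info.txt records the payload as “byte count.file count”, so the sanity check in step 3 only has to re-count what is on disk rather than re-checksum everything. A minimal sketch, with assumed helper names:

# Hypothetical sketch of an Oxum check; helper names are illustrative.
import os

def read_oxum(bag_path):
    """Return (octet_count, stream_count) from the Payload-Oxum tag."""
    with open(os.path.join(bag_path, "bag-info.txt")) as info:
        for line in info:
            if line.startswith("Payload-Oxum:"):
                octets, streams = line.split(":", 1)[1].strip().split(".")
                return int(octets), int(streams)
    raise ValueError("no Payload-Oxum tag in bag-info.txt")

def check_oxum(bag_path):
    """Raise if the payload on disk does not match the recorded Oxum."""
    octets = streams = 0
    for root, _dirs, files in os.walk(os.path.join(bag_path, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            streams += 1
    if (octets, streams) != read_oxum(bag_path):
        raise ValueError("Payload-Oxum mismatch for %s" % bag_path)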

The ACP is the package that goes to the public delivery system. Instead of large TIFF images, WAV files, or lightly compressed video, we want to convert content into web-sized versions; this is done in a standardized way by the ACPmaker (a sketch of the derivative step follows the listing below). The created ACP looks like this:

acp
|-- 0=UNTL_ACP_1.0
|-- bag-info.txt
|-- bagit.txt
|-- coda_directives.py
|-- data
|   |-- metapth1234.mets.xml
|   |-- metapth1234.untl.xml
|   `-- web
|       `-- 01_tif
|           |-- 1999.001.001_01.jpg
|           |-- 1999.001.001_01.medium.jpg
|           |-- 1999.001.001_01.square.jpg
|           |-- 1999.001.001_01.thumbnail.jpg
|           |-- 1999.001.001_02.jpg
|           |-- 1999.001.001_02.medium.jpg
|           |-- 1999.001.001_02.square.jpg
|           `-- 1999.001.001_02.thumbnail.jpg
`-- manifest-md5.txt
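
For the image case shown above, the derivative step could look something like this Pillow sketch. The pixel sizes and quality settings are assumptions, not the ACPmaker’s actual values, and audio or video content would of course go through different converters.

# Hypothetical sketch using Pillow; sizes and quality values are made up.
from PIL import Image

SIZES = {"medium": 600, "thumbnail": 120}   # assumed maximum dimensions
SQUARE = 75                                 # assumed square crop size

def make_derivatives(tiff_path, out_base):
    """Write out_base.jpg, .medium.jpg, .square.jpg, and .thumbnail.jpg."""
    img = Image.open(tiff_path).convert("RGB")
    img.save(out_base + ".jpg", quality=90)
    for label, max_dim in SIZES.items():
        scaled = img.copy()
        scaled.thumbnail((max_dim, max_dim))        # keeps aspect ratio
        scaled.save("%s.%s.jpg" % (out_base, label), quality=85)
    # Square: crop the center to a square, then scale it down.
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    square = img.crop((left, top, left + side, top + side)).resize((SQUARE, SQUARE))
    square.save(out_base + ".square.jpg", quality=85)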

If the ACP is created successfully it moves to the 6.ToAubrey folder, and the AIP that was being processed moves to the 5.ToArchive folder.

The last two steps are to move the ACP to the delivery system which we call Aubrey, and then the AIP to the archival system which we call CODA. I’ll post on those at a later point.

Finally, I’ve gotten a few questions about how much hand-holding there is with this ingest process, so I thought I’d try to shed a little light on that. I typically try to run batches in sets of over 100; this is an arbitrary number, but it is typically enough that you can start a process, go get something else done, and come back ready to start the next step. Below is an idea of what I run and how it looks on the command line. The times below are for about 100 issues of newspapers, which are roughly 180-240 MB in size per issue.

	
	#Create sip locally ~1 min to create sips
	local-desktop: python sipmaker.py -info-file=template.txt -suffix=tif 1999.001.001/
	
	#Move to ingest servers with rsync at ~35MB per sec
	local-desktop: bash move-data-to-dc_dropbox.sh
	
	#Create AIP with 3 worker processes ~15 min
	ingest-server: python makeAIP.py
	
	#Create ACP with 5 worker processes ~45 min 
	ingest-server: python makeACP.py
	
	#Sort ACPs into pairtree ~20 sec
	ingest-server: python makeACPSort.py
	
	#Move to coda system with rsync at ~35MB per sec
	ingest-server: bash move-data-to-coda-002.sh
	
	#Move to aubrey system with rsync at ~35MB per sec
	ingest-server: bash move-data-to-aubrey.sh

I didn’t mention the makeACPSort.py script yet; all it does is take an ACP and put it into a PairTree structure that Aubrey uses for organizing files on its filesystem.
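
For reference, pairtree maps an identifier to a directory path by splitting it into two-character pairs, so metapth1234 ends up under me/ta/pt/h1/23/4/metapth1234. Here is a minimal sketch of that mapping, with a placeholder root path (the real makeACPSort.py may differ in details such as identifier encoding):

# Hypothetical sketch of a pairtree move; the root path is a placeholder.
import os
import shutil

def pairtree_path(identifier, root="pairtree_root"):
    """Map an identifier to its pairtree directory."""
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return os.path.join(root, *pairs, identifier)

def sort_acp(acp_path, root="pairtree_root"):
    """Move a finished ACP into the pairtree used by Aubrey."""
    identifier = os.path.basename(acp_path.rstrip("/"))   # e.g. metapth1234
    destination = pairtree_path(identifier, root)
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.move(acp_path, destination)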

In another post I’ll talk about what happens in the AIP-making step and the ACP-making step, and then what happens when things get to either coda or aubrey.

This post starts to mention some specifications being developed by the California Digital Library that fall into their “micro-services” area; Namaste and PairTree are the ones used here and are worth a look for quick reference.

5 thoughts on “How we use BagIt to ingest and store digital objects.”

  1. Brian Kennison

    Mark – Thanks for this series of posts! I’ve been looking at your site and on your TRAC wiki but these posts are much clearer. Thanks for sharing your methods and processes.

    I was wondering about how (and how often) editing takes place. If a change needs to be made is the AIP retrieved, edited, and re-submitted as a DIP? Does the ingest process look for objects that already have an id?

  2. vphill (post author)

    Yes, the TRAC wiki isn’t very clear.

    We view coda (our digital archive thingy) as a write-once system. If there are changes that need to happen to an object, it really depends on what the change is. Here are some changes that don’t require re-ingest:

    * descriptive metadata editing
    * re-ordering of fileSets (page in a book out of order)

    If there is a change to a file (someone misspelled their name on an ETD) or a missing page from a book, then the process is to grab the AIP from coda (it becomes a DIP at that point, but the structure is the same), change what is needed, and then load it back into the ingest system. We have one more dropbox that doesn’t assign a new identifier; it is used both for legacy content ingest and for adding new versions into the system. coda will just maintain two “versions” of the object if it is re-ingested.

    Does that help?

  3. Brian Kennison

    Yes, that helps. It explains how the archive handles versions (essentially another manifestation). When a new version is ingested, I guess a new ACP is generated and overwrites the old ACP. Do you keep versions of the descriptive metadata if it changes? How/when are the different namespaces used?

  4. vphill (post author)

    Descriptive metadata is handled in our delivery platform, called Aubrey; I’ll do a few posts about that in the near future.

    We version all metadata modifications, both to the structure of the object (ordering of fileSets, adding labels to manifestations and fileSets, page numbers and such for a digital object) and to the descriptive metadata. This is helpful for rolling back changes that shouldn’t have been made, as well as for figuring out what happened if something that wasn’t supposed to be shared publicly got shared. As we look at raising the quality of our metadata, I think the ability to version all of it is one of the measures that plays into it.

    New ACPs can be run at any time (for example, if we decide to encode video or audio at a higher bitrate, change sizes, or just screw up on the delivery formats) by placing an AIP in the 3.ToACP folder in the desired dropbox and running the makeACP.py script again. It would then just overwrite the copy in aubrey (taking care not to write over the descriptive metadata; file-naming conventions help with this).

  5. Brian Kennison

    Thanks for the info! I’ve got some things in place but have a lot to do. I buy into the ideas expressed by the CDL of “permanent objects and disposable systems” and I’ve been trying to get an outline of what this repository system would look like. Your posts are most helpful.

