What do we put in our BagIt bag-info.txt files?

The UNT Libraries makes heavy use of the BagIt packaging format throughout our digital repository infrastructure. I’m of the opinion that BagIt has done more to move digital preservation forward in the last ten years than any other single technology, service, or specification. The UNT Libraries uses BagIt for our Submission Information Packages (SIP), our Archival Information Packages (AIP), our Dissemination Information Packages (DIP), and our local Access Content Package (ACP).

For those who don’t know BagIt, it is a set of conventions for packaging content into a directory structure in a consistent and repeatable way. There are a number of other descriptions of BagIt that do a very good job of explaining the conventions and some of the more specific bits of the specification.

There are a number of great tools for creating, modifying, and validating BagIt bags, and my favorite for a long time has been bagit-python from the Library of Congress. (To be honest, I am usually using Ed Summers’s fork, which I grab from here.)

The BagIt specification has a metadata file that is stored in the root of a bag; this metadata file is called bag-info.txt. The specification defines a number of fields for this file, which are stored as key-value pairs in the following format:

key: value

I thought it might be helpful for those new to BagIt to see what kinds of information we put into these bag-info.txt files, and also to explain some of the unique fields that we add to the file for managing items in our system. Below is a typical bag-info.txt file from one of our AIPs in the Coda Repository.

Bag-Size: 28.32M
Bagging-Date: 2015-01-23
CODA-Ingest-Batch-Identifier: f2dbfd7e-9dc5-43fd-975a-8a47e665e09f
CODA-Ingest-Timestamp: 2015-01-22T21:43:33-0600
Contact-Email: mark.phillips@unt.edu
Contact-Name: Mark Phillips
Contact-Phone: 940-369-7809
External-Description: Collection of photographs held by the University of North
 Texas Archives that were taken by Junebug Clark or other family
 members. Master files are tiff images.
External-Identifier: ark:/67531/metadc488207
Internal-Sender-Identifier: UNTA_AR0749-002-0016-0017
Organization-Address: P. O. Box 305190, Denton, TX 76203-5190
Payload-Oxum: 29666559.4
Source-Organization: University of North Texas Libraries
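Reading a file like this back in takes very little code. The sketch below is not how bagit-python does it; it is a minimal standalone parser, and the function name parse_bag_info is made up for this example. It assumes that a line starting with whitespace continues the previous value (as External-Description does above), and it returns a list of pairs rather than a dictionary because the specification allows a label to appear more than once.

```python
def parse_bag_info(text):
    """Parse bag-info.txt text into a list of (label, value) pairs."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and fields:
            # Indented line: fold it into the previous field's value.
            label, value = fields[-1]
            fields[-1] = (label, value + " " + line.strip())
        else:
            # Split on the first colon only, so values such as
            # "ark:/67531/metadc488207" keep their own colons.
            label, _, value = line.partition(":")
            fields.append((label.strip(), value.strip()))
    return fields


sample = (
    "Bag-Size: 28.32M\n"
    "External-Description: Collection of photographs held by the University of North\n"
    " Texas Archives that were taken by Junebug Clark or other family\n"
    " members. Master files are tiff images.\n"
    "External-Identifier: ark:/67531/metadc488207\n"
)

# A dict is fine here because no label repeats in this sample.
info = dict(parse_bag_info(sample))
print(info["External-Identifier"])
```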

In the example above, several of the fields are boilerplate, and others are machine generated.

Field                          How we create the value
Bag-Size                       Machine
Bagging-Date                   Machine
CODA-Ingest-Batch-Identifier   Machine
CODA-Ingest-Timestamp          Machine
Contact-Email                  Boilerplate
Contact-Name                   Boilerplate
Contact-Phone                  Boilerplate
External-Description           Changes per “collection”
External-Identifier            Machine
Internal-Sender-Identifier     Machine
Organization-Address           Boilerplate
Payload-Oxum                   Machine
Source-Organization            Boilerplate
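To give a feel for what “Machine” means for two of these fields, here is a rough sketch of how Payload-Oxum and Bagging-Date could be derived. It is not our ingest code; the function names payload_oxum and machine_fields are invented for the example. Payload-Oxum is the total number of bytes in the payload, a dot, and the number of payload files, so the value 29666559.4 above describes 29,666,559 bytes spread across 4 files.

```python
import os
from datetime import date


def payload_oxum(data_dir):
    """Return (total_bytes, file_count) for every file under data_dir."""
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return total_bytes, file_count


def machine_fields(data_dir, today=None):
    """Build the machine-generated fields for a payload directory.

    `today` is only a hook for testing; by default the current date is used.
    """
    octets, streams = payload_oxum(data_dir)
    return {
        "Payload-Oxum": f"{octets}.{streams}",
        "Bagging-Date": (today or date.today()).isoformat(),
    }
```

Calling machine_fields("some-bag/data") on a bag's payload directory would return a dictionary that can be merged with the boilerplate fields before the file is written.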

You can tell from looking at the example bag-info.txt file above that some of the fields are self-explanatory. I’m going to run through a few of the fields that either are non-standard or that we made explicit decisions about as we were implementing BagIt.

CODA-Ingest-Batch-Identifier is a UUID for each batch of content added to our Coda Repository. It helps us identify other items that were added during a specific run of our ingest process, which is helpful for troubleshooting.

CODA-Ingest-Timestamp is the timestamp when the AIP was added to the Coda Repository.

External-Description will change for each collection that gets processed. It has just enough information about the collection to help jog someone’s memory about where this item came from and why it was created.

External-Identifier is the ARK identifier assigned to the item on ingest into one of the Aubrey systems where we access the items or manage the descriptive metadata.

Internal-Sender-Identifier is the locally important (often not unique) identifier for the item as it is being digitized or collected. It often takes the shape of an accession number from our University Special Collections, or the folder name of an issue of a newspaper.
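The two CODA fields described above are easy to produce with the Python standard library. The sketch below is an illustration, not our ingest code. The function names are made up, and the choice of a random (version 4) UUID is an assumption based only on the look of the example value. The timestamp format string is chosen to match the example in the file above.

```python
import uuid
from datetime import datetime, timedelta, timezone


def new_batch_identifier():
    """Return a fresh random UUID to share across one ingest run."""
    return str(uuid.uuid4())


def ingest_timestamp(moment):
    """Format a timezone-aware datetime like 2015-01-22T21:43:33-0600."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S%z")


# One identifier per batch, reused for every bag ingested in that run.
batch_id = new_batch_identifier()

# A fixed UTC-6 offset is used here so the example is reproducible.
central = timezone(timedelta(hours=-6))
stamp = ingest_timestamp(datetime(2015, 1, 22, 21, 43, 33, tzinfo=central))
print(stamp)
```

Because every bag from one run carries the same batch identifier, finding the siblings of a problem item is a matter of searching for that one string.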

We currently have 1,070,180 BagIt bags in our Coda Repository, and they have been instrumental in allowing us to scale our digital library infrastructure and to verify that each item is exactly the same as when we added it to our collection.
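That verification comes from the checksum manifest that every bag carries alongside bag-info.txt. As a rough illustration of the idea (not our production code, and tools like bagit-python already handle this for you), the sketch below re-hashes each payload file and compares it to the manifest. The function name verify_manifest and the sha256 default are assumptions for the example; a bag may use a different algorithm, which is why it is a parameter.

```python
import hashlib
import os


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the payload paths whose checksum no longer matches the manifest."""
    failures = []
    manifest = os.path.join(bag_dir, f"manifest-{algorithm}.txt")
    with open(manifest, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            # Each manifest line is "<checksum> <relative path>".
            expected, rel_path = line.rstrip("\n").split(None, 1)
            digest = hashlib.new(algorithm)
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                # Read in 1 MiB chunks so large master files fit in memory.
                for chunk in iter(lambda: payload.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                failures.append(rel_path)
    return failures
```

An empty list means every file listed in the manifest still matches its recorded checksum.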

If you have any specific questions for me, let me know on Twitter.