November | 2010 | mark e. phillips journal

In discussions of digital preservation services and workflows, the topic of aggregation comes up quite often. In fact, it is one of those topics that tends to be discussed to death whenever a group of individuals representing various institutions comes together to talk about digital content. I’ve noticed that at the end of these conversations we are usually in about the same place as when we started, for the simple reason that each institution has its own idea of how it manages content.

The problem of aggregations comes up immediately when we discuss moving content between institutions and in and out of preservation repositories. The following example will hopefully demonstrate some of the concepts and challenges.

My institution is a partner in the NEH-funded National Digital Newspaper Program. We are responsible for digitizing 100,000 pages of early Texas newspapers and delivering that content to the Library of Congress (LOC) during the two-year funding cycle. The 100,000 newspaper pages are divided up into aggregations called batches. Each batch typically contains around 10,000 pages spread across anywhere from 1,500 to 2,000 issues of newspapers. Each batch is sent to LOC for validation against their specification for newspaper digitization and, if accepted, is ingested into the Chronicling America (http://chroniclingamerica.loc.gov/) delivery application. When a batch reaches LOC it is bagged, and for the rest of its existence there it moves around as a bag that conforms to the BagIt specification.

My institution thinks about the digital content created for this project a little differently when it comes to the level of aggregation it uses for organizing, describing and moving the newspaper pages. We’ve decided that the level of aggregation we are interested in is the issue level. We take the same batches we deliver to LOC, convert them into our local model for digital content, add some metadata and package them up as bags for ingest into our local repository (http://texashistory.unt.edu/).

Both institutions are working with the exact same content yet package and manage it in fundamentally different ways. The one unifying feature is that we both use the BagIt specification, so in theory each of us could manage the other’s content in our own systems (though possibly with a limited level of service) if that were desired.
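To make the difference concrete, here is a minimal sketch of the same pages packaged at the two levels of aggregation. It is not the tooling either institution actually uses. The sample paths, the `write_bag`, `bag_by_batch` and `bag_by_issue` helpers, and the choice of a SHA-256 manifest with a `BagIt-Version: 1.0` declaration are all assumptions made for illustration; the only thing taken from the BagIt specification is the basic layout of a bag (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest).

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample content: "issue/page" -> page bytes.
SAMPLE_PAGES = {
    "issue_1890-01-01/0001.txt": b"front page",
    "issue_1890-01-01/0002.txt": b"second page",
    "issue_1890-01-08/0001.txt": b"front page, next week",
}


def write_bag(bag_dir, payload):
    """Write a minimal bag: bagit.txt, a data/ payload, and a SHA-256 manifest."""
    (bag_dir / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    # Version and encoding are assumed values for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def bag_by_batch(root, batch_name, pages):
    """Batch-level aggregation: every page of the batch goes into one bag."""
    write_bag(root / batch_name, pages)
    return [root / batch_name]


def bag_by_issue(root, pages):
    """Issue-level aggregation: one bag per newspaper issue."""
    issues = defaultdict(dict)
    for rel_path, content in pages.items():
        issue_id, page_name = rel_path.split("/", 1)
        issues[issue_id][page_name] = content
    bags = []
    for issue_id, issue_pages in sorted(issues.items()):
        write_bag(root / issue_id, issue_pages)
        bags.append(root / issue_id)
    return bags


def demo():
    """Package the sample pages both ways and return (batch bags, issue bags)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        batch_bags = bag_by_batch(root / "batch_level", "batch_demo", SAMPLE_PAGES)
        issue_bags = bag_by_issue(root / "issue_level", SAMPLE_PAGES)
        return len(batch_bags), len(issue_bags)
```

Calling `demo()` packages the three sample pages once as a single batch-level bag and once as two issue-level bags. The payload bytes are identical in both cases; only the boundary drawn around them changes.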

Which is right? It doesn’t matter.

So why am I writing about this? Right now there are several independent conversations going on in the digital preservation community about the possibility of building services that would use the BagIt specification as the submission and dissemination format for a digital preservation service. I’ve thought of this as a BagTorrent sort of system; others (Michael Giarlo) have proposed a Dropbox-type system for bags. I think this discussion is quite interesting, and such a service would help institutions like my own work with some of the preservation systems out there to provide a geographically separate copy of the content we are responsible for here at UNT.


Level of aggregation / aggravation