I had a conversation with someone last week about how libraries “integrate” born-digital content into their collections and wanted to do a little experiment to see exactly what the steps are in the system we’ve created here at UNT.
I decided to use the electronic book “A Guide to Distributed Digital Preservation” from the MetaArchive Cooperative. I think a text like this should be a part of the collection at any library that talks about “digital preservation.”
For the record, here is the link to the text: http://www.metaarchive.org/GDDP
Below are the general steps I had to go through in order to end up with http://digital.library.unt.edu/ark:/67531/metadc12850/ which is our copy in the UNT Digital Library.
Step 1. Download file.
Pretty easy step. I just did the good old right-click, Save As, and I had it on my computer.
Step 2. Convert PDF to a series of jpg images
We’ve made the decision to start converting PDF documents we add to our system into a series of jpg or tiff images in addition to saving the PDF file. We do this for three reasons:

a. We want to use our page-turning system, with word highlighting and all of that stuff.

b. We want to count the number of “pages” we have in our system instead of just one file for the whole PDF.

c. It isn’t a horrible idea to have another format for a born-digital file; converting to an image format gives you most of the functionality of the PDF if for some reason the source becomes unusable over time.
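The actual conversion happens with an in-house script, but as a rough illustration, here is a minimal Python sketch that shells out to Poppler’s pdftoppm. This is not the script we use; the tool choice, resolution, and output prefix are all my own stand-ins.

```python
import subprocess

def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    """Build an argv list that renders each PDF page to a JPEG.

    pdftoppm names output files out_prefix-1.jpg, out_prefix-2.jpg, ...
    """
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, out_prefix]

def convert(pdf_path, out_prefix, dpi=300):
    # Requires Poppler's pdftoppm binary on the PATH.
    subprocess.run(build_pdftoppm_cmd(pdf_path, out_prefix, dpi), check=True)

print(build_pdftoppm_cmd("GDDP.pdf", "page"))
```

From here the numbered JPEGs can be renamed and processed like any other scanned item.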
Step 3. Rename jpg files.
When I convert the PDF to jpg files I use a script that gives me filenames that look like this:
00000.jpg 00001.jpg 00002.jpg 00003.jpg
I want to know the page number printed on each page as well as the sequence of the page. So I use something we call “magicknumbers” (I have no idea if there is actually a name for this way of naming files or not).
We use ACDSee to do the numbering in our lab, so I used that. I ended up with filenames that look like this:
000100fc.jpg ... 01210106.jpg 01220107.jpg 01230108.jpg
Basically you have two sequences of four digits: the first is the running sequence of files and the second matches up with the number notated on the page. There are codes for other things like front covers, back covers, and title pages, and two ways of handling Roman numerals. Our ingest system has a flag that allows us to parse these filenames and insert the page number into the correct place in our METS records.
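As a sketch of how a filename in this scheme can be split apart, here is a small Python parser. Only the “fc” (front cover) code appears in the example filenames above, so treat the handling of non-numeric codes here as a guess rather than a description of what our ingest system actually does.

```python
import re

def parse_magicknumber(filename):
    """Split a magicknumber filename into (sequence, page label).

    The first four characters are the running file sequence; the next
    four are either the printed page number or a code like 'fc'.
    """
    m = re.match(r"^(\d{4})(.{4})\.jpg$", filename)
    if not m:
        raise ValueError("not a magicknumber filename: %s" % filename)
    seq = int(m.group(1))
    label = m.group(2)
    if label.isdigit():
        label = str(int(label))   # strip leading zeros: '0107' -> '107'
    else:
        label = label.lstrip("0")  # '00fc' -> 'fc'
    return seq, label

print(parse_magicknumber("01220107.jpg"))  # (122, '107')
print(parse_magicknumber("000100fc.jpg"))  # (1, 'fc')
```

A parser like this is all the ingest flag mentioned above needs in order to drop the page label into the right spot in a METS record.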
Step 4. OCR
We run Prime Recognition’s PrimeOCR product for all content that isn’t newspapers (we use ABBYY for that stuff), so I just ran it through that system with our default configuration.
The output of this process is a set of .pro files, which are the raw output from the OCR engine. We have a script that will then generate a txt file and a bounding-box file from each .pro file.
Step 5. Create metadata
We could have created metadata at any point so far or at any point later on in this process. It seemed as good a time to do this as any, so I sent it over to our metadata librarian, who sent it back within the hour with the metadata completed.
Step 6. Create Submission Information Package (SIP)
I create a SIP from the object to ingest into our system. This digital object has a metadata record and two manifestations (a jpg manifestation and a PDF manifestation) and is ready to go.
Packaging takes about 20 seconds. When finished I have a BagIt Bag to send to our ingest system.
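Our packaging tooling is internal, but for readers unfamiliar with BagIt, here is a minimal stdlib-only sketch of what creating a bag involves. A real workflow would typically use a library such as the Library of Congress’s bagit-python rather than rolling this by hand.

```python
import hashlib
import os
import shutil

def make_bag(directory):
    """Turn a directory of payload files into a minimal BagIt bag in place."""
    # Move the existing payload into the required data/ directory.
    entries = os.listdir(directory)
    data_dir = os.path.join(directory, "data")
    os.mkdir(data_dir)
    for name in entries:
        shutil.move(os.path.join(directory, name), data_dir)

    # Required bag declaration.
    with open(os.path.join(directory, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")

    # Payload manifest: one MD5 checksum per payload file.
    with open(os.path.join(directory, "manifest-md5.txt"), "w") as f:
        for root, _, files in os.walk(data_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                digest = hashlib.md5(open(path, "rb").read()).hexdigest()
                rel = os.path.relpath(path, directory)
                f.write("%s  %s\n" % (digest, rel.replace(os.sep, "/")))
```

The checksums in the manifest are what let the receiving end verify that nothing was damaged in transfer, which is exactly why a Bag is a convenient thing to hand to an ingest system.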
Step 7. Rsync to ingest system.
I push the Bag to our ingest system’s dropbox for the UNT Digital Library namespace.
Step 8. Create Archival Information Package (AIP)
I run our makeAIP.py script and out pops an AIP that is ready to go to our digital archive.
Step 9. Create Access Content Package (ACP)
I run our makeACP.py script and all the thumbnails and various-sized web-scale images come out the other end; these will go to our content delivery system.
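makeACP.py is internal, but derivative generation is a common enough pattern that a generic sketch may be useful. This one uses the Pillow imaging library; the size names and pixel dimensions are my own invention, not what our system actually produces.

```python
import os

from PIL import Image

# Hypothetical derivative sizes (longest edge in pixels); the real ACP
# sizes in our system may differ.
SIZES = {"thumbnail": 120, "small": 400, "medium": 800}

def make_derivatives(jpg_path, out_dir):
    """Write one resized JPEG per named size, preserving aspect ratio."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, max_px in SIZES.items():
        im = Image.open(jpg_path)
        im.thumbnail((max_px, max_px))  # shrinks in place, keeps aspect ratio
        out_path = os.path.join(out_dir, "%s_%d.jpg" % (name, max_px))
        im.convert("RGB").save(out_path, "JPEG")
        written.append(out_path)
    return written
```

Run once per page image, this yields the whole set of web-scale files that a delivery system like ours serves.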
Step 10. Move ACP and AIP
I rsync the ACP to Aubrey and the AIP to Coda.
Step 11. Index item that was added to Aubrey.
We typically index all the content we add when it is added; if we don’t for some reason, then it gets indexed at the end of the day with our normal “reindex everything that got added or changed today” cron job.
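The nightly job itself is internal, but conceptually it just selects everything whose record changed today and hands those ids to the indexer. A toy sketch of that selection (the record structure here is invented for illustration):

```python
import datetime

def changed_today(records, today=None):
    """Return the ids of records added or modified on the given date."""
    today = today or datetime.date.today()
    return [r["id"] for r in records if r["modified"].date() == today]

records = [
    {"id": "metadc12850", "modified": datetime.datetime(2010, 5, 3, 14, 0)},
    {"id": "metadc00001", "modified": datetime.datetime(2010, 4, 1, 9, 0)},
]
print(changed_today(records, today=datetime.date(2010, 5, 3)))  # ['metadc12850']
```

Anything the per-item indexing missed during the day gets swept up by this pass.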
It took less than an hour for me to run this document http://www.metaarchive.org/GDDP through our workflow and end up with this http://digital.library.unt.edu/ark:/67531/metadc12850/. Most of the time was spent waiting on the metadata creation, which required interfacing with someone else.
I have absolutely no idea how much work would be involved in getting a physical version of the book on the shelf, but that is an interesting thing to look into.