A few months ago I was following a conversation on Twitter that got me thinking about how much bit-for-bit duplication there was in our preservation repository and how much space that duplication amounted to.
I let this curiosity sit for a few months and finally pulled the data from the repository in order to get some answers.
Getting the data
Each of the digital objects in our repository has a METS record that conforms to the UNTL-AIP-METS Profile registered with the Library of Congress. One of the features of this METS profile (like many others) is its use of the fileStruct section; for each file in a digital object, the following pieces of information exist:
| ADMID | amd_00013 amd_00015 amd_00014 |
By extracting this information for each file in each digital object, I would be able to get at my initial question about duplication at the file level and how much space it accounts for in the repository.
At the time of writing this post, the Coda Repository, which acts as the preservation repository for the UNT Libraries Digital Collections, contains 1.3 million digital objects that occupy 285 TB of primary data. These 1.3 million digital objects consist of 151 million files that have fixity values in the repository.
The dataset that I extracted has 1,123,228 digital objects because it was extracted a few months ago. Another piece of information that is helpful to know is that the numbers we report for “files managed by Coda” (the 151 million mentioned above) include both the primary files ingested into the repository as well as the metadata files added to the Archival Information Packages as they are ingested. The analysis in this post deals only with the primary data files deposited with the initial SIP and does not include the extra metadata files. This dataset contains information about 60,164,181 files in the repository.
Analyzing the Data
Once I acquired the METS records from the Coda repository, I wrote a very simple script to extract information from the File section of each METS record and format it into a tab-separated dataset that I could use for subsequent analysis. Because some of the data is duplicated into each row to make processing easier, this resulted in a tab-separated file that is just over 9 GB in size (1.9 GB compressed) and contains 60,164,181 rows, one for each file.
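For what it's worth, a minimal sketch of that kind of extraction script, written in Python with lxml, might look like the following. The directory layout, output filename, and column order here are made up for illustration and are not necessarily what I used; the attribute names (CHECKSUM, CHECKSUMTYPE, SIZE, MIMETYPE, ADMID) are the ones from the file section described above.

```python
import csv
import glob

from lxml import etree

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"

with open("mets_dataset.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    # one METS record per digital object (paths are hypothetical)
    for path in glob.glob("mets_records/*.xml"):
        tree = etree.parse(path)
        for f in tree.iter(METS + "file"):
            flocat = f.find(METS + "FLocat")
            href = flocat.get(XLINK + "href") if flocat is not None else ""
            writer.writerow([
                path,                    # which digital object the file belongs to
                f.get("CHECKSUM", ""),   # MD5 fixity value (column 2)
                f.get("CHECKSUMTYPE", ""),
                f.get("SIZE", ""),       # size in bytes
                f.get("MIMETYPE", ""),
                f.get("ADMID", ""),
                href,                    # location of the file inside the package
            ])
```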
Here is a representation of a few rows of the data as a table.
My first step was to extract the column that stores the MD5 fixity value, sort that column, and then count the number of instances of each fixity value in the dataset. The command ends up looking like this:
cut -f 2 mets_dataset.tsv | sort | uniq -c | sort -nr | head
This worked pretty well and resulted in a list of the MD5 values that occur most often. This represents the duplication at the file level in the repository.
There are a few things to note here. First, because of the way that we version items in the repository, some duplication is expected simply from our versioning strategy. If you are interested in understanding the versioning process we use for our system and the overhead that occurs because of it, you can take a look at the whitepaper we wrote in 2014 on the subject.
Phillips, Mark Edward & Ko, Lauren. Understanding Repository Growth at the University of North Texas: A Case Study. UNT Digital Library. http://digital.library.unt.edu/ark:/67531/metadc306052/. Accessed September 26, 2015.
To get a better idea of the kinds of files that are duplicated in the repository, the following table shows fields for the top five most repeated files.
| Count | MD5 | Bytes | Mimetype | Common File Extension |
You can see that most of the files that are duplicated are very small: 0, 1, 2, and 3 bytes. The largest were JPEGs that appeared 1,905 times in the dataset, each 61,192 bytes. The file types represented in these top examples are txt, xml, and jpg.
Overall we see that for the 60,164,181 rows in the dataset, there are 59,177,155 unique MD5 hashes, which means that roughly 98% of the files in the repository are in fact unique. The remaining 987,026 rows are duplicate copies of files already present in the dataset, and those duplicated files correspond to 666,259 unique MD5 hashes that occur more than once.
So now we know that there is some duplication in the repository at the file level. Next I wanted to know what kind of effect this has on the storage we allocate. To answer that, I took the 666,259 fixity values that had duplicates and went back to pull the number of bytes for those files. I calculated the storage overhead for each of these fixity values as bytes × (instances − 1), subtracting one instance to remove the size of the initial copy and leave only the duplication overhead.
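As a rough sketch of that calculation (again in Python, and again assuming the hypothetical column layout from the extraction sketch above, with the MD5 in column 2 and the size in bytes in column 4), the whole thing fits in a few lines:

```python
import csv
from collections import Counter

counts = Counter()   # md5 -> number of files with that fixity value
size_for = {}        # md5 -> size in bytes of one instance

with open("mets_dataset.tsv", newline="") as tsv:
    for row in csv.reader(tsv, delimiter="\t"):
        md5, size = row[1], int(row[3])
        counts[md5] += 1
        size_for[md5] = size

duplicated = {md5: n for md5, n in counts.items() if n > 1}

# bytes * (instances - 1): every copy after the first counts as overhead
overhead = sum(size_for[md5] * (n - 1) for md5, n in duplicated.items())

print("rows in dataset:        ", sum(counts.values()))
print("unique md5 hashes:      ", len(counts))
print("hashes with duplicates: ", len(duplicated))
print("duplication overhead (bytes):", overhead)
```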
Here is the table for the ten most duplicated files to show that calculation.
| Count | MD5 | Bytes per File | Duplicate File Overhead (Bytes) |
After summing the overhead for each duplicated fixity value, I ended up with 2,746,536,537,700 bytes, or roughly 2.75 TB, of overhead due to file duplication in the Coda repository.
I don’t think there is much surprise that there is going to be some duplication of files in a repository. The most common duplicated file we have is a txt file with just one byte.
What I will do with this information I don’t really know. I think that the overall duplication across digital objects is a feature and not a bug; I like the idea of more redundancy when it is reasonable. It should be noted that this redundancy is often over files that, from what I can tell, carry very little information (i.e., TIFF images of blank pages, or txt files with 0, 1, or 2 bytes of data).
I do know that this kind of data can be helpful when talking with vendors that provide integrated “de-duplication services” in their storage arrays, though that de-duplication usually operates on a smaller unit than the entire file. It might be interesting to take a stab at seeing what effect different de-duplication methodologies and algorithms would have on a large collection of digital content, so if anyone has some interest and algorithms, I’d be game to give it a try.
That’s all for this post, but I have a feeling I might be dusting off this dataset in the future to take a look at some other information such as filesizes and mimetype information that we have in our repository.