
File Duplication in the UNT Libraries Digital Collections

Introduction

A few months ago I was following a conversation on Twitter that got me thinking about how much bit-for-bit duplication there was in our preservation repository and how much space that duplication amounted to.

I let this curiosity sit for a few months and finally pulled the data from the repository in order to get some answers.

Getting the data

Each of the digital objects in our repository has a METS record that conforms to the UNTL-AIP-METS Profile registered with the Library of Congress. One of the features of this METS profile (like many others) is its file section: for each file in a digital object, it records the following pieces of information:

Field Example Value
FileName  ark:/67531/metadc419149
CHECKSUM  bc95eea528fa4f87b77e04271ba5e2d8
CHECKSUMTYPE  MD5
USE  0
MIMETYPE  image/tiff
CREATED  2014-11-17T22:58:37Z
SIZE 60096742
FILENAME file://data/01_tif/2012.201.B0389.0516.TIF
OWNERID urn:uuid:295e97ff-0679-4561-a60d-62def4e2e88a
ADMID amd_00013 amd_00015 amd_00014
ID file_00005

By extracting this information for each file in each digital object, I could get at my initial question about duplication at the file level and how much space it accounts for in the repository.

Extracted Data

At the time of writing of this post the Coda Repository that acts as the preservation repository for the UNT Libraries Digital Collections contains 1.3 million digital objects that occupy 285TB of primary data. These 1.3 million digital objects consist of 151 million files that have fixity values in the repository.

The dataset that I extracted contains 1,123,228 digital objects because it was pulled a few months ago. Another piece of information that is helpful to know is that the numbers we report for “files managed by Coda” (the 151 million mentioned above) include both the primary files ingested into the repository and the metadata files added to the Archival Information Packages as they are ingested. The analysis in this post deals only with the primary data files deposited with the initial SIP and does not include those extra metadata files. This dataset contains information about 60,164,181 files in the repository.

Analyzing the Data

Once I acquired the METS records from the Coda repository, I wrote a very simple script to extract the information from the file section of each METS record and format it into a tab-separated dataset that I could use for subsequent analysis work. Because some of the data is duplicated into each row to make processing easier, this resulted in a tab-separated file just over 9 GB in size (1.9 GB compressed) that contains 60,164,181 rows, one for each file.
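The original script is not shown in this post, but a rough sketch of that kind of extraction, written here in Python with lxml (the record directory and the output filename are made up for the example), might look something like this:

# Sketch only: walk a directory of AIP METS records and write one TSV row per
# file entry in the METS file section. Paths and the output name are illustrative.
import csv
import glob
import os

from lxml import etree

NS = {"mets": "http://www.loc.gov/METS/", "xlink": "http://www.w3.org/1999/xlink"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

with open("mets_dataset.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for path in glob.glob("mets_records/*.aip.mets.xml"):
        tree = etree.parse(path)
        for f in tree.iterfind(".//mets:fileSec//mets:file", namespaces=NS):
            flocat = f.find("mets:FLocat", namespaces=NS)
            writer.writerow([
                os.path.basename(path),   # which METS record the file came from
                f.get("CHECKSUM"),
                f.get("CHECKSUMTYPE"),
                f.get("USE"),
                f.get("MIMETYPE"),
                f.get("CREATED"),
                f.get("SIZE"),
                flocat.get(XLINK_HREF) if flocat is not None else "",
            ])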

Here is a representation of a few rows of the data as a table.

METS File CHECKSUM CHECKSUMTYPE USE MIMETYPE CREATED SIZE FILENAME
metadc419149.aip.mets.xml bc95eea528fa4f87b77e04271ba5e2d8 md5 0 image/tiff 2014-11-17T22:58:37Z 60096742 file://data/01_tif/2012.201.B0389.0516.TIF
metadc419149.aip.mets.xml 980a81b95ed4f2cda97a82b1e4228b92 md5 0 text/plain 2014-11-17T22:58:37Z 557 file://data/02_json/2012.201.B0389.0516.json
metadc419544.aip.mets.xml 0fba542ac5c02e1dc2cba9c7cc436221 md5 0 image/tiff 2014-11-17T23:20:57Z 51603206 file://data/01_tif/2012.201.B0391.0539.TIF
metadc419544.aip.mets.xml 0420bff971b151442fa61b4eea9135dd md5 0 text/plain 2014-11-17T23:20:57Z 372 file://data/02_json/2012.201.B0391.0539.json
metadc419034.aip.mets.xml df33c7e9d78177340e0661fb05848cc4 md5 0 image/tiff 2014-11-17T23:42:16Z 57983974 file://data/01_tif/2012.201.B0394.0493.TIF
metadc419034.aip.mets.xml 334827a9c32ea591f8633406188c9283 md5 0 text/plain 2014-11-17T23:42:16Z 579 file://data/02_json/2012.201.B0394.0493.json
metadc419479.aip.mets.xml 4c93737d6d8a44188b5cd656d36f1e3d md5 0 image/tiff 2014-11-17T23:01:15Z 51695974 file://data/01_tif/2012.201.B0389.0678.TIF
metadc419479.aip.mets.xml bcba5d94f98bf48181e2159b30a0df4f md5 0 text/plain 2014-11-17T23:01:15Z 486 file://data/02_json/2012.201.B0389.0678.json
metadc419495.aip.mets.xml e2f4d1d7d4cd851fea817879515b7437 md5 0 image/tiff 2014-11-17T22:30:10Z 55780430 file://data/01_tif/2012.201.B0387.0179.TIF
metadc419495.aip.mets.xml 73f72045269c30ce3f5f73f2b60bf6d5 md5 0 text/plain 2014-11-17T22:30:10Z 499 file://data/02_json/2012.201.B0387.0179.json

My first step was to extract the column that stores the MD5 fixity value, sort it, and then count the number of instances of each fixity value in the dataset. The command ends up looking like this:

cut -f 2 mets_dataset.tsv | sort | uniq -c | sort -nr | head

This worked pretty well and resulted in a list of the MD5 values that occur most often, which represents the duplication at the file level in the repository.

Count Fixity Value
72,906 68b329da9893e34099c7d8ad5cb9c940
29,602 d41d8cd98f00b204e9800998ecf8427e
3,363 3c80c3bf89652f466c5339b98856fa9f
2,447 45d36f6fae3461167ddef76ecf304035
2,441 388e2017ac36ad7fd20bc23249de5560
2,237 e1c06d85ae7b8b032bef47e42e4c08f9
2,183 6d5f66a48b5ccac59f35ab3939d539a3
1,905 bb7559712e45fa9872695168ee010043
1,859 81051bcc2cf1bedf378224b0a93e2877
1,706 eeb3211246927547a4f8b50a76b31864

There are a few things to note here. First, because of the way we version items in the repository, some duplication is expected from the versioning strategy itself. If you are interested in understanding the versioning process we use and the overhead it introduces, you can take a look at the whitepaper we wrote in 2014 on the subject.

Phillips, Mark Edward & Ko, Lauren. Understanding Repository Growth at the University of North Texas: A Case Study. UNT Digital Library. http://digital.library.unt.edu/ark:/67531/metadc306052/. Accessed September 26, 2015.

To get a better idea of the kinds of files that are duplicated in the repository, the following table shows additional fields for the ten most repeated files.

Count MD5 Bytes Mimetype Common File Extension
72,906 68b329da9893e34099c7d8ad5cb9c940 1 text/plain txt
29,602 d41d8cd98f00b204e9800998ecf8427e 0 application/x-empty txt
3,363 3c80c3bf89652f466c5339b98856fa9f 20 text/plain txt
2,447 45d36f6fae3461167ddef76ecf304035 195 application/xml xml
2,441 388e2017ac36ad7fd20bc23249de5560 21 text/plain txt
2,237 e1c06d85ae7b8b032bef47e42e4c08f9 2 text/plain txt
2,183 6d5f66a48b5ccac59f35ab3939d539a3 3 text/plain txt
1,905 bb7559712e45fa9872695168ee010043 61,192 image/jpeg jpg
1,859 81051bcc2cf1bedf378224b0a93e2877 2 text/plain txt
1,706 eeb3211246927547a4f8b50a76b31864 200 application/xml xml

You can see that most of the duplicated files are very small: 0, 1, 2, and 3 bytes. The largest was a JPEG of 61,192 bytes that appears 1,905 times in the dataset. The file types for these top examples are txt, xml, and jpg.

Overall we see that for the 60,164,181 rows in the dataset there are 59,177,155 unique MD5 hashes, which means that roughly 98% of the files in the repository are in fact unique. The remaining 987,026 rows are bit-for-bit copies of some other file already in the repository, and they correspond to 666,259 distinct MD5 hashes that occur more than once.
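Those counts are easy to reproduce from the TSV; a rough sketch in Python (assuming the checksum sits in the second column of mets_dataset.tsv, as in the rows shown above) would be:

# Sketch only: count how often each MD5 value appears in the extracted dataset.
from collections import Counter

counts = Counter()
with open("mets_dataset.tsv") as tsv:
    for line in tsv:
        counts[line.split("\t")[1]] += 1   # column 2 holds the MD5 checksum

total_rows = sum(counts.values())                            # 60,164,181 rows here
unique_hashes = len(counts)                                  # 59,177,155 unique MD5 values
duplicate_rows = total_rows - unique_hashes                  # 987,026 redundant copies
repeated_hashes = sum(1 for c in counts.values() if c > 1)   # 666,259 hashes seen 2+ times
print(total_rows, unique_hashes, duplicate_rows, repeated_hashes)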

So now we know that there is some duplication in the repository at the file level. Next I wanted to know what effect this has on the storage we have allocated. I took the 666,259 fixity values that have duplicates and went back to pull the number of bytes for those files. For each of these fixity values I calculated the storage overhead as bytes × (instances − 1), removing the size of the initial copy so that only the duplication overhead remains.
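A compact sketch of that overhead calculation in Python, again assuming the column layout shown earlier (checksum in column 2, size in column 7 of mets_dataset.tsv):

# Sketch only: total duplication overhead, counted as bytes * (copies - 1) per checksum.
from collections import Counter

counts = Counter()
size_by_md5 = {}
with open("mets_dataset.tsv") as tsv:
    for line in tsv:
        cols = line.rstrip("\n").split("\t")
        md5, size = cols[1], int(cols[6])   # CHECKSUM and SIZE columns
        counts[md5] += 1
        size_by_md5[md5] = size

overhead = sum(size_by_md5[md5] * (n - 1) for md5, n in counts.items() if n > 1)
print("%d bytes of duplication overhead" % overhead)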

Here is the table for the ten most duplicated files to show that calculation.

Count MD5 Bytes per File Duplicate File Overhead (Bytes)
72,906 68b329da9893e34099c7d8ad5cb9c940 1 72,905
29,602 d41d8cd98f00b204e9800998ecf8427e 0 0
3,363 3c80c3bf89652f466c5339b98856fa9f 20 67,240
2,447 45d36f6fae3461167ddef76ecf304035 195 476,970
2,441 388e2017ac36ad7fd20bc23249de5560 21 51,240
2,237 e1c06d85ae7b8b032bef47e42e4c08f9 2 4,472
2,183 6d5f66a48b5ccac59f35ab3939d539a3 3 6,546
1,905 bb7559712e45fa9872695168ee010043 61,192 116,509,568
1,859 81051bcc2cf1bedf378224b0a93e2877 2 3,716
1,706 eeb3211246927547a4f8b50a76b31864 200 341,000

After summing the overhead for each duplicated fixity value, I ended up with 2,746,536,537,700 bytes, or roughly 2.75 TB, of overhead caused by file duplication in the Coda repository.

Conclusion

I don’t think it is much of a surprise that there is going to be some duplication of files in a repository. The most commonly duplicated file we have is a txt file containing just one byte.

What I will do with this information I don’t really know. I think that the overall duplication across digital objects is a feature and not a bug; I like the idea of more redundancy when it is reasonable. It should be noted that this redundancy is often over files that, from what I can tell, carry very little information (e.g., TIFF images of blank pages, or txt files with 0, 1, or 2 bytes of data).

I do know that this kind of data can be helpful when talking with vendors that provide integrated “de-duplication services” in their storage arrays, though that de-duplication usually works at a smaller unit than the entire file. It might be interesting to take a stab at seeing what effect different de-duplication methodologies and algorithms would have on a large collection of digital content, so if anyone has some interest and algorithms, I’d be game to give it a try.

That’s all for this post, but I have a feeling I might be dusting off this dataset in the future to take a look at some of the other information we have about our repository, such as file sizes and mimetypes.