On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.

I was interested in answering the following questions:

What are the average number and standard deviation of subjects-per-record in the DPLA?
How does this number compare across the partners?
Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?

Building the Dataset

The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.

With a little work I was able to parse all of the metadata records and extract the information I was interested in working with, specifically the subjects for each record.

After parsing the records to get a list of subjects per record, along with the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could use for a few different projects.

Loading the records into the Solr index took quite a while (loading ~1,000 documents per second into Solr).
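To make the parsing step concrete, here is a minimal sketch of how the hub name and subject count could be pulled out of a single record. It assumes each record is a JSON object with the subjects under `_source.sourceResource.subject` and the hub under `_source.provider.name`; that layout, the function name, and the output field names are assumptions for illustration, not a description of the code actually used.

```python
def subject_summary(record):
    """Return the hub name and number of subjects for one record.

    Assumed (hypothetical) layout: record["_source"]["sourceResource"]["subject"]
    holds the subjects and record["_source"]["provider"]["name"] holds the hub.
    """
    source = record.get("_source", record)
    subjects = source.get("sourceResource", {}).get("subject", [])
    # A lone subject may appear as a single object rather than a list.
    if isinstance(subjects, (dict, str)):
        subjects = [subjects]
    hub = source.get("provider", {}).get("name", "unknown")
    return {"hub": hub, "subject_count": len(subjects)}


example = {
    "_source": {
        "provider": {"name": "The Portal to Texas History"},
        "sourceResource": {"subject": [{"name": "Texas"}, {"name": "Maps"}]},
    }
}
print(subject_summary(example))
```

Each dictionary returned this way is small enough to send to Solr as one document, with `subject_count` stored as a numeric field.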

After a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr’s built-in StatsComponent functionality. For a description of this component, see the documentation on Solr’s site.
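As a rough illustration of what a StatsComponent request looks like, the sketch below builds a query URL using the `stats`, `stats.field`, and `stats.facet` parameters. The core URL and the field names `subject_count` and `hub` are hypothetical, and this is not necessarily the exact query used for the numbers in this post.

```python
from urllib.parse import urlencode


def stats_url(base_url, field, facet_field=None):
    """Build a Solr select URL that asks the StatsComponent for statistics
    on `field`, optionally broken down by the values of `facet_field`."""
    params = [
        ("q", "*:*"),        # match every document
        ("rows", "0"),       # only the statistics are needed, not the documents
        ("wt", "json"),
        ("stats", "true"),
        ("stats.field", field),
    ]
    if facet_field:
        params.append(("stats.facet", facet_field))
    return base_url + "?" + urlencode(params)


# Hypothetical core location and field names.
url = stats_url("http://localhost:8983/solr/dpla/select", "subject_count", "hub")
print(url)
```

Leaving out `facet_field` gives the statistics for the whole index, while passing it gives one set of statistics per hub.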

Answering the questions

The average number of subjects per record in the DPLA is 2.99 with a standard deviation of 3.90. There are records with 0 subjects (1,827,276) and records with as many as 1,476 subjects (this record btw).

Answering question number two involved a small script to create a table for us; you will find that table below.
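A script of this kind might look like the following sketch. It assumes the Solr JSON response nests the per-hub numbers under `stats.stats_fields.<field>.facets.<facet field>`, and the field names `subject_count` and `hub` are hypothetical; the sample response is made up purely to show the shape.

```python
COLUMNS = ["min", "max", "count", "sum", "sumOfSquares", "mean", "stddev"]


def hub_rows(response, stats_field="subject_count", facet_field="hub"):
    """Turn a faceted StatsComponent response into tab-separated table rows."""
    facets = response["stats"]["stats_fields"][stats_field]["facets"][facet_field]
    rows = ["\t".join(["Hub Name"] + COLUMNS)]
    for hub in sorted(facets):
        values = facets[hub]
        rows.append("\t".join([hub] + [str(values[c]) for c in COLUMNS]))
    return rows


# Tiny invented response, only to demonstrate the assumed structure.
sample = {"stats": {"stats_fields": {"subject_count": {"facets": {"hub": {
    "Example Hub": {"min": 0, "max": 3, "count": 2, "sum": 3,
                    "sumOfSquares": 9, "mean": 1.5, "stddev": 2.1213},
}}}}}}
for row in hub_rows(sample):
    print(row)
```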

Hub Name	min	max	count	sum	sumOfSquares	mean	stddev
ARTstor	0	71	56,342	194,948	1351826	3.460083064	3.467168662
Biodiversity Heritage Library	0	118	138,288	454,624	3100134	3.287515909	3.407385646
David Rumsey	0	4	48,132	22,976	33822	0.477353943	0.689083212
Digital Commonwealth	0	199	124,804	295,778	1767426	2.369940066	2.923194479
Digital Library of Georgia	0	161	259,640	1,151,369	8621935	4.43448236	3.680038874
Harvard Library	0	17	10,568	26,641	88155	2.520912188	1.409567895
HathiTrust	0	92	1,915,159	2,614,199	6951217	1.365003637	1.329038361
Internet Archive	0	68	208,953	385,732	1520200	1.84602279	1.966605872
J. Paul Getty Trust	0	36	92,681	32,999	146491	0.356049244	1.20575216
Kentucky Digital Library	0	13	127,755	26,009	82269	0.203584987	0.776219692
Minnesota Digital Library	1	78	40,533	202,484	1298712	4.995534503	2.661891328
Missouri Hub	0	139	41,557	97,115	606761	2.336910749	3.023203782
Mountain West Digital Library	0	129	867,538	2,641,065	17734515	3.044321978	3.34282307
National Archives and Records Administration	0	103	700,952	231,513	1143343	0.330283671	1.233711342
North Carolina Digital Heritage Center	0	1,476	260,709	869,203	8394791	3.333996908	4.591774892
Smithsonian Institution	0	548	897,196	5,763,459	56446687	6.423857217	4.652809633
South Carolina Digital Library	0	40	76,001	231,270	1125030	3.042986277	2.354387181
The New York Public Library	0	31	1,169,576	1,996,483	6585169	1.707014337	1.648179106
The Portal to Texas History	0	1,035	477,639	5,257,702	69662410	11.00768991	4.96771802
United States Government Printing Office (GPO)	0	30	148,715	457,097	1860297	3.073644219	1.749820977
University of Illinois at Urbana-Champaign	0	22	18,103	67,955	404383	3.753797713	2.871821391
University of Southern California. Libraries	0	119	301,325	863,535	4626989	2.865792749	2.672589058
University of Virginia Library	0	15	30,188	95,328	465286	3.157811051	2.332671249

The column min is the minimum number of subjects per record for a given Hub. Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. The column max shows the highest number of subjects per record. Two Hubs, The Portal to Texas History and North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. The column count is the number of records that each Hub had when the analysis was performed. The column sum is the total number of subject values for a given Hub. Note that this is not the number of unique subjects; that information is not present in this dataset. The column sumOfSquares is the sum of the squared subject counts, which is what the standard deviation is calculated from. The column mean shows the average number of subjects per record for each Hub, and stddev is the standard deviation around that mean. The Portal to Texas History is at the top end of the average with 11.01 subjects per record, and the Kentucky Digital Library is on the low end with 0.20 subjects per record.
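The mean and stddev columns can be recomputed from count, sum, and sumOfSquares alone. The sketch below uses the sample (n - 1) form of the standard deviation, which appears to reproduce the values in the table; that Solr uses exactly this form is an inference from the numbers, not something stated in this post.

```python
import math


def mean_and_stddev(count, total, sum_of_squares):
    """Recompute the mean and sample standard deviation from the
    aggregate columns count, sum, and sumOfSquares."""
    mean = total / count
    # Sample variance written in terms of the aggregates:
    # (n * sum(x^2) - (sum(x))^2) / (n * (n - 1))
    variance = (count * sum_of_squares - total ** 2) / (count * (count - 1))
    return mean, math.sqrt(variance)


# ARTstor row from the table above: count, sum, sumOfSquares.
mean, stddev = mean_and_stddev(56_342, 194_948, 1_351_826)
print(f"mean={mean:.4f} stddev={stddev:.4f}")
```

Because count, sum, and sumOfSquares are all simple totals, the same function can be applied to any group of Hubs by adding their three columns together first.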

The final question was whether there were differences between the Service-Hubs and the Content-Hubs. That breakdown is in the table below.

Hub Type	min	max	count	sum	sumOfSquares	mean	stddev
Content-Hub	0	548	5,736,178	13,207,489	84723999	2.302489393	3.077118385
Service-Hub	0	1,476	2,276,176	10,771,995	109293849	4.73249652	5.061612337

It appears that Service-Hubs have a higher number of subjects per record than Content-Hubs: more than 2x, with 4.73 for Service-Hubs versus 2.30 for Content-Hubs.

Another interesting number: 1,590,456 records contributed by Content-Hubs (28% of that collection) do not have subjects, compared to 236,811 records contributed by Service-Hubs (10% of that collection).
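The percentages come from dividing the number of records without subjects by the count column for each Hub type in the table above. A short sketch of that arithmetic (the function name is just for illustration):

```python
def percent_without_subjects(records_without, total_records):
    """Share of records, as a percentage, that have no subjects."""
    return 100 * records_without / total_records


# Numerators from the text, denominators from the count column of the Hub Type table.
content = percent_without_subjects(1_590_456, 5_736_178)
service = percent_without_subjects(236_811, 2_276_176)
print(f"Content-Hubs: {content:.1f}%  Service-Hubs: {service:.1f}%")
```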

I think individually we can come up with reasons that these numbers differ the way they do. Where did the records come from? Were they generated as digital resource metadata records initially, or using an existing set of practices such as AACR2 in the MARC format? How does that change the numbers? Are there things that the DPLA is doing to the subjects when they normalize them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, and are thus inflating our numbers a bit in the tables above. I’d be interested to chat with anyone interested in this topic who has some “here’s why” explanations for the numbers above.

Hit me up on Twitter if you want to chat about this.

mark e. phillips journal

DPLA Metadata Analysis: Part 1 – Basic stats on subjects
