DPLA Metadata Analysis: Part 1 – Basic stats on subjects

On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.

I was interested in finding out the following pieces of information.

  1. What is the average number of subjects per record in the DPLA, and what is the standard deviation?
  2. How does this number compare across the partners?
  3. Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?

Building the Dataset

The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.

With a little work I was able to parse all of the metadata records and extract some information I was interested in working with, specifically the subjects for records.

After parsing through the records to get a list of subjects per record, along with the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could use for a few different projects.
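As a rough illustration of that extraction step, here is a minimal Python sketch. It is not the script used for this post. It assumes each record is a JSON object with the subjects under `_source.sourceResource.subject` (each carrying a `name`) and the hub name under `_source.provider.name`. The helper names and the Solr field names (`hub`, `subject`, `subject_count`) are hypothetical.

```python
import json


def extract_hub_and_subjects(record):
    """Return (hub_name, [subject strings]) for one parsed record.

    The field paths used here are assumptions about the record layout.
    """
    source = record.get("_source", record)
    hub = (source.get("provider") or {}).get("name", "unknown")
    subjects = (source.get("sourceResource") or {}).get("subject") or []
    # A lone subject may show up as a single object rather than a list.
    if isinstance(subjects, (dict, str)):
        subjects = [subjects]
    names = []
    for subject in subjects:
        name = subject.get("name") if isinstance(subject, dict) else subject
        if name:
            names.append(name)
    return hub, names


def to_solr_doc(record_id, hub, names):
    """Build a small document to index; the field names are hypothetical."""
    return {
        "id": record_id,
        "hub": hub,
        "subject": names,
        "subject_count": len(names),
    }


# A made-up record, shaped like the assumption described above.
sample_line = json.dumps({
    "_id": "example-record-1",
    "_source": {
        "provider": {"name": "The Portal to Texas History"},
        "sourceResource": {
            "subject": [{"name": "Railroads"}, {"name": "Texas"}],
        },
    },
})

record = json.loads(sample_line)
hub, names = extract_hub_and_subjects(record)
doc = to_solr_doc(record["_id"], hub, names)
print(doc)
```

Storing the per-record subject count as its own numeric field is what makes the later statistics cheap to ask for, since Solr can then summarize that one field directly.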

Loading the records into the Solr index took quite a while, at roughly 1,000 documents per second.

So after a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr’s built-in StatsComponent functionality. For a description of it, see the StatsComponent page on Solr’s documentation site.
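For anyone who has not used the StatsComponent, a request for these numbers might be assembled roughly like this. This is a sketch under assumptions: the Solr URL, the core name `dpla`, and the field names `subject_count` and `hub` are all hypothetical. The `stats`, `stats.field` and `stats.facet` parameters are standard StatsComponent request parameters, though `stats.facet` has been deprecated in newer Solr releases.

```python
from urllib.parse import urlencode


def build_stats_url(base_url, field, facet_field=None):
    """Build a select URL that asks Solr for statistics on one numeric field."""
    params = [
        ("q", "*:*"),        # match every document
        ("rows", "0"),       # only the stats are needed, not the documents
        ("wt", "json"),
        ("stats", "true"),
        ("stats.field", field),
    ]
    if facet_field:
        # Ask for the same statistics broken down by another field.
        params.append(("stats.facet", facet_field))
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr core and field names.
url = build_stats_url(
    "http://localhost:8983/solr/dpla/select",
    "subject_count",
    facet_field="hub",
)
print(url)
```

Leaving out `facet_field` gives the overall numbers for the whole index, which is all the first question needs.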

Answering the questions

The average number of subjects per record in the DPLA is 2.99, with a standard deviation of 3.90. There are records with 0 subjects (1,827,276 of them) and records with as many as 1,476 subjects (that record comes from the North Carolina Digital Heritage Center).

Answering question number two involved a small script to create a table for us. You will find that table below.

| Hub Name | min | max | count | sum | sumOfSquares | mean | stddev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ARTstor | 0 | 71 | 56,342 | 194,948 | 1351826 | 3.460083064 | 3.467168662 |
| Biodiversity Heritage Library | 0 | 118 | 138,288 | 454,624 | 3100134 | 3.287515909 | 3.407385646 |
| David Rumsey | 0 | 4 | 48,132 | 22,976 | 33822 | 0.477353943 | 0.689083212 |
| Digital Commonwealth | 0 | 199 | 124,804 | 295,778 | 1767426 | 2.369940066 | 2.923194479 |
| Digital Library of Georgia | 0 | 161 | 259,640 | 1,151,369 | 8621935 | 4.43448236 | 3.680038874 |
| Harvard Library | 0 | 17 | 10,568 | 26,641 | 88155 | 2.520912188 | 1.409567895 |
| HathiTrust | 0 | 92 | 1,915,159 | 2,614,199 | 6951217 | 1.365003637 | 1.329038361 |
| Internet Archive | 0 | 68 | 208,953 | 385,732 | 1520200 | 1.84602279 | 1.966605872 |
| J. Paul Getty Trust | 0 | 36 | 92,681 | 32,999 | 146491 | 0.356049244 | 1.20575216 |
| Kentucky Digital Library | 0 | 13 | 127,755 | 26,009 | 82269 | 0.203584987 | 0.776219692 |
| Minnesota Digital Library | 1 | 78 | 40,533 | 202,484 | 1298712 | 4.995534503 | 2.661891328 |
| Missouri Hub | 0 | 139 | 41,557 | 97,115 | 606761 | 2.336910749 | 3.023203782 |
| Mountain West Digital Library | 0 | 129 | 867,538 | 2,641,065 | 17734515 | 3.044321978 | 3.34282307 |
| National Archives and Records Administration | 0 | 103 | 700,952 | 231,513 | 1143343 | 0.330283671 | 1.233711342 |
| North Carolina Digital Heritage Center | 0 | 1,476 | 260,709 | 869,203 | 8394791 | 3.333996908 | 4.591774892 |
| Smithsonian Institution | 0 | 548 | 897,196 | 5,763,459 | 56446687 | 6.423857217 | 4.652809633 |
| South Carolina Digital Library | 0 | 40 | 76,001 | 231,270 | 1125030 | 3.042986277 | 2.354387181 |
| The New York Public Library | 0 | 31 | 1,169,576 | 1,996,483 | 6585169 | 1.707014337 | 1.648179106 |
| The Portal to Texas History | 0 | 1,035 | 477,639 | 5,257,702 | 69662410 | 11.00768991 | 4.96771802 |
| United States Government Printing Office (GPO) | 0 | 30 | 148,715 | 457,097 | 1860297 | 3.073644219 | 1.749820977 |
| University of Illinois at Urbana-Champaign | 0 | 22 | 18,103 | 67,955 | 404383 | 3.753797713 | 2.871821391 |
| University of Southern California. Libraries | 0 | 119 | 301,325 | 863,535 | 4626989 | 2.865792749 | 2.672589058 |
| University of Virginia Library | 0 | 15 | 30,188 | 95,328 | 465286 | 3.157811051 | 2.332671249 |

The column min is the minimum number of subjects per record for a given Hub. Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. The column max shows the highest number of subjects on a single record. Two Hubs, The Portal to Texas History and the North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. The column count is the number of records that each Hub had when the analysis was performed. The column sum is the total number of subject values for a given Hub. Note that this is not the number of unique subjects; that information is not present in this dataset. The column sumOfSquares is the sum of the squared per-record subject counts, which is what the standard deviation is calculated from. The column mean shows the average number of subjects per record for each Hub, and stddev is the standard deviation around that mean. The Portal to Texas History is at the top end with 11.01 subjects per record, and the Kentucky Digital Library is at the low end with 0.20 subjects per record.
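To make the relationship between these columns concrete, here is a small Python sketch that recomputes mean and stddev from count, sum and sumOfSquares. The function name is made up for illustration, and the use of the sample (n - 1) form of the standard deviation is an assumption on my part, though it does reproduce the ARTstor row from the table.

```python
import math


def mean_and_stddev(count, total, sum_of_squares):
    """Recompute mean and sample standard deviation from running totals."""
    mean = total / count
    # Sample variance (n - 1 in the denominator), written so that the
    # large integer products are computed exactly before dividing.
    variance = (count * sum_of_squares - total * total) / (count * (count - 1))
    return mean, math.sqrt(variance)


# ARTstor row from the table: count, sum, sumOfSquares.
mean, stddev = mean_and_stddev(56342, 194948, 1351826)
print(round(mean, 6), round(stddev, 6))
```

Because only three running totals are needed, these statistics can be computed in a single pass over the records without holding every per-record count in memory.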

The final question was whether there were differences between the Service-Hubs and the Content-Hubs. That breakdown is in the table below.

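| Hub Type | min | max | count | sum | sumOfSquares | mean | stddev |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Content-Hub | 0 | 548 | 5,736,178 | 13,207,489 | 84723999 | 2.302489393 | 3.077118385 |
| Service-Hub | 0 | 1,476 | 2,276,176 | 10,771,995 | 109293849 | 4.73249652 | 5.061612337 |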

It appears that Service-Hubs have a higher number of subjects per record than Content-Hubs: just over 2x, with a mean of 4.73 for Service-Hubs and 2.30 for Content-Hubs.

Another interesting number is that 1,590,456 of the records contributed by Content-Hubs (28% of that collection) do not have subjects, compared to 236,811 of the records contributed by Service-Hubs (10% of that collection).
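The hub-type numbers can be cross-checked with a little arithmetic. The Python sketch below is only an illustration, with made-up variable names and, as an assumption, the sample (n - 1) form of the standard deviation. It adds the two hub-type rows together to get back to the overall mean and standard deviation, then computes the ratio of the means and the share of records without subjects.

```python
import math

# Values copied from the hub-type table and the paragraph above.
HUB_TYPES = {
    "Content-Hub": {"count": 5736178, "sum": 13207489,
                    "sumOfSquares": 84723999, "no_subjects": 1590456},
    "Service-Hub": {"count": 2276176, "sum": 10771995,
                    "sumOfSquares": 109293849, "no_subjects": 236811},
}

# Counts, sums and sums of squares can simply be added across groups.
count = sum(row["count"] for row in HUB_TYPES.values())
total = sum(row["sum"] for row in HUB_TYPES.values())
sum_sq = sum(row["sumOfSquares"] for row in HUB_TYPES.values())

overall_mean = total / count
overall_stddev = math.sqrt((count * sum_sq - total * total) / (count * (count - 1)))

# How much higher the Service-Hub mean is than the Content-Hub mean.
means = {name: row["sum"] / row["count"] for name, row in HUB_TYPES.items()}
ratio = means["Service-Hub"] / means["Content-Hub"]

# Share of each hub type's records that have no subjects at all.
shares = {name: row["no_subjects"] / row["count"] for name, row in HUB_TYPES.items()}

print(round(overall_mean, 2), round(ratio, 2))
for name, share in shares.items():
    print(name, f"{share:.1%}")
```

The combined mean and standard deviation land close to the 2.99 and 3.90 reported for the DPLA as a whole, and the shares round to the 28% and 10% quoted above.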

I think individually we can come up with reasons that these numbers differ the way they do. Where did the records come from? Were they generated as digital resource metadata records initially, or were they created using an existing set of practices such as AACR2 in the MARC format? How does that change the numbers? Are there things that the DPLA is doing to the subjects when they normalize them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, and this inflates our numbers a bit in the tables above. I’d be interested to chat with anyone interested in this topic who has some “here’s why” explanations for the numbers above.

Hit me up on Twitter if you want to chat about this.