On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.
I was interested in finding out the following pieces of information.
- What is the average number and standard deviation of subjects-per-record in the DPLA?
- How does this number compare across the partners?
- Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?
Building the Dataset
The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.
With a little work I was able to parse all of the metadata records and extract some information I was interested in working with, specifically the subjects for records.
After parsing through the records to get a list of subjects per record, along with the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could use for a few different projects.
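A minimal Python sketch of that extraction step might look like the following. It assumes each record has already been parsed into a dictionary, that subjects sit under `sourceResource.subject` as objects with a `name` key, and that the hub name sits under `provider.name`. Those paths, and the output field names `hub`, `subject`, and `subject_count`, are assumptions made for illustration, not necessarily what my own code used.

```python
def subjects_for(record):
    """Return the subject strings found in one parsed record (assumed layout)."""
    source = record.get("sourceResource") or {}
    subjects = source.get("subject") or []
    # A lone subject may show up as a bare dict or string instead of a list.
    if isinstance(subjects, (dict, str)):
        subjects = [subjects]
    names = []
    for subject in subjects:
        name = subject.get("name") if isinstance(subject, dict) else subject
        if name:
            names.append(name)
    return names


def to_solr_doc(record):
    """Build a small document holding the hub name and the subject count."""
    names = subjects_for(record)
    provider = record.get("provider") or {}
    return {
        "id": record.get("id"),
        "hub": provider.get("name"),          # hypothetical field name
        "subject": names,                     # hypothetical field name
        "subject_count": len(names),          # hypothetical field name
    }


# A made-up record, only to show the shape the sketch expects.
example = {
    "id": "example-1",
    "provider": {"name": "Example Hub"},
    "sourceResource": {"subject": [{"name": "Maps"}, {"name": "Texas"}]},
}
doc = to_solr_doc(example)
```

Storing the count as its own numeric field is what makes the later statistics cheap, since the index can then summarize that field directly.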
Loading the records into the Solr index took quite a while (loading ~1,000 documents per second into Solr).
So after a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr’s built-in StatsComponent functionality. For a description of this component, see the documentation on Solr’s documentation site.
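For anyone who has not used the StatsComponent, here is a rough sketch of how such a request could be put together. The `stats`, `stats.field`, and `stats.facet` parameters are Solr's own (though `stats.facet` may be deprecated or unavailable depending on the Solr version); the core URL and the field names `subject_count` and `hub` are hypothetical placeholders.

```python
from urllib.parse import urlencode


def stats_query_url(base_url, field, facet_field=None):
    """Build a select URL that asks Solr for statistics on one numeric field."""
    params = [
        ("q", "*:*"),           # match every document
        ("rows", 0),            # only the statistics are needed, not the documents
        ("wt", "json"),
        ("stats", "true"),
        ("stats.field", field),
    ]
    if facet_field:
        # Break the same statistics down by the values of another field.
        params.append(("stats.facet", facet_field))
    return base_url + "?" + urlencode(params)


# Hypothetical local core and field names.
url = stats_query_url(
    "http://localhost:8983/solr/dpla/select", "subject_count", "hub"
)
```

The response to a request like this carries min, max, count, sum, sumOfSquares, mean, and stddev for the field, which are the same columns that appear in the tables in this post.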
Answering the Questions
The average number of subjects per record in the DPLA = 2.99 with a standard deviation of 3.90. There are records with 0 subjects (1,827,276) and records with as many as 1,476 subjects (this record btw).
Answering question number two involved a small script to create a table for us. You will find that table below.
Hub Name | min | max | count | sum | sumOfSquares | mean | stddev |
ARTstor | 0 | 71 | 56,342 | 194,948 | 1351826 | 3.460083064 | 3.467168662 |
Biodiversity Heritage Library | 0 | 118 | 138,288 | 454,624 | 3100134 | 3.287515909 | 3.407385646 |
David Rumsey | 0 | 4 | 48,132 | 22,976 | 33822 | 0.477353943 | 0.689083212 |
Digital Commonwealth | 0 | 199 | 124,804 | 295,778 | 1767426 | 2.369940066 | 2.923194479 |
Digital Library of Georgia | 0 | 161 | 259,640 | 1,151,369 | 8621935 | 4.43448236 | 3.680038874 |
Harvard Library | 0 | 17 | 10,568 | 26,641 | 88155 | 2.520912188 | 1.409567895 |
HathiTrust | 0 | 92 | 1,915,159 | 2,614,199 | 6951217 | 1.365003637 | 1.329038361 |
Internet Archive | 0 | 68 | 208,953 | 385,732 | 1520200 | 1.84602279 | 1.966605872 |
J. Paul Getty Trust | 0 | 36 | 92,681 | 32,999 | 146491 | 0.356049244 | 1.20575216 |
Kentucky Digital Library | 0 | 13 | 127,755 | 26,009 | 82269 | 0.203584987 | 0.776219692 |
Minnesota Digital Library | 1 | 78 | 40,533 | 202,484 | 1298712 | 4.995534503 | 2.661891328 |
Missouri Hub | 0 | 139 | 41,557 | 97,115 | 606761 | 2.336910749 | 3.023203782 |
Mountain West Digital Library | 0 | 129 | 867,538 | 2,641,065 | 17734515 | 3.044321978 | 3.34282307 |
National Archives and Records Administration | 0 | 103 | 700,952 | 231,513 | 1143343 | 0.330283671 | 1.233711342 |
North Carolina Digital Heritage Center | 0 | 1,476 | 260,709 | 869,203 | 8394791 | 3.333996908 | 4.591774892 |
Smithsonian Institution | 0 | 548 | 897,196 | 5,763,459 | 56446687 | 6.423857217 | 4.652809633 |
South Carolina Digital Library | 0 | 40 | 76,001 | 231,270 | 1125030 | 3.042986277 | 2.354387181 |
The New York Public Library | 0 | 31 | 1,169,576 | 1,996,483 | 6585169 | 1.707014337 | 1.648179106 |
The Portal to Texas History | 0 | 1,035 | 477,639 | 5,257,702 | 69662410 | 11.00768991 | 4.96771802 |
United States Government Printing Office (GPO) | 0 | 30 | 148,715 | 457,097 | 1860297 | 3.073644219 | 1.749820977 |
University of Illinois at Urbana-Champaign | 0 | 22 | 18,103 | 67,955 | 404383 | 3.753797713 | 2.871821391 |
University of Southern California. Libraries | 0 | 119 | 301,325 | 863,535 | 4626989 | 2.865792749 | 2.672589058 |
University of Virginia Library | 0 | 15 | 30,188 | 95,328 | 465286 | 3.157811051 | 2.332671249 |
The column min is the minimum number of subjects per record for a given Hub. Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. The column max shows the highest number of subjects per record. Two Hubs, The Portal to Texas History and North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. The column count is the number of records that each Hub had when the analysis was performed. The column sum is the total number of subject values for a given Hub. Note that this is not the number of unique subjects; that information is not present in this dataset. The column sumOfSquares is the sum of the squared per-record subject counts, which Solr reports alongside the other statistics. The column mean shows the average number of subjects per record for each Hub, and stddev is the standard deviation around that mean. The Portal to Texas History is at the top end of the average with 11.01 subjects per record, and the Kentucky Digital Library is on the low end with 0.20 subjects per record.
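The mean and stddev columns can be recomputed from count, sum, and sumOfSquares alone. The sketch below assumes the sample (n - 1) form of the standard deviation, which is my assumption about how the figure is derived; it does reproduce the Harvard Library row to the precision checked here.

```python
import math


def mean_and_stddev(count, total, sum_of_squares):
    """Derive mean and sample standard deviation from three running totals."""
    if count < 2:
        raise ValueError("need at least two records for a sample stddev")
    mean = total / count
    # Sample variance: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))
    variance = (count * sum_of_squares - total ** 2) / (count * (count - 1))
    return mean, math.sqrt(variance)


# Harvard Library row: count=10,568, sum=26,641, sumOfSquares=88,155
mean, stddev = mean_and_stddev(10_568, 26_641, 88_155)
print(round(mean, 2), round(stddev, 2))  # prints 2.52 1.41
```

This also shows why the sumOfSquares column is worth keeping around: without it the standard deviation could not be rebuilt from the table.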
The final question was whether there were differences between the Service-Hubs and the Content-Hubs. That breakdown is in the table below.
Hub Type | min | max | count | sum | sumOfSquares | mean | stddev |
Content-Hub | 0 | 548 | 5,736,178 | 13,207,489 | 84723999 | 2.302489393 | 3.077118385 |
Service-Hub | 0 | 1,476 | 2,276,176 | 10,771,995 | 109293849 | 4.73249652 | 5.061612337 |
It appears that the Service-Hubs have a higher average number of subjects per record than the Content-Hubs: 4.73 for Service-Hubs versus 2.30 for Content-Hubs, a bit over 2x.
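Because count, sum, and sumOfSquares are all simple totals, the two hub-type rows can be added together to get back to statistics for the whole dataset. The following is a sketch under the same sample standard deviation assumption as before; the dictionary keys are just labels chosen for this example.

```python
import math

# The two rows from the hub-type table.
rows = [
    {"type": "Content-Hub", "count": 5_736_178, "sum": 13_207_489, "sum_sq": 84_723_999},
    {"type": "Service-Hub", "count": 2_276_176, "sum": 10_771_995, "sum_sq": 109_293_849},
]


def combine(rows):
    """Add the running totals of several groups and derive the pooled statistics."""
    count = sum(r["count"] for r in rows)
    total = sum(r["sum"] for r in rows)
    sum_sq = sum(r["sum_sq"] for r in rows)
    mean = total / count
    variance = (count * sum_sq - total ** 2) / (count * (count - 1))
    return count, mean, math.sqrt(variance)


count, mean, stddev = combine(rows)
ratio = (rows[1]["sum"] / rows[1]["count"]) / (rows[0]["sum"] / rows[0]["count"])
print(count, mean, stddev, ratio)
```

The pooled mean and standard deviation land close to the overall figures quoted earlier, and the ratio of the two means comes out slightly above 2.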
Another interesting number is how many records have no subjects at all: 1,590,456 of the records contributed by Content-Hubs (28% of that collection), compared to 236,811 of the records contributed by Service-Hubs (10%).
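Those percentages are just the subject-less record counts divided by the record counts from the hub-type table, as this quick check shows.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Records without subjects, divided by the count column from the hub-type table.
content_share = percent(1_590_456, 5_736_178)
service_share = percent(236_811, 2_276_176)
print(round(content_share), round(service_share))  # prints 28 10
```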
I think individually we can come up with reasons that these numbers differ the way they do. There are several questions behind all of this. Where did the records come from? Were they generated as digital resource metadata records initially, or using an existing set of practices such as AACR2 in the MARC format? How does that change the numbers? Are there things that the DPLA is doing to the subjects when they normalize them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, and are thus inflating our numbers a bit in the tables above. I’d be interested to chat with anyone interested in this topic who has some “here’s why” explanations for the numbers above.
Hit me up on Twitter if you want to chat about this.