DPLA Metadata Analysis: Part 2 – Beyond basic stats

More stats for subjects

In my previous post I displayed some of the statistics that are readily available from Solr as part of its StatsComponent functionality (if you haven’t used this part of Solr yet you really should). There are a few other things that we could collect to get a more complete picture of a metadata field.
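For anyone who hasn't used the StatsComponent yet, here is a minimal sketch of what a request looks like. The `stats` and `stats.field` parameters are standard Solr, but the core URL and the `subject_count` field name are assumptions made up for illustration, not the setup used for these posts.

```python
from urllib.parse import urlencode


def build_stats_url(base_url, field):
    """Build a Solr select URL that asks the StatsComponent for stats on one field."""
    params = {
        "q": "*:*",            # match every record
        "rows": 0,             # we only want the stats, not the documents
        "wt": "json",
        "stats": "true",       # turn on the StatsComponent
        "stats.field": field,  # numeric field to summarize
    }
    return base_url + "?" + urlencode(params)


# Hypothetical core and field name, for illustration only.
url = build_stats_url("http://localhost:8983/solr/dpla/select", "subject_count")
print(url)
```

The response to a request like this carries the min, max, count, sum, sumOfSquares, mean, and stddev values for the field.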

So far we have min, max, number of records, total number of subjects, sumofsquares, mean, and standard deviation. The other values I think we should take a look at are the following.

Records without Subjects – Number of records without subjects.

Percent of records without subjects – Percentage of the Hub’s records that don’t have subjects.

Mode – Number of subjects-per-record that is the most common for a specific Hub.

Unique Subjects – Number of unique subject strings present for a specific Hub.

Hub Unique Subjects – Number of subjects that are unique to that Hub.

Entropy – A measure of the uncertainty in the metadata field. For our purposes it is a good way to understand how subjects are distributed across the records.
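Most of these values are simple to compute once you have the subjects for each record. Here is a minimal sketch, assuming each Hub's records are available as a list of subject-string lists. The function name and data layout are hypothetical, and this is not necessarily how the numbers in this post were produced.

```python
from collections import Counter


def subject_stats(records):
    """Summarize subject usage for one Hub.

    `records` is a list where each item is the list of subject strings
    on a single record (an empty list means the record has no subjects).
    """
    counts = [len(subjects) for subjects in records]
    total = len(records)
    without = sum(1 for c in counts if c == 0)
    # Most common subjects-per-record value; None if there are no records.
    mode = Counter(counts).most_common(1)[0][0] if counts else None
    unique = {s for subjects in records for s in subjects}
    return {
        "records": total,
        "without_subjects": without,
        "pct_without": 100.0 * without / total if total else 0.0,
        "mode": mode,
        "unique_subjects": len(unique),
    }


# Tiny made-up example: four records, one of them without subjects.
example = [["Texas", "Maps"], [], ["Texas"], ["Newspapers", "Advertising"]]
print(subject_stats(example))
```

Subjects unique to a Hub and entropy need a little more work, since the first depends on every other Hub's subjects and the second on the full frequency distribution.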

Below is a table that contains the fields listed above, plus some relevant fields from the previous post. Each Hub has a row in this table.

| Hub Name | Records | Records without Subjects | % without Subjects | Avg. Subjects per Record | Subject Count Mode | Unique Subjects | Subjects Unique to Hub | Entropy |
|---|---|---|---|---|---|---|---|---|
| ARTstor | 56,342 | 6,586 | 11.7 | 3.5 | 3 | 9,560 | 4,941 | 0.73 |
| Biodiversity Heritage Library | 138,288 | 10,326 | 7.5 | 3.3 | 2 | 22,004 | 9,136 | 0.65 |
| David Rumsey | 48,132 | 30,167 | 62.7 | 0.5 | 0 | 123 | 30 | 0.76 |
| Digital Commonwealth | 124,804 | 6,040 | 4.8 | 2.4 | 1 | 41,704 | 31,094 | 0.77 |
| Digital Library of Georgia | 259,640 | 3,216 | 1.2 | 4.4 | 2 | 132,160 | 114,689 | 0.67 |
| Harvard Library | 10,568 | 167 | 1.6 | 2.5 | 2 | 9,257 | 7,204 | 0.76 |
| HathiTrust | 1,915,159 | 525,874 | 27.5 | 1.4 | 1 | 685,733 | 570,292 | 0.88 |
| Internet Archive | 208,953 | 44,872 | 21.5 | 1.8 | 1 | 56,911 | 28,978 | 0.80 |
| J. Paul Getty Trust | 92,681 | 73,978 | 79.8 | 0.4 | 0 | 2,777 | 1,852 | 0.60 |
| Kentucky Digital Library | 127,755 | 117,790 | 92.2 | 0.2 | 0 | 1,972 | 1,337 | 0.62 |
| Minnesota Digital Library | 40,533 | 0 | 0 | 5 | 4 | 24,472 | 17,545 | 0.74 |
| Missouri Hub | 41,557 | 11,451 | 27.6 | 2.3 | 0 | 6,893 | 4,338 | 0.69 |
| Mountain West Digital Library | 867,538 | 49,473 | 5.7 | 3 | 1 | 227,755 | 192,501 | 0.68 |
| National Archives and Records Administration | 700,952 | 619,212 | 88.3 | 0.3 | 0 | 7,086 | 3,589 | 0.63 |
| North Carolina Digital Heritage Center | 260,709 | 41,323 | 15.9 | 3.3 | 2 | 99,258 | 84,203 | 0.66 |
| Smithsonian Institution | 897,196 | 29,452 | 3.3 | 6.4 | 7 | 348,302 | 325,878 | 0.62 |
| South Carolina Digital Library | 76,001 | 7,460 | 9.8 | 3 | 2 | 23,842 | 18,110 | 0.72 |
| The New York Public Library | 1,169,576 | 208,472 | 17.8 | 1.7 | 1 | 69,210 | 52,002 | 0.62 |
| The Portal to Texas History | 477,639 | 58 | 0 | 11 | 10 | 104,566 | 87,076 | 0.49 |
| United States Government Printing Office (GPO) | 148,715 | 1,794 | 1.2 | 3.1 | 2 | 174,067 | 105,389 | 0.92 |
| University of Illinois at Urbana-Champaign | 18,103 | 4,221 | 23.3 | 3.8 | 0 | 6,183 | 3,076 | 0.63 |
| University of Southern California. Libraries | 301,325 | 35,106 | 11.7 | 2.9 | 2 | 65,958 | 51,822 | 0.59 |
| University of Virginia Library | 30,188 | 229 | 0.8 | 3.2 | 1 | 3,736 | 2,425 | 0.60 |

Looking at the row for The Portal to Texas History, we can see that of the 477,639 records in the dataset, 58 do not have any subjects, which is a very small percentage (0.01214306202% to be exact, shown as 0 in the table). From there we can go to the average of 11 subjects per record with a mode of 10; nothing earth-shaking here, just more info. There are 104,566 unique subjects in the Portal’s dataset, with 87,076 of those being unique to the Portal. Finally, the entropy for the Portal’s subject field is 0.49. Compared to GPO’s, which is 0.92, you can interpret this to mean that the subject values are more “clumpy” for the Portal (a smaller number of subjects are used across a larger number of records) than for GPO (a larger number of subjects are spread across the records).

The following two tables further illustrate the entropy values for the Portal’s and GPO’s subjects. The first table lists the top ten subjects, and the number of records with those subjects, from GPO’s dataset.

National security–United States 1,138
United States. Congress. House–Rules and practice 748
Terrorism–United States–Prevention 718
United States. Department of Defense–Appropriations and expenditures 631
United States 536
Social security–United States–Periodicals 487
Emergency management–United States 485
Medicare 441
Consumer protection–United States 417
Wisconsin–Maps 406

Now take a look at the top ten subjects and their counts for the Portal.

Places 310,404
United States 306,597
Texas 305,551
Business, Economics and Finance 248,455
Communications 223,783
Newspapers 221,422
Advertising 218,527
Journalism 217,737
Landscape and Nature 76,308
Geography and Maps 70,742

So with the entropy value, you can read a lower number as being more like the Portal’s subjects and a higher number as being more like GPO’s. At the extremes, a value of 1.0 would mean that every subject is used by exactly one record, and a value of 0 would mean that there is only one subject, used by all of the records.
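To make those two extremes concrete, here is a minimal sketch of one way to get an entropy value between 0 and 1: take the Shannon entropy of the subject frequency distribution and divide it by the log of the number of unique subjects. That normalization is an assumption that matches the behavior described above; the exact formula behind the table isn't spelled out in this post, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(records):
    """Normalized Shannon entropy of subject usage across records.

    `records` is a list of subject-string lists, one per record.
    Returns a float between 0 and 1.
    """
    freq = Counter(s for subjects in records for s in subjects)
    total = sum(freq.values())
    # With zero or one distinct subject there is no uncertainty at all.
    if len(freq) < 2:
        return 0.0
    h = -sum((n / total) * math.log(n / total) for n in freq.values())
    # Dividing by the maximum possible entropy scales the result to [0, 1].
    return h / math.log(len(freq))


spread_out = [["a"], ["b"], ["c"], ["d"]]   # every subject on one record
clumpy = [["a"], ["a"], ["a"], ["b"]]       # one subject dominates
single = [["a"], ["a"], ["a"]]              # only one subject in use

print(normalized_entropy(spread_out))
print(normalized_entropy(clumpy))
print(normalized_entropy(single))
```

The spread-out example lands at the top of the scale, the single-subject example at the bottom, and the clumpy one in between.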

Shared Subjects

In creating the table above I had to work out the number of subjects that each Hub has uniquely. In doing so I went ahead and calculated this number for the whole dataset to find out how much subject overlap occurs.

The table below displays the breakdown of how subjects are distributed across Hub collections. For example, if two Hubs have the subject “Laws of Texas” then it is said to be shared by two Hubs. The breakdown for the metadata in the DPLA is as follows.

# of Hubs with subject Count
1 1,717,512
2 114,047
3 21,126
4 8,013
5 3,905
6 2,187
7 1,330
8 970
9 689
10 494
11 405
12 302
13 245
14 199
15 152
16 117
17 63
18 62
19 32
20 20
21 7
22 7

Most of the subjects, 1,717,512 to be exact, occur in only one Hub’s collection.
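The overlap breakdown can be sketched in a few lines. This is a minimal illustration, assuming you already have the set of unique subject strings for each Hub; the function name and the tiny sample data are made up, and subjects are compared as exact strings with no normalization.

```python
from collections import Counter


def hub_sharing(hub_subjects):
    """Count how many Hubs use each subject.

    `hub_subjects` maps a Hub name to the set of its unique subject strings.
    Returns (distribution, unique_to_hub), where `distribution` maps
    "number of Hubs with the subject" to "number of subjects", and
    `unique_to_hub` maps each Hub to its count of subjects found nowhere else.
    """
    hubs_per_subject = Counter()
    for subjects in hub_subjects.values():
        hubs_per_subject.update(set(subjects))  # each Hub counts once per subject
    distribution = Counter(hubs_per_subject.values())
    unique_to_hub = {
        hub: sum(1 for s in set(subjects) if hubs_per_subject[s] == 1)
        for hub, subjects in hub_subjects.items()
    }
    return distribution, unique_to_hub


# Made-up sample data for three Hubs.
sample = {
    "Hub A": {"Texas", "Maps"},
    "Hub B": {"Texas", "Laws of Texas"},
    "Hub C": {"Texas"},
}
distribution, unique_to_hub = hub_sharing(sample)
print(dict(distribution))
print(unique_to_hub)
```

In the sample, "Texas" is shared by three Hubs while "Maps" and "Laws of Texas" each belong to a single Hub, which is the same kind of count shown in the table above.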

There are seven different subjects that are common across 22 of the 23 Hubs in the DPLA metadata dataset. If you are curious, these subjects are the following:

There should be one final post in this series where I can hopefully suggest what we should do with this data.

Again, if you want to chat about this post, hit me up on Twitter.