More stats for subjects
In my previous post I displayed some of the statistics that are readily available from Solr as part of its StatsComponent functionality (if you haven’t used this part of Solr yet you really should). There are a few other things that we could collect to get a more complete picture of a metadata field.
So far we have min, max, number of records, total number of subjects, sumofsquares, mean, and standard deviation. The other values I think we should take a look at are the following.
Records without Subjects – Number of records without subjects.
Percent of records without subjects – Percentage of the Hub’s records that don’t have subjects.
Mode – Number of subjects-per-record that is the most common for a specific Hub.
Unique Subjects – Number of unique subject strings present for a specific Hub.
Hub Unique Subjects – Number of subjects that are unique to that Hub.
Entropy of the field – This calculation is a measure of the uncertainty in the metadata field; for our purposes it is a good way to understand how subjects are distributed across the records.
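The counting-based values in the list above could be computed along these lines. This is a minimal sketch, not necessarily how the numbers in this post were produced: it assumes each Hub's records are already loaded as a list of subject lists, and the function name `subject_stats` and its input shape are hypothetical.

```python
from collections import Counter


def subject_stats(records):
    """Summarize the subject field for one Hub.

    `records` is assumed to be a list with one entry per record,
    each entry being a list of subject strings (possibly empty).
    """
    counts = [len(subjects) for subjects in records]
    total = len(records)
    without = sum(1 for c in counts if c == 0)

    # Mode: the most common subjects-per-record value.
    # Ties are broken in favor of the smaller value (an arbitrary choice).
    freq = Counter(counts)
    mode = min(freq, key=lambda c: (-freq[c], c)) if freq else None

    # Unique subject strings used anywhere in this Hub's records.
    unique = {s for subjects in records for s in subjects}

    return {
        "records": total,
        "without_subjects": without,
        "pct_without_subjects": 100 * without / total if total else 0.0,
        "avg_subjects": sum(counts) / total if total else 0.0,
        "mode": mode,
        "unique_subjects": len(unique),
    }


# Tiny made-up example: four records, one of them with no subjects.
example = [["Texas", "Maps"], [], ["Texas"], ["Texas", "Places"]]
stats = subject_stats(example)
```

Note that records with no subjects count toward the mode, which is how a Hub can end up with a mode of 0.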
Below is a table that contains the fields listed above, plus some relevant fields from the previous post. Each Hub has a row in this table.
Hub Name | Records | Records Without Subjects | % Without Subjects | Avg. Subjects per Record | Subject Count Mode | Unique Subjects | # of Subjects Unique to Hub | Entropy |
ARTstor | 56,342 | 6,586 | 11.7 | 3.5 | 3 | 9,560 | 4,941 | 0.73 |
Biodiversity Heritage Library | 138,288 | 10,326 | 7.5 | 3.3 | 2 | 22,004 | 9,136 | 0.65 |
David Rumsey | 48,132 | 30,167 | 62.7 | 0.5 | 0 | 123 | 30 | 0.76 |
Digital Commonwealth | 124,804 | 6,040 | 4.8 | 2.4 | 1 | 41,704 | 31,094 | 0.77 |
Digital Library of Georgia | 259,640 | 3,216 | 1.2 | 4.4 | 2 | 132,160 | 114,689 | 0.67 |
Harvard Library | 10,568 | 167 | 1.6 | 2.5 | 2 | 9,257 | 7,204 | 0.76 |
HathiTrust | 1,915,159 | 525,874 | 27.5 | 1.4 | 1 | 685,733 | 570,292 | 0.88 |
Internet Archive | 208,953 | 44,872 | 21.5 | 1.8 | 1 | 56,911 | 28,978 | 0.80 |
J. Paul Getty Trust | 92,681 | 73,978 | 79.8 | 0.4 | 0 | 2,777 | 1,852 | 0.60 |
Kentucky Digital Library | 127,755 | 117,790 | 92.2 | 0.2 | 0 | 1,972 | 1,337 | 0.62 |
Minnesota Digital Library | 40,533 | 0 | 0 | 5 | 4 | 24,472 | 17,545 | 0.74 |
Missouri Hub | 41,557 | 11,451 | 27.6 | 2.3 | 0 | 6,893 | 4,338 | 0.69 |
Mountain West Digital Library | 867,538 | 49,473 | 5.7 | 3 | 1 | 227,755 | 192,501 | 0.68 |
National Archives and Records Administration | 700,952 | 619,212 | 88.3 | 0.3 | 0 | 7,086 | 3,589 | 0.63 |
North Carolina Digital Heritage Center | 260,709 | 41,323 | 15.9 | 3.3 | 2 | 99,258 | 84,203 | 0.66 |
Smithsonian Institution | 897,196 | 29,452 | 3.3 | 6.4 | 7 | 348,302 | 325,878 | 0.62 |
South Carolina Digital Library | 76,001 | 7,460 | 9.8 | 3 | 2 | 23,842 | 18,110 | 0.72 |
The New York Public Library | 1,169,576 | 208,472 | 17.8 | 1.7 | 1 | 69,210 | 52,002 | 0.62 |
The Portal to Texas History | 477,639 | 58 | 0 | 11 | 10 | 104,566 | 87,076 | 0.49 |
United States Government Printing Office (GPO) | 148,715 | 1,794 | 1.2 | 3.1 | 2 | 174,067 | 105,389 | 0.92 |
University of Illinois at Urbana-Champaign | 18,103 | 4,221 | 23.3 | 3.8 | 0 | 6,183 | 3,076 | 0.63 |
University of Southern California. Libraries | 301,325 | 35,106 | 11.7 | 2.9 | 2 | 65,958 | 51,822 | 0.59 |
University of Virginia Library | 30,188 | 229 | 0.8 | 3.2 | 1 | 3,736 | 2,425 | 0.60 |
Looking at the row for The Portal to Texas History, we can see that of the 477,639 records in the dataset, 58 do not have any subjects, which is a very small percentage (0.01214306202% to be exact). From there we can go to the average of 11 subjects per record with a mode of 10; nothing earth-shaking here, just more info. There are 104,566 unique subjects in the Portal’s dataset, with 87,076 of those being unique to the Portal. Finally, the entropy for the Portal’s subject field is 0.49. Compared to GPO’s, which is 0.92, you can interpret this to mean that the subject values are more “clumpy” for the Portal (a smaller number of subjects are used across a larger number of records) than for GPO (a larger number of subjects are spread across its records).
The following two tables further illustrate the entropy values for the Portal’s and GPO’s subjects. The first table shows the top ten subjects, and the number of records with those subjects, from GPO’s dataset.
National security–United States | 1,138 |
United States. Congress. House–Rules and practice | 748 |
Terrorism–United States–Prevention | 718 |
United States. Department of Defense–Appropriations and expenditures | 631 |
United States | 536 |
Social security–United States–Periodicals | 487 |
Emergency management–United States | 485 |
Medicare | 441 |
Consumer protection–United States | 417 |
Wisconsin–Maps | 406 |
Now take a look at the top ten subjects and their counts for the Portal.
Places | 310,404 |
United States | 306,597 |
Texas | 305,551 |
Business, Economics and Finance | 248,455 |
Communications | 223,783 |
Newspapers | 221,422 |
Advertising | 218,527 |
Journalism | 217,737 |
Landscape and Nature | 76,308 |
Geography and Maps | 70,742 |
So with the entropy value, you can read a lower number as being more like the Portal’s subjects and a higher number as being more like GPO’s. At the extremes, a value of 1.0 would mean that every subject is used by only one record, and a value of 0 would mean that there is only one subject, with all of the records using it.
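To make the two extremes concrete, here is a small sketch of one common way to get an entropy value between 0 and 1. The exact formula behind the numbers in this post is not stated, so this is an assumption: Shannon entropy of the subject usage counts, divided by the log of the number of unique subjects. The function name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(subject_counts):
    """Shannon entropy of subject usage, scaled to the range 0..1.

    `subject_counts` is assumed to hold one usage count per unique
    subject (how many records use that subject).
    """
    counts = [c for c in subject_counts if c > 0]
    n = len(counts)
    # With zero or one subject there is no uncertainty at all;
    # log(1) is 0, so return 0 rather than dividing by zero.
    if n <= 1:
        return 0.0
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts)
    return h / math.log(n)


# Every subject used by exactly one record: the "spread out" extreme.
spread = normalized_entropy([1, 1, 1, 1])

# One subject dominating the field: the "clumpy" case.
clumpy = normalized_entropy([97, 1, 1, 1])

# A single subject used by all records: the other extreme.
single = normalized_entropy([500])
```

Under this assumed formula, `spread` comes out at 1.0 (up to floating point), `single` is 0, and `clumpy` lands in between, closer to the Portal’s end of the scale than to GPO’s.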
Shared Subjects
In creating the table above I had to work out the number of subjects that are unique to each Hub. In doing so I went ahead and calculated this number for the whole dataset to find out how much subject overlap occurs.
The table below displays the breakdown of how subjects are distributed across Hub collections. For example, if two Hubs have the subject “Laws of Texas”, then it is said to be shared by two Hubs. The breakdown for the metadata in the DPLA is as follows.
# of Hubs with subject | Count |
1 | 1,717,512 |
2 | 114,047 |
3 | 21,126 |
4 | 8,013 |
5 | 3,905 |
6 | 2,187 |
7 | 1,330 |
8 | 970 |
9 | 689 |
10 | 494 |
11 | 405 |
12 | 302 |
13 | 245 |
14 | 199 |
15 | 152 |
16 | 117 |
17 | 63 |
18 | 62 |
19 | 32 |
20 | 20 |
21 | 7 |
22 | 7 |
Most of the subjects, 1,717,512 to be exact, occur in only one Hub’s collection.
There are seven different subjects that are common across 22 of the 23 Hubs in the DPLA metadata dataset. If you are curious, these subjects are the following:
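A breakdown like the one above, along with the per-Hub "unique to Hub" counts from the earlier table, can be derived from a single pass over each Hub's set of unique subjects. The sketch below assumes the data is available as a dictionary from Hub name to a set of subject strings; the function name `hub_sharing` and the sample data are made up for illustration.

```python
from collections import Counter


def hub_sharing(hub_subjects):
    """Count how many Hubs use each subject.

    `hub_subjects` is assumed to map a Hub name to the set of unique
    subject strings found in that Hub's records.
    """
    # subject -> number of Hubs in which it appears
    hubs_per_subject = Counter()
    for subjects in hub_subjects.values():
        hubs_per_subject.update(set(subjects))

    # number of Hubs -> number of subjects shared by exactly that many Hubs
    breakdown = Counter(hubs_per_subject.values())

    # Hub name -> number of its subjects that appear in no other Hub
    unique_to_hub = {
        hub: sum(1 for s in set(subjects) if hubs_per_subject[s] == 1)
        for hub, subjects in hub_subjects.items()
    }
    return dict(breakdown), unique_to_hub


# Made-up sample with three Hubs.
sample = {
    "Hub A": {"Texas", "Maps"},
    "Hub B": {"Texas", "Laws of Texas"},
    "Hub C": {"Texas", "Maps", "Birds"},
}
breakdown, unique_to_hub = hub_sharing(sample)
```

In this sample, "Texas" is shared by three Hubs, "Maps" by two, and "Laws of Texas" and "Birds" each occur in only one. The comparison here is on exact string matches, so any normalization of subject strings (case, punctuation, whitespace) would change the counts.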
There should be one final post in this series where I can hopefully suggest what we should do with this data.
Again, if you want to chat about this post, hit me up on Twitter.