DPLA Metadata Analysis: Part 2 – Beyond basic stats

More stats for subjects

In my previous post I displayed some of the statistics that are readily available from Solr as part of its StatsComponent functionality (if you haven’t used this part of Solr yet, you really should). There are a few other things that we could collect to get a more complete picture of a metadata field.

So far we have min, max, number of records, total number of subjects, sum of squares, mean, and standard deviation. The other values I think we should take a look at are the following.

Records without Subjects – Number of records without subjects.

Percent of records without subjects – Percentage of the Hub’s records that don’t have subjects.

Mode – Number of subjects-per-record that is the most common for a specific Hub.

Unique Subjects – Number of unique subject strings present for a specific Hub.

Hub Unique Subjects – Number of subjects that are unique to that Hub.

Entropy – A measure of the uncertainty in the metadata field. For our purposes it is a good way to understand how subjects are distributed across the records.
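Most of the per-Hub values above can be derived directly from each record's list of subjects. The following is a minimal Python sketch, not the code used to build the table in this post; the `subject_stats` function and the sample records are hypothetical, and it assumes each record has already been reduced to a list of subject strings.

```python
from collections import Counter
from statistics import mean


def subject_stats(records):
    """Summarize the subject field for one Hub.

    `records` is assumed to be a list of lists of subject strings,
    one inner list per record (an empty list means no subjects).
    """
    counts = [len(subjects) for subjects in records]
    without = sum(1 for c in counts if c == 0)
    unique = {s for subjects in records for s in subjects}
    return {
        "records": len(records),
        "records_without_subjects": without,
        "pct_without_subjects": 100 * without / len(records),
        "avg_subjects_per_record": mean(counts),
        # Most common number of subjects per record.
        "subject_count_mode": Counter(counts).most_common(1)[0][0],
        "unique_subjects": len(unique),
    }


# Hypothetical sample data: four records, one of them without subjects.
sample = [["Texas", "Newspapers"], [], ["Texas"], ["Maps", "Texas"]]
stats = subject_stats(sample)
print(stats)
```

Counting subjects that are unique to a Hub needs the subject sets of all Hubs at once, so it is left out of this per-Hub function.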

Below is a table that contains the fields listed above, plus some relevant fields from the previous post. Each Hub has a row in this table.

Hub Name	Records	Records Without Subjects	% Without Subjects	Avg. Subjects per Record	Subject Count Mode	Unique Subjects	# of Subjects Unique to Hub	Entropy
ARTstor	56,342	6,586	11.7	3.5	3	9,560	4,941	0.73
Biodiversity Heritage Library	138,288	10,326	7.5	3.3	2	22,004	9,136	0.65
David Rumsey	48,132	30,167	62.7	0.5	0	123	30	0.76
Digital Commonwealth	124,804	6,040	4.8	2.4	1	41,704	31,094	0.77
Digital Library of Georgia	259,640	3,216	1.2	4.4	2	132,160	114,689	0.67
Harvard Library	10,568	167	1.6	2.5	2	9,257	7,204	0.76
HathiTrust	1,915,159	525,874	27.5	1.4	1	685,733	570,292	0.88
Internet Archive	208,953	44,872	21.5	1.8	1	56,911	28,978	0.80
J. Paul Getty Trust	92,681	73,978	79.8	0.4	0	2,777	1,852	0.60
Kentucky Digital Library	127,755	117,790	92.2	0.2	0	1,972	1,337	0.62
Minnesota Digital Library	40,533	0	0	5	4	24,472	17,545	0.74
Missouri Hub	41,557	11,451	27.6	2.3	0	6,893	4,338	0.69
Mountain West Digital Library	867,538	49,473	5.7	3	1	227,755	192,501	0.68
National Archives and Records Administration	700,952	619,212	88.3	0.3	0	7,086	3,589	0.63
North Carolina Digital Heritage Center	260,709	41,323	15.9	3.3	2	99,258	84,203	0.66
Smithsonian Institution	897,196	29,452	3.3	6.4	7	348,302	325,878	0.62
South Carolina Digital Library	76,001	7,460	9.8	3	2	23,842	18,110	0.72
The New York Public Library	1,169,576	208,472	17.8	1.7	1	69,210	52,002	0.62
The Portal to Texas History	477,639	58	0	11	10	104,566	87,076	0.49
United States Government Printing Office (GPO)	148,715	1,794	1.2	3.1	2	174,067	105,389	0.92
University of Illinois at Urbana-Champaign	18,103	4,221	23.3	3.8	0	6,183	3,076	0.63
University of Southern California. Libraries	301,325	35,106	11.7	2.9	2	65,958	51,822	0.59
University of Virginia Library	30,188	229	0.8	3.2	1	3,736	2,425	0.60

Looking at the row for The Portal to Texas History, we can see that of the 477,639 records in the dataset, 58 do not have any subjects, which is a very small percentage (0.01214306202% to be exact). From there we can go to the average of 11 subjects per record with a mode of 10. Nothing earth shaking here, just more info. There are 104,566 unique subjects in the Portal’s dataset, with 87,076 of those being unique to the Portal. Finally, the entropy for the Portal’s subject field is 0.49. Compared to GPO’s, which is 0.92, you can interpret this to mean that the subject values are more “clumpy” for the Portal (a smaller number of subjects are used across a larger number of records) than for GPO (a larger number of subjects are used across records).

The following two tables further illustrate the entropy values for the Portal’s and GPO’s subjects. The first table shows the top ten subjects, and the number of records with those subjects, from GPO’s dataset.

National security--United States	1,138
United States. Congress. House--Rules and practice	748
Terrorism--United States--Prevention	718
United States. Department of Defense--Appropriations and expenditures	631
United States	536
Social security--United States--Periodicals	487
Emergency management--United States	485
Medicare	441
Consumer protection--United States	417
Wisconsin--Maps	406

Now take a look at the top ten subjects and their counts for the Portal.

Places	310,404
United States	306,597
Texas	305,551
Business, Economics and Finance	248,455
Communications	223,783
Newspapers	221,422
Advertising	218,527
Journalism	217,737
Landscape and Nature	76,308
Geography and Maps	70,742

So with the entropy value, you can read a lower number as being more like the Portal’s subjects and a higher number as being more like GPO’s. At the extremes, a value of 1.0 would mean that every subject is used by only one record, and a value of 0 would mean that there is only one subject, with all of the records using it.
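This post does not spell out the exact entropy formula, so the following is only a sketch under a stated assumption: Shannon entropy over the subject usage counts, divided by the log of the number of distinct subjects so that the result falls between 0 and 1. The `normalized_entropy` function and the sample counts are hypothetical, and this normalization may not be the one behind the table above.

```python
import math
from collections import Counter


def normalized_entropy(subject_counts):
    """Shannon entropy of subject usage, scaled to the range 0..1.

    `subject_counts` maps each subject string to the number of records
    that use it. Dividing by log(number of distinct subjects) is an
    assumption made for this sketch.
    """
    n = len(subject_counts)
    if n <= 1:
        # Zero or one subject: no uncertainty at all.
        return 0.0
    total = sum(subject_counts.values())
    h = -sum((c / total) * math.log(c / total) for c in subject_counts.values())
    return h / math.log(n)


# Hypothetical data: evenly used subjects vs. one dominant subject.
spread = normalized_entropy(Counter(["a", "b", "c", "d"]))
clumpy = normalized_entropy(Counter(["a"] * 97 + ["b", "c", "d"]))
print(spread, clumpy)
```

With this normalization, any evenly used set of subjects scores at the top of the range, which includes the case where every subject is used by only one record, while a field dominated by a few heavily used subjects scores closer to 0.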

Shared Subjects

In creating the table above I had to work out the number of subjects that are unique to each Hub. In doing so I went ahead and calculated this number for the whole dataset to find out how much subject overlap occurs.

The table below displays the breakdown of how subjects are distributed across Hub collections. For example, if two Hubs have the subject “Laws of Texas” then it is said to be shared by two Hubs. The breakdown for the metadata in the DPLA is as follows.

# of Hubs with subject	Count
1	1,717,512
2	114,047
3	21,126
4	8,013
5	3,905
6	2,187
7	1,330
8	970
9	689
10	494
11	405
12	302
13	245
14	199
15	152
16	117
17	63
18	62
19	32
20	20
21	7
22	7

Most of the subjects, 1,717,512 to be exact, occur in only one Hub’s collection.
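The counting behind this breakdown, and behind the per-Hub count of unique subjects, can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for this post; the `hub_share_breakdown` function, the Hub names, and the subject sets are hypothetical, and it assumes subjects are compared as exact strings.

```python
from collections import Counter


def hub_share_breakdown(hub_subjects):
    """Count how many Hubs use each subject.

    `hub_subjects` maps a Hub name to the set of subject strings
    found in that Hub's records.
    """
    hubs_per_subject = Counter()
    for subjects in hub_subjects.values():
        # set() guards against a Hub listing the same subject twice.
        hubs_per_subject.update(set(subjects))

    # Number of Hubs sharing a subject -> number of such subjects.
    breakdown = Counter(hubs_per_subject.values())

    # Subjects that appear in exactly one Hub, counted per Hub.
    unique_to_hub = {
        hub: sum(1 for s in set(subjects) if hubs_per_subject[s] == 1)
        for hub, subjects in hub_subjects.items()
    }
    return breakdown, unique_to_hub


# Hypothetical sample data for three Hubs.
hubs = {
    "Hub A": {"Laws of Texas", "Maps"},
    "Hub B": {"Laws of Texas", "Medicare"},
    "Hub C": {"Maps", "Railroads"},
}
breakdown, unique_to_hub = hub_share_breakdown(hubs)
print(dict(breakdown), unique_to_hub)
```

In the sample, “Laws of Texas” and “Maps” are each shared by two Hubs, while “Medicare” and “Railroads” each occur in only one.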

There are seven different subjects that are common across 22 of the 23 Hubs in the DPLA metadata dataset. If you are curious, these subjects are the following:

There should be one final post in this series where I can hopefully suggest what we should do with this data.

Again, if you want to chat about this post, hit me up on Twitter.