DPLA Metadata Analysis: Part 4 – Normalized Subjects

This is the fourth post in the DPLA Metadata Analysis series, following parts one, two, and three.

This post looks at how basic normalization of subjects affects the metrics discussed in the previous posts.

Background

In library land, subject headings are often constructed by connecting several broader pieces into a single, more specific subject string. For example, the heading “Children--Texas.” is constructed from two pieces, “Children” and “Texas”. A record about children in Oklahoma could be represented as “Children--Oklahoma.”.

The earlier analysis took each subject exactly as it occurred in the dataset. I was asked what would happen if we normalized the subjects first, effectively turning the single string “Children--Texas.” into the two subject pieces “Children” and “Texas”, and then applied the previous analysis to the new data. The normalization consists of stripping trailing periods and then splitting on double hyphens.
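As a rough illustration, the normalization described above could be sketched in Python like this. The function name `normalize_subject` is made up for this example and is not taken from the original analysis code. The sketch assumes the separator is a literal double hyphen (`--`); trimming whitespace and dropping empty pieces are also assumptions on my part.

```python
def normalize_subject(subject):
    """Split one subject heading into its component pieces.

    Hypothetical helper: strips trailing periods, then splits on "--".
    """
    # Remove surrounding whitespace and any trailing periods.
    cleaned = subject.strip().rstrip(".")
    # Split on the double hyphen, trim each piece, and skip empty pieces.
    return [piece.strip() for piece in cleaned.split("--") if piece.strip()]


print(normalize_subject("Children--Texas."))     # ['Children', 'Texas']
print(normalize_subject("Children--Oklahoma."))  # ['Children', 'Oklahoma']
```

Note that stripping every trailing period is a blunt rule: a heading ending in an abbreviation such as "U.S." would lose its final period too.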

Note: Because this conversion can introduce quite a bit of duplication among the subjects within a record, I am making the normalized subjects unique before adding them to the index. I also apply this same method to the un-normalized subjects. In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because 347 values appeared in its subject list more than once. Because of this, the numbers in the resulting tables will be slightly different from those in the first three posts when it comes to average subjects and total subjects. Each of these values should go down.
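The de-duplication step might look something like the sketch below. The helper name `unique_subjects` is hypothetical, and keeping the first occurrence in its original order is my assumption; the note above only says the subjects are made unique.

```python
def unique_subjects(subjects):
    """Return the subjects of one record with duplicates removed, order kept."""
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return list(dict.fromkeys(subjects))


# Toy record, not taken from the DPLA dataset.
record = ["Children--Texas.", "Children--Oklahoma."]
pieces = [p for s in record for p in s.rstrip(".").split("--")]
print(pieces)                   # ['Children', 'Texas', 'Children', 'Oklahoma']
print(unique_subjects(pieces))  # ['Children', 'Texas', 'Oklahoma']
```

Splitting two headings that share the piece "Children" yields that piece twice, which is exactly the duplication the note is guarding against.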

Predictions

My predictions before the analysis are that we will see an increase in the number of unique subjects,  a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.

Results

With subject normalization, the number of unique subject headings dropped from 1,871,884 to 1,162,491, a reduction of 38%.

In addition to this 38% reduction in the total number of unique subject headings, the distribution of subjects across the Hubs changed significantly, with an increase of 443% in one case. The table below shows how many subjects are shared by a given number of Hubs before and after normalization, along with the percent change.
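For reference, the percent change figures in this post can be reproduced with the usual relative-change formula. The helper name `percent_change` is just for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


# Unique subject headings before and after normalization.
change = percent_change(1_871_884, 1_162_491)
print(round(change))  # -38
```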

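| # of Hubs with Subject | # of Subjects | # of Normalized Subjects | % Change |
|---|---|---|---|
| 1 | 1,717,512 | 1,055,561 | -39% |
| 2 | 114,047 | 60,981 | -47% |
| 3 | 21,126 | 20,172 | -5% |
| 4 | 8,013 | 9,483 | 18% |
| 5 | 3,905 | 5,130 | 31% |
| 6 | 2,187 | 3,094 | 41% |
| 7 | 1,330 | 2,024 | 52% |
| 8 | 970 | 1,481 | 53% |
| 9 | 689 | 1,080 | 57% |
| 10 | 494 | 765 | 55% |
| 11 | 405 | 571 | 41% |
| 12 | 302 | 453 | 50% |
| 13 | 245 | 413 | 69% |
| 14 | 199 | 340 | 71% |
| 15 | 152 | 261 | 72% |
| 16 | 117 | 205 | 75% |
| 17 | 63 | 152 | 141% |
| 18 | 62 | 130 | 110% |
| 19 | 32 | 77 | 141% |
| 20 | 20 | 55 | 175% |
| 21 | 7 | 38 | 443% |
| 22 | 7 | 23 | 229% |
| 23 | 0 | 2 | N/A |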

The two subjects that are shared across all 23 Hubs once normalized are “Education” and “United States”.
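A table like the one above can be derived by counting, for each subject, how many Hubs use it, and then counting how many subjects land in each bucket. The sketch below uses made-up toy data and a hypothetical function name; it is not the code used to produce the numbers in this post.

```python
from collections import Counter


def hub_count_distribution(subjects_by_hub):
    """Map "number of hubs sharing a subject" to "number of such subjects"."""
    # Count, for every subject, how many hubs use it at least once.
    hubs_per_subject = Counter()
    for subjects in subjects_by_hub.values():
        hubs_per_subject.update(set(subjects))
    # Then count how many subjects fall into each hub-count bucket.
    return Counter(hubs_per_subject.values())


# Toy data, not taken from the DPLA dataset.
toy = {
    "Hub A": {"Education", "Texas", "Children"},
    "Hub B": {"Education", "Texas"},
    "Hub C": {"Education", "Maps"},
}
print(sorted(hub_count_distribution(toy).items()))  # [(1, 2), (2, 1), (3, 1)]
```

In the toy data, "Education" plays the role that “Education” and “United States” play in the real dataset: it is the one subject every hub shares.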

The high-level stats for all 8,012,390 records are available in the following table.

| Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change |
|---|---|---|---|---|---|
| 8,012,390 | 23,860,080 | 28,644,188 | 2.98 | 3.57 | 20.05% |

You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.
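The per-record averages are simply the total subject string counts divided by the number of records. A quick check against the figures in the table, using a helper name that is only for illustration:

```python
def average_per_record(total_subjects, records):
    """Average number of subject strings per record."""
    return total_subjects / records


records = 8_012_390
print(round(average_per_record(23_860_080, records), 2))  # 2.98
print(round(average_per_record(28_644_188, records), 2))  # 3.57
```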

Results by Hub

The table below presents data for each Hub in the DPLA. The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and finally the percent change in the total subject count after normalization.

| Hub | Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change |
|---|---|---|---|---|---|---|
| ARTstor | 56,342 | 194,883 | 202,220 | 3.46 | 3.59 | 3.76 |
| Biodiversity Heritage Library | 138,288 | 453,843 | 452,007 | 3.28 | 3.27 | -0.40 |
| David Rumsey | 48,132 | 22,976 | 22,976 | 0.48 | 0.48 | 0 |
| Digital Commonwealth | 124,804 | 295,778 | 336,935 | 2.37 | 2.7 | 13.91 |
| Digital Library of Georgia | 259,640 | 1,151,351 | 1,783,884 | 4.43 | 6.87 | 54.94 |
| Harvard Library | 10,568 | 26,641 | 36,511 | 2.52 | 3.45 | 37.05 |
| HathiTrust | 1,915,159 | 2,608,567 | 4,154,244 | 1.36 | 2.17 | 59.25 |
| Internet Archive | 208,953 | 363,634 | 412,640 | 1.74 | 1.97 | 13.48 |
| J. Paul Getty Trust | 92,681 | 32,949 | 43,590 | 0.36 | 0.47 | 32.30 |
| Kentucky Digital Library | 127,755 | 26,008 | 27,561 | 0.2 | 0.22 | 5.97 |
| Minnesota Digital Library | 40,533 | 202,456 | 211,539 | 4.99 | 5.22 | 4.49 |
| Missouri Hub | 41,557 | 97,111 | 117,933 | 2.34 | 2.84 | 21.44 |
| Mountain West Digital Library | 867,538 | 2,636,219 | 3,552,268 | 3.04 | 4.09 | 34.75 |
| National Archives and Records Administration | 700,952 | 231,513 | 231,513 | 0.33 | 0.33 | 0 |
| North Carolina Digital Heritage Center | 260,709 | 866,697 | 1,207,488 | 3.32 | 4.63 | 39.32 |
| Smithsonian Institution | 897,196 | 5,689,135 | 5,686,107 | 6.34 | 6.34 | -0.05 |
| South Carolina Digital Library | 76,001 | 231,267 | 355,504 | 3.04 | 4.68 | 53.72 |
| The New York Public Library | 1,169,576 | 1,995,817 | 2,515,252 | 1.71 | 2.15 | 26.03 |
| The Portal to Texas History | 477,639 | 5,255,588 | 5,410,963 | 11 | 11.33 | 2.96 |
| United States Government Printing Office (GPO) | 148,715 | 456,363 | 768,830 | 3.07 | 5.17 | 68.47 |
| University of Illinois at Urbana-Champaign | 18,103 | 67,954 | 85,263 | 3.75 | 4.71 | 25.47 |
| University of Southern California. Libraries | 301,325 | 859,868 | 905,465 | 2.85 | 3 | 5.30 |
| University of Virginia Library | 30,188 | 93,378 | 123,405 | 3.09 | 4.09 | 32.16 |

The number of unique subjects before and after subject normalization is presented in the table below.  The percent of change is also included in the final column.

| Hub | Unique Subjects | Unique Normalized Subjects | % Change Unique |
|---|---|---|---|
| ARTstor | 9,560 | 9,546 | -0.15 |
| Biodiversity Heritage Library | 22,004 | 22,005 | 0 |
| David Rumsey | 123 | 123 | 0 |
| Digital Commonwealth | 41,704 | 39,557 | -5.15 |
| Digital Library of Georgia | 132,160 | 88,200 | -33.26 |
| Harvard Library | 9,257 | 6,210 | -32.92 |
| HathiTrust | 685,733 | 272,340 | -60.28 |
| Internet Archive | 56,911 | 49,117 | -13.70 |
| J. Paul Getty Trust | 2,777 | 2,560 | -7.81 |
| Kentucky Digital Library | 1,972 | 1,831 | -7.15 |
| Minnesota Digital Library | 24,472 | 24,325 | -0.60 |
| Missouri Hub | 6,893 | 6,757 | -1.97 |
| Mountain West Digital Library | 227,755 | 172,663 | -24.19 |
| National Archives and Records Administration | 7,086 | 7,086 | 0 |
| North Carolina Digital Heritage Center | 99,258 | 79,353 | -20.05 |
| Smithsonian Institution | 348,302 | 346,096 | -0.63 |
| South Carolina Digital Library | 23,842 | 17,516 | -26.53 |
| The New York Public Library | 69,210 | 36,709 | -46.96 |
| The Portal to Texas History | 104,566 | 97,441 | -6.81 |
| United States Government Printing Office (GPO) | 174,067 | 48,537 | -72.12 |
| University of Illinois at Urbana-Champaign | 6,183 | 5,724 | -7.42 |
| University of Southern California. Libraries | 65,958 | 64,021 | -2.94 |
| University of Virginia Library | 3,736 | 3,664 | -1.93 |

The table below presents, for each Hub, the number and percentage of its unique subjects and unique normalized subjects that appear in no other Hub. The final column is the percent change between the two percentages.

| Hub | Subjects Unique to Hub | Normalized Subjects Unique to Hub | % Subjects Unique to Hub | % Normalized Subjects Unique to Hub | % Change |
|---|---|---|---|---|---|
| ARTstor | 4,941 | 4,806 | 52 | 50 | -4 |
| Biodiversity Heritage Library | 9,136 | 6,929 | 42 | 31 | -26 |
| David Rumsey | 30 | 28 | 24 | 23 | -4 |
| Digital Commonwealth | 31,094 | 27,712 | 75 | 70 | -7 |
| Digital Library of Georgia | 114,689 | 67,768 | 87 | 77 | -11 |
| Harvard Library | 7,204 | 3,238 | 78 | 52 | -33 |
| HathiTrust | 570,292 | 200,652 | 83 | 74 | -11 |
| Internet Archive | 28,978 | 23,387 | 51 | 48 | -6 |
| J. Paul Getty Trust | 1,852 | 1,337 | 67 | 52 | -22 |
| Kentucky Digital Library | 1,337 | 1,111 | 68 | 61 | -10 |
| Minnesota Digital Library | 17,545 | 17,145 | 72 | 70 | -3 |
| Missouri Hub | 4,338 | 3,783 | 63 | 56 | -11 |
| Mountain West Digital Library | 192,501 | 134,870 | 85 | 78 | -8 |
| National Archives and Records Administration | 3,589 | 3,399 | 51 | 48 | -6 |
| North Carolina Digital Heritage Center | 84,203 | 62,406 | 85 | 79 | -7 |
| Smithsonian Institution | 325,878 | 322,945 | 94 | 93 | -1 |
| South Carolina Digital Library | 18,110 | 9,767 | 76 | 56 | -26 |
| The New York Public Library | 52,002 | 18,075 | 75 | 49 | -35 |
| The Portal to Texas History | 87,076 | 78,153 | 83 | 80 | -4 |
| United States Government Printing Office (GPO) | 105,389 | 15,702 | 61 | 32 | -48 |
| University of Illinois at Urbana-Champaign | 3,076 | 2,322 | 50 | 41 | -18 |
| University of Southern California. Libraries | 51,822 | 48,889 | 79 | 76 | -4 |
| University of Virginia Library | 2,425 | 1,134 | 65 | 31 | -52 |
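One way to compute the "unique to Hub" figures is to subtract the union of every other Hub's subjects from a Hub's own set. This is a minimal sketch on toy data with a hypothetical function name, not the code behind the table above.

```python
def unique_to_hub(subjects_by_hub):
    """For each hub, return the subjects that no other hub uses."""
    result = {}
    for hub, subjects in subjects_by_hub.items():
        # Build the union of every other hub's subjects.
        others = set()
        for other_hub, other_subjects in subjects_by_hub.items():
            if other_hub != hub:
                others |= set(other_subjects)
        result[hub] = set(subjects) - others
    return result


# Toy data, not taken from the DPLA dataset.
toy = {
    "Hub A": {"Education", "Texas", "Children"},
    "Hub B": {"Education", "Texas"},
    "Hub C": {"Education", "Maps"},
}
for hub, only_here in unique_to_hub(toy).items():
    # Share of the hub's unique subjects that appear nowhere else.
    share = len(only_here) / len(toy[hub]) * 100
    print(hub, sorted(only_here), round(share))
# Hub A ['Children'] 33
# Hub B [] 0
# Hub C ['Maps'] 50
```

The nested loop is quadratic in the number of hubs, which is fine for 23 of them; a single pass that counts hubs per subject would scale better.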

Conclusion

Overall, the total number of subject string occurrences in the dataset increased by 20% when subject normalization was applied, while the total number of unique subjects decreased significantly (38%). It is easy to identify Hubs that are heavy users of LCSH subject headings, because the percent change in their number of unique subjects before and after normalization is quite high. Examples include HathiTrust (-60%) and the Government Printing Office (-72%). For many of the Hubs, normalization significantly reduced the number and percentage of subjects that were unique to that Hub.

I hope you found this post interesting. If you want to chat about the topic, hit me up on Twitter.