This is yet another post in the DPLA Metadata Analysis series, which already has three parts; here are links to parts one, two, and three.
This post looks at how basic normalization of subjects affects the metrics discussed in the previous posts.
Background
One of the things that happens in library land is that subject headings are often constructed by connecting various broader pieces into a single subject string that becomes more specific. For example, the heading “Children--Texas.” is constructed from two different pieces, “Children” and “Texas”. If we had a record that was about children in Oklahoma, it could be represented as “Children--Oklahoma.”.
The analysis I did earlier took each subject exactly as it occurred in the dataset and used that for the analysis. I was asked what would happen if we normalized the subjects before doing the analysis, effectively turning the unique string “Children--Texas.” into the two subject pieces “Children” and “Texas”, and then applied the previous analysis to the new data. The specific normalization consists of stripping trailing periods and then splitting on double hyphens.
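As a rough illustration of that normalization (a sketch, not the code used for this analysis), the two steps could look like the following in Python. The function name `normalize_subject` is hypothetical, and I am assuming the separator is the literal two-character `--` and that surrounding whitespace can be dropped.

```python
def normalize_subject(subject: str) -> list[str]:
    """Strip trailing periods, then split on double hyphens.

    Assumption: the separator is the literal string "--".
    Note that rstrip(".") also removes the final period of an
    abbreviation at the end of the string (e.g. "U.S." becomes "U.S").
    """
    stripped = subject.strip().rstrip(".")
    # Split into pieces, trimming whitespace and dropping empty pieces.
    return [piece.strip() for piece in stripped.split("--") if piece.strip()]


print(normalize_subject("Children--Texas."))  # ['Children', 'Texas']
```

A heading without a double hyphen simply comes back as a one-element list with its trailing period removed.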
Note: Because this conversion can introduce quite a bit of duplication among the subjects within a record, I am making the normalized subjects unique before adding them to the index. I also apply this same method to the un-normalized subjects. In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because there were 347 values that appeared in the subject list more than once. Because of this, the numbers in the resulting tables will differ slightly from those in the first three posts when it comes to average subjects and total subjects; each of these values should go down.
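To make the de-duplication step concrete, here is a minimal sketch of making subjects unique within a single record, both before and after normalization. The helper names are hypothetical, the separator is assumed to be a literal `--`, and keeping first-seen order is my own choice rather than something the analysis requires.

```python
def unique_in_order(subjects):
    """Remove repeated values, keeping the first occurrence of each."""
    return list(dict.fromkeys(subjects))


def normalized_unique(subjects):
    """Strip trailing periods, split on "--", then de-duplicate."""
    pieces = []
    for subject in subjects:
        pieces.extend(p.strip() for p in subject.rstrip(".").split("--"))
    return unique_in_order(p for p in pieces if p)


record = ["Children--Texas.", "Children--Oklahoma.", "Children--Texas."]
print(unique_in_order(record))    # ['Children--Texas.', 'Children--Oklahoma.']
print(normalized_unique(record))  # ['Children', 'Texas', 'Oklahoma']
```

In this toy record, three subject strings become two unique strings, or three unique pieces once normalized, because “Children” is only counted once.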
Predictions
My predictions before the analysis are that we will see an increase in the number of unique subjects, a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.
Results
With the normalization of subjects, the number of unique subject headings dropped from 1,871,884 to 1,162,491, a reduction of 38%.
In addition to this reduction, the distribution of subjects across the Hubs changed significantly, in one case increasing by 443%. The table below displays these numbers before and after normalization as well as the percentage change.
# of Hubs with Subject | # of Subjects | # of Normalized Subjects | % Change |
1 | 1,717,512 | 1,055,561 | -39% |
2 | 114,047 | 60,981 | -47% |
3 | 21,126 | 20,172 | -5% |
4 | 8,013 | 9,483 | 18% |
5 | 3,905 | 5,130 | 31% |
6 | 2,187 | 3,094 | 41% |
7 | 1,330 | 2,024 | 52% |
8 | 970 | 1,481 | 53% |
9 | 689 | 1,080 | 57% |
10 | 494 | 765 | 55% |
11 | 405 | 571 | 41% |
12 | 302 | 453 | 50% |
13 | 245 | 413 | 69% |
14 | 199 | 340 | 71% |
15 | 152 | 261 | 72% |
16 | 117 | 205 | 75% |
17 | 63 | 152 | 141% |
18 | 62 | 130 | 110% |
19 | 32 | 77 | 141% |
20 | 20 | 55 | 175% |
21 | 7 | 38 | 443% |
22 | 7 | 23 | 229% |
23 | 0 | 2 | N/A |
The two subjects that are shared across 23 of the Hubs once normalized are “Education” and “United States”.
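For readers who want to reproduce this kind of table on their own data, one way to count how many subjects appear in exactly one Hub, two Hubs, and so on is sketched below. This is not the code behind the numbers above; the function names and the `(hub, subjects)` input shape are assumptions made for the example.

```python
from collections import Counter, defaultdict


def hubs_per_subject(records):
    """Map each subject to the set of hubs that use it.

    records: iterable of (hub_name, list_of_subjects) pairs.
    """
    hubs = defaultdict(set)
    for hub, subjects in records:
        for subject in subjects:
            hubs[subject].add(hub)
    return hubs


def sharing_distribution(records):
    """Count how many subjects appear in exactly 1, 2, 3, ... hubs."""
    return Counter(len(h) for h in hubs_per_subject(records).values())


sample = [
    ("Hub A", ["Children", "Texas"]),
    ("Hub B", ["Children", "Oklahoma"]),
    ("Hub C", ["Children", "Texas"]),
]
print(sorted(sharing_distribution(sample).items()))  # [(1, 1), (2, 1), (3, 1)]
```

Here “Oklahoma” is used by one Hub, “Texas” by two, and “Children” by all three, so each row of the distribution has a count of one.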
The high level stats for all 8,012,390 records are available in the following table.
Records | Total Subject Strings Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change |
8,012,390 | 23,860,080 | 28,644,188 | 2.98 | 3.57 | 20.05% |
You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.
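The percent change figures in these tables can be checked with a one-line calculation. The helper below is only a convenience for that arithmetic (the name `percent_change` is mine); the input numbers are taken from the tables in this post.

```python
def percent_change(before, after):
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100


# Total subject strings before and after normalization.
print(round(percent_change(23_860_080, 28_644_188), 2))  # 20.05
# Unique subject headings before and after normalization.
print(round(percent_change(1_871_884, 1_162_491)))       # -38
```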
Results by Hub
The table below presents data for each Hub in the DPLA. The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and finally the percent change in total subjects after normalization.
Hub | Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change |
ARTstor | 56,342 | 194,883 | 202,220 | 3.46 | 3.59 | 3.76 |
Biodiversity Heritage Library | 138,288 | 453,843 | 452,007 | 3.28 | 3.27 | -0.40 |
David Rumsey | 48,132 | 22,976 | 22,976 | 0.48 | 0.48 | 0 |
Digital Commonwealth | 124,804 | 295,778 | 336,935 | 2.37 | 2.7 | 13.91 |
Digital Library of Georgia | 259,640 | 1,151,351 | 1,783,884 | 4.43 | 6.87 | 54.94 |
Harvard Library | 10,568 | 26,641 | 36,511 | 2.52 | 3.45 | 37.05 |
HathiTrust | 1,915,159 | 2,608,567 | 4,154,244 | 1.36 | 2.17 | 59.25 |
Internet Archive | 208,953 | 363,634 | 412,640 | 1.74 | 1.97 | 13.48 |
J. Paul Getty Trust | 92,681 | 32,949 | 43,590 | 0.36 | 0.47 | 32.30 |
Kentucky Digital Library | 127,755 | 26,008 | 27,561 | 0.2 | 0.22 | 5.97 |
Minnesota Digital Library | 40,533 | 202,456 | 211,539 | 4.99 | 5.22 | 4.49 |
Missouri Hub | 41,557 | 97,111 | 117,933 | 2.34 | 2.84 | 21.44 |
Mountain West Digital Library | 867,538 | 2,636,219 | 3,552,268 | 3.04 | 4.09 | 34.75 |
National Archives and Records Administration | 700,952 | 231,513 | 231,513 | 0.33 | 0.33 | 0 |
North Carolina Digital Heritage Center | 260,709 | 866,697 | 1,207,488 | 3.32 | 4.63 | 39.32 |
Smithsonian Institution | 897,196 | 5,689,135 | 5,686,107 | 6.34 | 6.34 | -0.05 |
South Carolina Digital Library | 76,001 | 231,267 | 355,504 | 3.04 | 4.68 | 53.72 |
The New York Public Library | 1,169,576 | 1,995,817 | 2,515,252 | 1.71 | 2.15 | 26.03 |
The Portal to Texas History | 477,639 | 5,255,588 | 5,410,963 | 11 | 11.33 | 2.96 |
United States Government Printing Office (GPO) | 148,715 | 456,363 | 768,830 | 3.07 | 5.17 | 68.47 |
University of Illinois at Urbana-Champaign | 18,103 | 67,954 | 85,263 | 3.75 | 4.71 | 25.47 |
University of Southern California. Libraries | 301,325 | 859,868 | 905,465 | 2.85 | 3 | 5.30 |
University of Virginia Library | 30,188 | 93,378 | 123,405 | 3.09 | 4.09 | 32.16 |
The number of unique subjects before and after subject normalization is presented in the table below. The percent of change is also included in the final column.
Hub | Unique Subjects | Unique Normalized Subjects | % Change Unique |
ARTstor | 9,560 | 9,546 | -0.15 |
Biodiversity Heritage Library | 22,004 | 22,005 | 0 |
David Rumsey | 123 | 123 | 0 |
Digital Commonwealth | 41,704 | 39,557 | -5.15 |
Digital Library of Georgia | 132,160 | 88,200 | -33.26 |
Harvard Library | 9,257 | 6,210 | -32.92 |
HathiTrust | 685,733 | 272,340 | -60.28 |
Internet Archive | 56,911 | 49,117 | -13.70 |
J. Paul Getty Trust | 2,777 | 2,560 | -7.81 |
Kentucky Digital Library | 1,972 | 1,831 | -7.15 |
Minnesota Digital Library | 24,472 | 24,325 | -0.60 |
Missouri Hub | 6,893 | 6,757 | -1.97 |
Mountain West Digital Library | 227,755 | 172,663 | -24.19 |
National Archives and Records Administration | 7,086 | 7,086 | 0 |
North Carolina Digital Heritage Center | 99,258 | 79,353 | -20.05 |
Smithsonian Institution | 348,302 | 346,096 | -0.63 |
South Carolina Digital Library | 23,842 | 17,516 | -26.53 |
The New York Public Library | 69,210 | 36,709 | -46.96 |
The Portal to Texas History | 104,566 | 97,441 | -6.81 |
United States Government Printing Office (GPO) | 174,067 | 48,537 | -72.12 |
University of Illinois at Urbana-Champaign | 6,183 | 5,724 | -7.42 |
University of Southern California. Libraries | 65,958 | 64,021 | -2.94 |
University of Virginia Library | 3,736 | 3,664 | -1.93 |
The table below presents the number and percentage of each Hub's unique subjects, before and after normalization, that are not used by any other Hub.
Hub | Subjects Unique to Hub | Normalized Subject Unique to Hub | % Subjects Unique to Hub | % Normalized Subjects Unique to Hub | % Change |
ARTstor | 4,941 | 4,806 | 52 | 50 | -4 |
Biodiversity Heritage Library | 9,136 | 6,929 | 42 | 31 | -26 |
David Rumsey | 30 | 28 | 24 | 23 | -4 |
Digital Commonwealth | 31,094 | 27,712 | 75 | 70 | -7 |
Digital Library of Georgia | 114,689 | 67,768 | 87 | 77 | -11 |
Harvard Library | 7,204 | 3,238 | 78 | 52 | -33 |
HathiTrust | 570,292 | 200,652 | 83 | 74 | -11 |
Internet Archive | 28,978 | 23,387 | 51 | 48 | -6 |
J. Paul Getty Trust | 1,852 | 1,337 | 67 | 52 | -22 |
Kentucky Digital Library | 1,337 | 1,111 | 68 | 61 | -10 |
Minnesota Digital Library | 17,545 | 17,145 | 72 | 70 | -3 |
Missouri Hub | 4,338 | 3,783 | 63 | 56 | -11 |
Mountain West Digital Library | 192,501 | 134,870 | 85 | 78 | -8 |
National Archives and Records Administration | 3,589 | 3,399 | 51 | 48 | -6 |
North Carolina Digital Heritage Center | 84,203 | 62,406 | 85 | 79 | -7 |
Smithsonian Institution | 325,878 | 322,945 | 94 | 93 | -1 |
South Carolina Digital Library | 18,110 | 9,767 | 76 | 56 | -26 |
The New York Public Library | 52,002 | 18,075 | 75 | 49 | -35 |
The Portal to Texas History | 87,076 | 78,153 | 83 | 80 | -4 |
United States Government Printing Office (GPO) | 105,389 | 15,702 | 61 | 32 | -48 |
University of Illinois at Urbana-Champaign | 3,076 | 2,322 | 50 | 41 | -18 |
University of Southern California. Libraries | 51,822 | 48,889 | 79 | 76 | -4 |
University of Virginia Library | 2,425 | 1,134 | 65 | 31 | -52 |
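A sketch of how the “unique to Hub” counts could be computed is shown below. It assumes each Hub's unique subjects are already available as a Python set; the function name and input shape are hypothetical and not taken from the analysis itself.

```python
from collections import defaultdict


def unique_to_hub(hub_subjects):
    """For each hub, count the subjects that no other hub uses.

    hub_subjects: dict mapping hub name -> set of that hub's subjects.
    Returns a dict mapping hub name -> (count, rounded percent of the
    hub's own unique subjects).
    """
    owners = defaultdict(set)
    for hub, subjects in hub_subjects.items():
        for subject in subjects:
            owners[subject].add(hub)

    result = {}
    for hub, subjects in hub_subjects.items():
        only_here = [s for s in subjects if len(owners[s]) == 1]
        pct = round(100 * len(only_here) / len(subjects)) if subjects else 0
        result[hub] = (len(only_here), pct)
    return result


sample = {
    "Hub A": {"Children", "Texas", "Cattle"},
    "Hub B": {"Children", "Oklahoma"},
}
print(unique_to_hub(sample))  # {'Hub A': (2, 67), 'Hub B': (1, 50)}
```

Running the same function on the raw subject strings and on the normalized pieces gives the before and after columns for a table like the one above.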
Conclusion
Overall, there was a 20% increase in the total occurrences of subject strings in the dataset when subject normalization was applied. The total number of unique subjects decreased significantly (38%) after subject normalization. Hubs that are heavy users of LCSH subject headings are easy to identify because the percent change in their number of unique subjects before and after normalization is quite high; examples include HathiTrust and the Government Printing Office. For many of the Hubs, normalization significantly reduced the number and percentage of subjects that were unique to that Hub.
I hope you found this post interesting. If you want to chat about the topic, hit me up on Twitter.