Effects of subject normalization on DPLA Hubs

In the previous post I walked through some of the different ways we could normalize a subject string and looked at the effects these normalizations had on the subjects in the full DPLA metadata dataset I have been using.

In this post I want to continue along those lines and look at what happens when these normalizations are applied to the subjects in the dataset, this time at the Hub level rather than across the whole dataset.

I applied the normalizations mentioned in the previous post to the subjects from each of the Hubs in the DPLA dataset: total values, unique (un-normalized) values, case folded, lowercased, NACO, Porter stemmed, and fingerprint.  Each normalization was applied to the output of the previous one as a series; here is what the normalization chain looked like at each step.

total
total > unique
total > unique > case folded
total > unique > case folded > lowercased
total > unique > case folded > lowercased > NACO
total > unique > case folded > lowercased > NACO > Porter
total > unique > case folded > lowercased > NACO > Porter > fingerprint
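As a rough sketch, the chain above can be implemented along these lines.  The NACO and fingerprint functions here are simplified stand-ins (the real NACO rules are more involved), and the Porter step is omitted since it needs a stemmer such as NLTK's, so exact counts from this sketch will differ from the real pipeline:

```python
import unicodedata

def naco_normalize(s):
    # Simplified NACO-style normalization (assumption: the real NACO
    # rules differ): strip diacritics, replace punctuation with spaces,
    # collapse whitespace.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = "".join(c if c.isalnum() or c.isspace() else " " for c in s)
    return " ".join(s.split())

def fingerprint(s):
    # OpenRefine-style fingerprint key: sort the unique tokens.
    return " ".join(sorted(set(s.split())))

def normalization_chain(subjects):
    """Apply each normalization to the output of the previous one,
    recording the number of distinct values remaining at each step."""
    counts = {"total": len(subjects)}
    values = set(subjects)                        # total > unique
    counts["unique"] = len(values)
    values = {v.casefold() for v in values}       # > case folded
    counts["folded"] = len(values)
    values = {v.lower() for v in values}          # > lowercased
    counts["lowercase"] = len(values)
    values = {naco_normalize(v) for v in values}  # > NACO
    counts["naco"] = len(values)
    # > Porter stemming would go here (e.g. nltk's PorterStemmer per token)
    values = {fingerprint(v) for v in values}     # > fingerprint
    counts["fingerprint"] = len(values)
    return counts
```

For example, `normalization_chain(["Dogs", "dogs", "Dogs.", "Cats, Dogs", "Dogs, Cats"])` collapses five raw values down to two fingerprint keys.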

The number of subjects after each normalization is presented in the first table below.

| Hub Name | Total Subjects | Unique Subjects | Folded | Lowercase | NACO | Porter | Fingerprint |
|---|---:|---:|---:|---:|---:|---:|---:|
| ARTstor | 194,883 | 9,560 | 9,559 | 9,514 | 9,483 | 8,319 | 8,278 |
| Biodiversity_Heritage_Library | 451,999 | 22,004 | 22,003 | 22,002 | 21,865 | 21,482 | 21,384 |
| David_Rumsey | 22,976 | 123 | 123 | 122 | 121 | 121 | 121 |
| Digital_Commonwealth | 295,778 | 41,704 | 41,694 | 41,419 | 40,998 | 40,095 | 39,950 |
| Digital_Library_of_Georgia | 1,151,351 | 132,160 | 132,157 | 131,656 | 131,171 | 130,289 | 129,724 |
| Harvard_Library | 26,641 | 9,257 | 9,251 | 9,248 | 9,236 | 9,229 | 9,059 |
| HathiTrust | 2,608,567 | 685,733 | 682,188 | 676,739 | 671,203 | 667,025 | 653,973 |
| Internet_Archive | 363,634 | 56,910 | 56,815 | 56,291 | 55,954 | 55,401 | 54,700 |
| J_Paul_Getty_Trust | 32,949 | 2,777 | 2,774 | 2,760 | 2,741 | 2,710 | 2,640 |
| Kentucky_Digital_Library | 26,008 | 1,972 | 1,972 | 1,959 | 1,900 | 1,898 | 1,892 |
| Minnesota_Digital_Library | 202,456 | 24,472 | 24,470 | 23,834 | 23,680 | 22,453 | 22,282 |
| Missouri_Hub | 97,111 | 6,893 | 6,893 | 6,850 | 6,792 | 6,724 | 6,696 |
| Mountain_West_Digital_Library | 2,636,219 | 227,755 | 227,705 | 223,500 | 220,784 | 214,197 | 210,771 |
| National_Archives_and_Records_Administration | 231,513 | 7,086 | 7,086 | 7,085 | 7,085 | 7,050 | 7,045 |
| North_Carolina_Digital_Heritage_Center | 866,697 | 99,258 | 99,254 | 99,020 | 98,486 | 97,993 | 97,297 |
| Smithsonian_Institution | 5,689,135 | 348,302 | 348,043 | 347,595 | 346,499 | 344,018 | 337,209 |
| South_Carolina_Digital_Library | 231,267 | 23,842 | 23,838 | 23,656 | 23,291 | 23,101 | 22,993 |
| The_New_York_Public_Library | 1,995,817 | 69,210 | 69,185 | 69,165 | 69,091 | 68,767 | 68,566 |
| The_Portal_to_Texas_History | 5,255,588 | 104,566 | 104,526 | 103,208 | 102,195 | 98,591 | 97,589 |
| United_States_Government_Printing_Office_(GPO) | 456,363 | 174,067 | 174,063 | 173,554 | 173,353 | 172,761 | 170,103 |
| University_of_Illinois_at_Urbana-Champaign | 67,954 | 6,183 | 6,182 | 6,150 | 6,134 | 6,026 | 6,010 |
| University_of_Southern_California_Libraries | 859,868 | 65,958 | 65,882 | 65,470 | 64,714 | 62,092 | 61,553 |
| University_of_Virginia_Library | 93,378 | 3,736 | 3,736 | 3,672 | 3,660 | 3,625 | 3,618 |

Here is a table showing the percentage reduction after the field is normalized with each algorithm.  The percent reduction makes the differences a little easier to interpret.

| Hub Name | Folded Normalization | Lowercase Normalization | NACO Normalization | Porter Normalization | Fingerprint Normalization |
|---|---:|---:|---:|---:|---:|
| ARTstor | 0.0% | 0.5% | 0.8% | 13.0% | 13.4% |
| Biodiversity_Heritage_Library | 0.0% | 0.0% | 0.6% | 2.4% | 2.8% |
| David_Rumsey | 0.0% | 0.8% | 1.6% | 1.6% | 1.6% |
| Digital_Commonwealth | 0.0% | 0.7% | 1.7% | 3.9% | 4.2% |
| Digital_Library_of_Georgia | 0.0% | 0.4% | 0.7% | 1.4% | 1.8% |
| Harvard_Library | 0.1% | 0.1% | 0.2% | 0.3% | 2.1% |
| HathiTrust | 0.5% | 1.3% | 2.1% | 2.7% | 4.6% |
| Internet_Archive | 0.2% | 1.1% | 1.7% | 2.7% | 3.9% |
| J_Paul_Getty_Trust | 0.1% | 0.6% | 1.3% | 2.4% | 4.9% |
| Kentucky_Digital_Library | 0.0% | 0.7% | 3.7% | 3.8% | 4.1% |
| Minnesota_Digital_Library | 0.0% | 2.6% | 3.2% | 8.3% | 8.9% |
| Missouri_Hub | 0.0% | 0.6% | 1.5% | 2.5% | 2.9% |
| Mountain_West_Digital_Library | 0.0% | 1.9% | 3.1% | 6.0% | 7.5% |
| National_Archives_and_Records_Administration | 0.0% | 0.0% | 0.0% | 0.5% | 0.6% |
| North_Carolina_Digital_Heritage_Center | 0.0% | 0.2% | 0.8% | 1.3% | 2.0% |
| Smithsonian_Institution | 0.1% | 0.2% | 0.5% | 1.2% | 3.2% |
| South_Carolina_Digital_Library | 0.0% | 0.8% | 2.3% | 3.1% | 3.6% |
| The_New_York_Public_Library | 0.0% | 0.1% | 0.2% | 0.6% | 0.9% |
| The_Portal_to_Texas_History | 0.0% | 1.3% | 2.3% | 5.7% | 6.7% |
| United_States_Government_Printing_Office_(GPO) | 0.0% | 0.3% | 0.4% | 0.8% | 2.3% |
| University_of_Illinois_at_Urbana-Champaign | 0.0% | 0.5% | 0.8% | 2.5% | 2.8% |
| University_of_Southern_California_Libraries | 0.1% | 0.7% | 1.9% | 5.9% | 6.7% |
| University_of_Virginia_Library | 0.0% | 1.7% | 2.0% | 3.0% | 3.2% |
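Each percentage is simply the cumulative reduction from the unique-subject count in the first table, which can be reproduced directly (the numbers below are copied from the tables above):

```python
def pct_reduction(unique, normalized):
    """Percentage reduction from the unique count to the count after
    a given normalization step, rounded to one decimal place."""
    return round(100 * (unique - normalized) / unique, 1)

# ARTstor: 9,560 unique subjects, 8,278 after the full chain through fingerprint
print(pct_reduction(9560, 8278))    # → 13.4

# HathiTrust: 685,733 unique subjects, 653,973 after fingerprint
print(pct_reduction(685733, 653973))  # → 4.6
```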

Here is the same data presented as a graph, which I think shows the pattern even better.

Reduction Percent after Normalization


You can see that for many of the Hubs the biggest reductions happen with the Porter normalization and the Fingerprint normalization.  One Hub of note is ARTstor, which had the highest percentage of reduction of all the Hubs.  This was primarily caused by the Porter normalization, which means a large percentage of its subjects stemmed to the same stem; often this is singular vs. plural versions of the same subject.  This may be completely valid for how ARTstor chose to create its metadata, but it is still interesting.

Another Hub I found interesting in this data was Harvard, where the biggest reduction happened with the Fingerprint normalization.  This might suggest that there are a number of values containing the same words in a different order, for example names that occur in both inverted and non-inverted form.
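A quick sketch of an OpenRefine-style fingerprint shows why inverted and non-inverted name forms collapse to the same key (this is a simplified version of the real fingerprint algorithm, which also handles diacritics):

```python
def fingerprint(s):
    # Simplified fingerprint key: lowercase, replace punctuation with
    # spaces, then join the sorted unique tokens.
    s = "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower())
    return " ".join(sorted(set(s.split())))

# Inverted and non-inverted forms produce the same key:
print(fingerprint("Smith, John"))  # → john smith
print(fingerprint("John Smith"))   # → john smith
```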

In the end I’m not sure how helpful this is as an indicator of quality within a field.  Some fields would benefit from this sort of normalization more than others; for example, subject, creator, contributor, and publisher fields will normalize very differently than a field like title or description.

Let me know what you think via Twitter if you have questions or comments.