In the previous post I walked through some of the different ways that we could normalize a subject string and took a look at what effects these normalizations had on the subjects in the entire DPLA metadata dataset that I have been using.
In this post I want to continue along those lines and look at what happens when you apply these normalizations to the subjects in the dataset, but this time focusing on the Hub level instead of working with the whole dataset.
I applied the normalizations mentioned in the previous post to the subjects from each of the Hubs in the DPLA dataset. The counts I collected were total values, unique but un-normalized values, case folded, lowercased, NACO, Porter stemmed, and fingerprint. The normalizations were applied as a series, with each one running on the output of the previous one. Here is what the normalization chain looked like at each step.
total
total > unique
total > unique > case folded
total > unique > case folded > lowercased
total > unique > case folded > lowercased > NACO
total > unique > case folded > lowercased > NACO > Porter
total > unique > case folded > lowercased > NACO > Porter > fingerprint
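The chaining idea can be sketched in a few lines of Python. This is only an illustration of the mechanism and not the code used for these posts: the helper `chain_counts`, the two step functions, and the sample subjects are all made up for the example, and the punctuation step is a crude stand-in rather than a real NACO implementation.

```python
import string

def lowercase(value):
    # Lowercase the whole string.
    return value.lower()

def strip_punctuation(value):
    # Crude stand-in for a NACO-like step (NOT real NACO): replace ASCII
    # punctuation with spaces, then collapse runs of whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(value.translate(table).split())

def chain_counts(subjects, steps):
    # Count values at each stage; every step runs on the output of the
    # previous step, so the unique count can only stay flat or shrink.
    counts = [("total", len(subjects))]
    values = set(subjects)
    counts.append(("unique", len(values)))
    for name, func in steps:
        values = {func(v) for v in values}
        counts.append((name, len(values)))
    return counts

# Hypothetical sample data, just to exercise the chain.
subjects = ["Maps", "maps", "Maps.", "Smith, John", "John Smith", "Maps"]
steps = [("lowercased", lowercase), ("punctuation stripped", strip_punctuation)]

counts = chain_counts(subjects, steps)
for name, count in counts:
    print(name, count)
```

Because each step only sees the values that survived the previous one, the order of the steps matters for the intermediate counts.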
The number of subjects after each normalization is presented in the first table below.
Hub Name | Total Subjects | Unique Subjects | Folded | Lowercase | NACO | Porter | Fingerprint |
ARTstor | 194,883 | 9,560 | 9,559 | 9,514 | 9,483 | 8,319 | 8,278 |
Biodiversity_Heritage_Library | 451,999 | 22,004 | 22,003 | 22,002 | 21,865 | 21,482 | 21,384 |
David_Rumsey | 22,976 | 123 | 123 | 122 | 121 | 121 | 121 |
Digital_Commonwealth | 295,778 | 41,704 | 41,694 | 41,419 | 40,998 | 40,095 | 39,950 |
Digital_Library_of_Georgia | 1,151,351 | 132,160 | 132,157 | 131,656 | 131,171 | 130,289 | 129,724 |
Harvard_Library | 26,641 | 9,257 | 9,251 | 9,248 | 9,236 | 9,229 | 9,059 |
HathiTrust | 2,608,567 | 685,733 | 682,188 | 676,739 | 671,203 | 667,025 | 653,973 |
Internet_Archive | 363,634 | 56,910 | 56,815 | 56,291 | 55,954 | 55,401 | 54,700 |
J_Paul_Getty_Trust | 32,949 | 2,777 | 2,774 | 2,760 | 2,741 | 2,710 | 2,640 |
Kentucky_Digital_Library | 26,008 | 1,972 | 1,972 | 1,959 | 1,900 | 1,898 | 1,892 |
Minnesota_Digital_Library | 202,456 | 24,472 | 24,470 | 23,834 | 23,680 | 22,453 | 22,282 |
Missouri_Hub | 97,111 | 6,893 | 6,893 | 6,850 | 6,792 | 6,724 | 6,696 |
Mountain_West_Digital_Library | 2,636,219 | 227,755 | 227,705 | 223,500 | 220,784 | 214,197 | 210,771 |
National_Archives_and_Records_Administration | 231,513 | 7,086 | 7,086 | 7,085 | 7,085 | 7,050 | 7,045 |
North_Carolina_Digital_Heritage_Center | 866,697 | 99,258 | 99,254 | 99,020 | 98,486 | 97,993 | 97,297 |
Smithsonian_Institution | 5,689,135 | 348,302 | 348,043 | 347,595 | 346,499 | 344,018 | 337,209 |
South_Carolina_Digital_Library | 231,267 | 23,842 | 23,838 | 23,656 | 23,291 | 23,101 | 22,993 |
The_New_York_Public_Library | 1,995,817 | 69,210 | 69,185 | 69,165 | 69,091 | 68,767 | 68,566 |
The_Portal_to_Texas_History | 5,255,588 | 104,566 | 104,526 | 103,208 | 102,195 | 98,591 | 97,589 |
United_States_Government_Printing_Office_(GPO) | 456,363 | 174,067 | 174,063 | 173,554 | 173,353 | 172,761 | 170,103 |
University_of_Illinois_at_Urbana-Champaign | 67,954 | 6,183 | 6,182 | 6,150 | 6,134 | 6,026 | 6,010 |
University_of_Southern_California_Libraries | 859,868 | 65,958 | 65,882 | 65,470 | 64,714 | 62,092 | 61,553 |
University_of_Virginia_Library | 93,378 | 3,736 | 3,736 | 3,672 | 3,660 | 3,625 | 3,618 |
Here is a table that shows the percentage reduction in unique subjects, relative to the un-normalized unique count, after each normalization in the chain has been applied. The percent reduction makes the numbers a little easier to interpret.
Hub Name | Folded Normalization | Lowercase Normalization | NACO Normalization | Porter Normalization | Fingerprint Normalization |
ARTstor | 0.0% | 0.5% | 0.8% | 13.0% | 13.4% |
Biodiversity_Heritage_Library | 0.0% | 0.0% | 0.6% | 2.4% | 2.8% |
David_Rumsey | 0.0% | 0.8% | 1.6% | 1.6% | 1.6% |
Digital_Commonwealth | 0.0% | 0.7% | 1.7% | 3.9% | 4.2% |
Digital_Library_of_Georgia | 0.0% | 0.4% | 0.7% | 1.4% | 1.8% |
Harvard_Library | 0.1% | 0.1% | 0.2% | 0.3% | 2.1% |
HathiTrust | 0.5% | 1.3% | 2.1% | 2.7% | 4.6% |
Internet_Archive | 0.2% | 1.1% | 1.7% | 2.7% | 3.9% |
J_Paul_Getty_Trust | 0.1% | 0.6% | 1.3% | 2.4% | 4.9% |
Kentucky_Digital_Library | 0.0% | 0.7% | 3.7% | 3.8% | 4.1% |
Minnesota_Digital_Library | 0.0% | 2.6% | 3.2% | 8.3% | 8.9% |
Missouri_Hub | 0.0% | 0.6% | 1.5% | 2.5% | 2.9% |
Mountain_West_Digital_Library | 0.0% | 1.9% | 3.1% | 6.0% | 7.5% |
National_Archives_and_Records_Administration | 0.0% | 0.0% | 0.0% | 0.5% | 0.6% |
North_Carolina_Digital_Heritage_Center | 0.0% | 0.2% | 0.8% | 1.3% | 2.0% |
Smithsonian_Institution | 0.1% | 0.2% | 0.5% | 1.2% | 3.2% |
South_Carolina_Digital_Library | 0.0% | 0.8% | 2.3% | 3.1% | 3.6% |
The_New_York_Public_Library | 0.0% | 0.1% | 0.2% | 0.6% | 0.9% |
The_Portal_to_Texas_History | 0.0% | 1.3% | 2.3% | 5.7% | 6.7% |
United_States_Government_Printing_Office_(GPO) | 0.0% | 0.3% | 0.4% | 0.8% | 2.3% |
University_of_Illinois_at_Urbana-Champaign | 0.0% | 0.5% | 0.8% | 2.5% | 2.8% |
University_of_Southern_California_Libraries | 0.1% | 0.7% | 1.9% | 5.9% | 6.7% |
University_of_Virginia_Library | 0.0% | 1.7% | 2.0% | 3.0% | 3.2% |
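As a quick check on how these percentages relate to the first table, here is a small sketch that recomputes the ARTstor row from its counts. The function name `percent_reduction` is just something I made up for the example, and it assumes the reduction is measured against the unique, un-normalized count, which is what the numbers in the tables are consistent with.

```python
def percent_reduction(unique_count, normalized_count):
    # Share of unique values that collapsed away, as a percentage
    # rounded to one decimal place.
    return round(100.0 * (unique_count - normalized_count) / unique_count, 1)

# ARTstor counts taken from the first table.
artstor_unique = 9560
artstor_counts = {
    "Folded": 9559,
    "Lowercase": 9514,
    "NACO": 9483,
    "Porter": 8319,
    "Fingerprint": 8278,
}

reductions = {
    name: percent_reduction(artstor_unique, count)
    for name, count in artstor_counts.items()
}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```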
Here is that data presented as a graph, which I think shows the data even better.
For many of the Hubs the biggest reductions happen when applying the Porter normalization and the Fingerprint normalization. One Hub of note is ARTstor, which had the highest percentage of reduction of all the Hubs. This was primarily caused by the Porter normalization, which means that a large percentage of its subjects stemmed to the same stem; often these are plural vs. singular versions of the same subject. This may be completely valid given how ARTstor chose to create its metadata, but it is still interesting.
Another Hub I found interesting in this data was Harvard, where the biggest reduction happened with the Fingerprint normalization. This might suggest that there are a number of values that contain the same words in a different order, for example names that occur in both inverted and non-inverted form.
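To make the inverted-name case concrete, here is a minimal sketch of a fingerprint-style key, loosely modeled on the fingerprint keying method in OpenRefine. It is a simplification and may differ from the implementation used for the numbers above (for instance, it does not fold accented characters to ASCII), and the sample names are only illustrative.

```python
import string

def fingerprint(value):
    # Simplified fingerprint key: trim, lowercase, drop ASCII punctuation,
    # split on whitespace, then sort and de-duplicate the tokens.
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

inverted = fingerprint("Twain, Mark")
direct = fingerprint("Mark Twain")
print(inverted == direct)
```

Because the tokens are sorted before being joined, word order no longer matters, which is why inverted and non-inverted forms of the same name end up with the same key.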
In the end I’m not sure how helpful this is as an indicator of quality within a field. Some fields would benefit from this sort of normalization more than others. For example, subject, creator, contributor, and publisher will normalize very differently than a field like title or description.
Let me know what you think via Twitter if you have questions or comments.