
Effects of subject normalization on DPLA Hubs

In the previous post I walked through some of the different ways that we could normalize a subject string and took a look at what effects these normalizations had on the subjects in the entire DPLA metadata dataset that I have been using.

In this post I wanted to continue along those lines and look at what happens when you apply these normalizations to the subjects in the dataset, but this time at the Hub level instead of across the whole dataset.

I applied the normalizations mentioned in the previous post to the subjects from each of the Hubs in the DPLA dataset.  This included total values, unique but un-normalized values, case folded, lowercased, NACO, Porter stemmed, and fingerprint.  I applied each normalization to the output of the previous one as a series; here is what the normalization chain looked like for each (a rough code sketch of the chain follows the list).

total
total > unique
total > unique > case folded
total > unique > case folded > lowercased
total > unique > case folded > lowercased > NACO
total > unique > case folded > lowercased > NACO > Porter
total > unique > case folded > lowercased > NACO > Porter > fingerprint
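To make the chain concrete, here is a rough Python sketch of how the series could be wired together. The step functions shown are stand-ins for the actual normalizations described below and in the previous post, not the exact code I ran.

def chain_counts(subjects, steps):
    # Apply each normalization to the unique output of the previous
    # step and record how many values remain after each one.
    counts = {"total": len(subjects)}
    values = set(subjects)  # total > unique
    counts["unique"] = len(values)
    for name, func in steps:
        values = {func(value) for value in values}
        counts[name] = len(values)
    return counts

# Stand-in steps; the real chain used case folding, lowercasing,
# NACO, Porter stemming, and fingerprinting.
steps = [("lowercased", str.lower)]
print(chain_counts(["Musical Instruments", "musical instruments"], steps))
# {'total': 2, 'unique': 2, 'lowercased': 1}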

The number of subjects after each normalization is presented in the first table below.

Hub Name Total Subjects Unique Subjects Folded Lowercase NACO Porter Fingerprint
ARTstor 194,883 9,560 9,559 9,514 9,483 8,319 8,278
Biodiversity_Heritage_Library 451,999 22,004 22,003 22,002 21,865 21,482 21,384
David_Rumsey 22,976 123 123 122 121 121 121
Digital_Commonwealth 295,778 41,704 41,694 41,419 40,998 40,095 39,950
Digital_Library_of_Georgia 1,151,351 132,160 132,157 131,656 131,171 130,289 129,724
Harvard_Library 26,641 9,257 9,251 9,248 9,236 9,229 9,059
HathiTrust 2,608,567 685,733 682,188 676,739 671,203 667,025 653,973
Internet_Archive 363,634 56,910 56,815 56,291 55,954 55,401 54,700
J_Paul_Getty_Trust 32,949 2,777 2,774 2,760 2,741 2,710 2,640
Kentucky_Digital_Library 26,008 1,972 1,972 1,959 1,900 1,898 1,892
Minnesota_Digital_Library 202,456 24,472 24,470 23,834 23,680 22,453 22,282
Missouri_Hub 97,111 6,893 6,893 6,850 6,792 6,724 6,696
Mountain_West_Digital_Library 2,636,219 227,755 227,705 223,500 220,784 214,197 210,771
National_Archives_and_Records_Administration 231,513 7,086 7,086 7,085 7,085 7,050 7,045
North_Carolina_Digital_Heritage_Center 866,697 99,258 99,254 99,020 98,486 97,993 97,297
Smithsonian_Institution 5,689,135 348,302 348,043 347,595 346,499 344,018 337,209
South_Carolina_Digital_Library 231,267 23,842 23,838 23,656 23,291 23,101 22,993
The_New_York_Public_Library 1,995,817 69,210 69,185 69,165 69,091 68,767 68,566
The_Portal_to_Texas_History 5,255,588 104,566 104,526 103,208 102,195 98,591 97,589
United_States_Government_Printing_Office_(GPO) 456,363 174,067 174,063 173,554 173,353 172,761 170,103
University_of_Illinois_at_Urbana-Champaign 67,954 6,183 6,182 6,150 6,134 6,026 6,010
University_of_Southern_California_Libraries 859,868 65,958 65,882 65,470 64,714 62,092 61,553
University_of_Virginia_Library 93,378 3,736 3,736 3,672 3,660 3,625 3,618

Here is a table that shows the percentage reduction after the subjects are normalized with each algorithm.  The percent reduction makes the effect a little easier to interpret.
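The percentages appear to be calculated against each Hub's unique (un-normalized) subject count rather than against the previous step in the chain; here is a quick sketch of that arithmetic, using ARTstor's Porter column from the first table as an example.

def percent_reduction(unique_count, normalized_count):
    # Reduction relative to the Hub's unique (un-normalized) subject count.
    return 100.0 * (unique_count - normalized_count) / unique_count

# ARTstor: 9,560 unique subjects, 8,319 after Porter stemming
print(round(percent_reduction(9560, 8319), 1))  # 13.0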

Hub Name Folded Normalization Lowercase Normalization NACO Normalization Porter Normalization Fingerprint Normalization
ARTstor 0.0% 0.5% 0.8% 13.0% 13.4%
Biodiversity_Heritage_Library 0.0% 0.0% 0.6% 2.4% 2.8%
David_Rumsey 0.0% 0.8% 1.6% 1.6% 1.6%
Digital_Commonwealth 0.0% 0.7% 1.7% 3.9% 4.2%
Digital_Library_of_Georgia 0.0% 0.4% 0.7% 1.4% 1.8%
Harvard_Library 0.1% 0.1% 0.2% 0.3% 2.1%
HathiTrust 0.5% 1.3% 2.1% 2.7% 4.6%
Internet_Archive 0.2% 1.1% 1.7% 2.7% 3.9%
J_Paul_Getty_Trust 0.1% 0.6% 1.3% 2.4% 4.9%
Kentucky_Digital_Library 0.0% 0.7% 3.7% 3.8% 4.1%
Minnesota_Digital_Library 0.0% 2.6% 3.2% 8.3% 8.9%
Missouri_Hub 0.0% 0.6% 1.5% 2.5% 2.9%
Mountain_West_Digital_Library 0.0% 1.9% 3.1% 6.0% 7.5%
National_Archives_and_Records_Administration 0.0% 0.0% 0.0% 0.5% 0.6%
North_Carolina_Digital_Heritage_Center 0.0% 0.2% 0.8% 1.3% 2.0%
Smithsonian_Institution 0.1% 0.2% 0.5% 1.2% 3.2%
South_Carolina_Digital_Library 0.0% 0.8% 2.3% 3.1% 3.6%
The_New_York_Public_Library 0.0% 0.1% 0.2% 0.6% 0.9%
The_Portal_to_Texas_History 0.0% 1.3% 2.3% 5.7% 6.7%
United_States_Government_Printing_Office_(GPO) 0.0% 0.3% 0.4% 0.8% 2.3%
University_of_Illinois_at_Urbana-Champaign 0.0% 0.5% 0.8% 2.5% 2.8%
University_of_Southern_California_Libraries 0.1% 0.7% 1.9% 5.9% 6.7%
University_of_Virginia_Library 0.0% 1.7% 2.0% 3.0% 3.2%

Here is that data presented as a graph, which I think shows the pattern even better.

Reduction Percent after Normalization

You can see that for many of the Hubs the biggest reduction happens when applying the Porter normalization and the Fingerprint normalization.  A Hub of note is ARTstor, which had the highest percentage of reduction of all the Hubs.  This was primarily caused by the Porter normalization, which means that a large percentage of its subjects stemmed to the same stem; often this is the plural and singular versions of the same subject.  This may be completely valid with how ARTstor chose to create its metadata, but it is still interesting.

Another Hub I found interesting was Harvard, where the biggest reduction happened with the Fingerprint normalization.  This might suggest that there are a number of values made up of the same words in a different order, for example names that occur in both inverted and non-inverted form.

In the end I’m not sure how helpful this is as an indicator of quality within a field. Some fields would benefit from this sort of normalization more than others; for example, subject, creator, contributor, and publisher will normalize very differently than a field like title or description.

Let me know what you think via Twitter if you have questions or comments.

Metadata normalization as an indicator of quality?

Metadata quality and its assessment are concepts that have been around for decades in the library community.  Recently they have been getting more interest as new aggregations of metadata become available in open and freely reusable ways, such as the Digital Public Library of America (DPLA) and Europeana.  Both of these groups make their metadata available so that others can remix and reuse the data in new ways.

I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields.  This post continues along those lines, trying to figure out which metrics we can calculate from the DPLA dataset that could be used to define “quality”.  Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.

This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.

One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry.  Often this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.

OpenRefine Cluster and Edit Screen Capture

OpenRefine has a number of different algorithms that can be used for this work, documented in its Clustering in Depth documentation.  Depending on one’s data, one approach may perform better than others for this kind of clustering.

Normalization

Case normalization is probably the easiest kind of normalization to understand.  If you have two strings, say “Mark” and “marK”, converting each of them to lowercase leaves you with a single value of “mark”. Many more complicated normalizations assume this as a starting point because it reduces the number of subjects without drastically transforming the original string values.

Case folding is another kind of transformation that is fairly common in the world of libraries.  This is the process of taking a string like “José” and converting it to “Jose”.  While this can introduce issues if a string is meant to have a diacritic and that diacritic makes the word or phrase different from the one without it, it can often help to normalize inconsistently notated versions of the same string.
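As an illustration of the general technique (a sketch, not necessarily the exact script I used), Python’s standard unicodedata module can decompose accented characters and drop the combining marks:

import unicodedata

def fold_diacritics(value):
    # Decompose accented characters and drop the combining marks,
    # e.g. "José" becomes "Jose".
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_diacritics("José"))  # Jose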

In addition to case folding and lowercasing, libraries have been normalizing data for a long time, and there have been efforts to formalize algorithms for normalizing subject strings so that those strings can be matched against each other.  Often referred to as the NACO normalization rules, these are formally the Authority File Comparison Rules.  I’ve always found this work intriguing and have a preference for the simplified algorithm developed at OCLC for their NACO Normalization Service.  In fact we’ve taken the sample Python implementation from there and created a stand-alone repository and project on GitHub called pynaco, so that we could add tests and then work to port it to Python 3 in the near future.
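The rules in pynaco are more involved than I can show here, but a loose approximation of the flavor of this kind of normalization (fold diacritics, lowercase, replace most punctuation with spaces, collapse whitespace) might look like the sketch below. This is my own simplification, not pynaco’s implementation.

import re
import unicodedata

def naco_like_normalize(value):
    # Loose approximation of a NACO-style normalization; the real
    # rules in pynaco and the OCLC service handle many more cases.
    decomposed = unicodedata.normalize("NFKD", value)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", folded.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

print(naco_like_normalize("Musical Instruments."))  # musical instruments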

Another common type of normalization performed on strings in library land is stemming. This is often done within search applications so that a search for any of the forms run, runs, or running returns documents that contain each of them.
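For instance, with the Porter implementation in NLTK (the library I use later in this post), the three forms collapse to a single stem:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running"]:
    print(word, "->", stemmer.stem(word))
# run -> run
# runs -> run
# running -> run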

What I’ve been playing around with is whether we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.

Here is an example.

If we have the following set of subjects:

 Musical Instruments
 Musical Instruments.
 Musical instrument
 Musical instruments
 Musical instruments,
 Musical instruments.

If you applied the simplified NACO normalization from pynaco you would end up with the following strings:

musical instruments
musical instruments
musical instrument
musical instruments
musical instruments
musical instruments

If you then applied the Porter stemming algorithm to the new set of subjects you would end up with the following:

music instrument
music instrument
music instrument
music instrument
music instrument
music instrument

So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by the Porter stemming algorithm.
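Here is a small sketch of that whole chain, using the rough NACO-style helper from above (again, an approximation rather than pynaco itself) followed by token-by-token Porter stemming with NLTK:

import re
from nltk.stem.porter import PorterStemmer

def naco_like_normalize(value):
    # Rough NACO-style cleanup (see the sketch above); not pynaco itself.
    cleaned = re.sub(r"[^\w\s]", " ", value.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

stemmer = PorterStemmer()

subjects = [
    "Musical Instruments",
    "Musical Instruments.",
    "Musical instrument",
    "Musical instruments",
    "Musical instruments,",
    "Musical instruments.",
]

normalized = {
    " ".join(stemmer.stem(token) for token in naco_like_normalize(s).split())
    for s in subjects
}
print(normalized)  # {'music instrument'}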

Experiment

In some past posts (here, here, here, and here) I discussed some aspects of the subject fields present in the Digital Public Library of America dataset.  I dusted that dataset off and extracted all of the subjects so that I could work with them by themselves.

I ended up with a set of text files, 23,858,236 lines in total, that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like.

d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais

Once I had the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.
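As a minimal sketch, assuming each line holds the identifier followed by the subject (separated by the first run of whitespace) and that the combined file is the subjects_all.txt used below, the unique subject strings can be pulled out like this:

unique_subjects = set()
with open("subjects_all.txt", encoding="utf-8") as handle:
    for line in handle:
        # Split off the identifier; the rest of the line is the subject.
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2:
            unique_subjects.add(parts[1])

print(len(unique_subjects))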

Total vs Unique

The first thing I did was to reduce the 23,858,236-line text file to only unique values.  I did this with the tried and true method of Unix sort and uniq.

sort subjects_all.txt | uniq > subjects_uniq.txt

After about eight minutes of waiting I ended up with a new text file subjects_uniq.txt that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.

Case folding

Using a Python script to perform case folding on each of the unique subjects, I was able to see that this causes a reduction in the number of unique subjects.

I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects.  That is a difference of 4,753 or a 0.25% reduction in the number of unique subjects.  So nothing huge.

Lowercase

The next normalization tested was lowercasing of the values.  I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.

By converting the subject strings to lowercase I reduced the number of unique case folded subjects from 1,867,129 to 1,849,682, which is 22,200 fewer (a 1.2% reduction) than the original 1,871,882 unique subjects.

NACO Normalization

Next we look at the simple NACO normalization from pynaco.  I applied this to the unique lower cased subjects from the previous step.

With the NACO normalization, I ended up with 1,826,523 unique subject strings from the 1,849,682 lowercased subjects I started with.  This is 45,359 fewer (a 2.4% reduction) than the original 1,871,882 unique subjects.

Porter stemming

Moving along, the next normalization I looked at for this work was applying the Porter stemming algorithm to the output of the NACO-normalized subjects from the previous step.  I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.

With the Porter stemmer applied, I ended up with 1,801,114 unique subject strings from the 1,826,523 NACO-normalized subjects I started with. This is 70,768 fewer (a 3.8% reduction) than the original 1,871,882 unique subjects.

Fingerprint

Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature.  This helps to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”.  I used the output of the previous Porter stemming step as the input for this normalization.
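OpenRefine’s documentation describes the fingerprint keying method, and it is easy to approximate; here is a sketch of my understanding of it (not the exact port I used):

import re
import unicodedata

def fingerprint(value):
    # Approximation of OpenRefine's fingerprint keying: lowercase,
    # strip punctuation, fold to ASCII, then sort the unique
    # whitespace-separated tokens and join them back together.
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return " ".join(sorted(set(value.split())))

print(fingerprint("phillips mark"))   # mark phillips
print(fingerprint("Mark Phillips"))   # mark phillips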

With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint-normalized subject strings. This is 105,393 fewer (a 5.6% reduction) than the original 1,871,882 unique subjects.

Overview

Normalization Reduction Unique Subjects Remaining Percent Reduction
Unique 0 1,871,882 0%
Case Folded 4,753 1,867,129 0.3%
Lowercase 22,200 1,849,682 1.2%
NACO 45,359 1,826,523 2.4%
Porter 70,768 1,801,114 3.8%
Fingerprint 105,393 1,766,489 5.6%

Conclusion

I think it might be interesting to apply this analysis to the various Hubs in the DPLA dataset to see whether there are differences across the various types of content providers.

I’m also curious whether there are other kinds of normalizations that would be logical to apply to the subjects that I’m blanking on.  One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts wherever the double hyphen “--” occurs in the string.  I wrote about the effect of that on the subjects in the DPLA dataset in a previous post.
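That split itself is simple; here is a sketch, assuming the subdivisions are delimited by a literal double hyphen:

def split_lcsh(subject):
    # Split an LCSH-style heading into its subdivisions on "--".
    return [part.strip() for part in subject.split("--") if part.strip()]

print(split_lcsh("Public works--Illinois--Chicago"))
# ['Public works', 'Illinois', 'Chicago']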

As always feel free to contact me via Twitter if you have questions or comments.