DPLA Metadata Analysis: Part 4 – Normalized Subjects

This is the fourth post in the DPLA Metadata Analysis series, following parts one, two, and three.

This post looks at how basic normalization of subjects affects the metrics discussed in the previous posts.

Background

One of the things that happens in library land is that subject headings are often constructed by connecting several broader pieces into a single, more specific subject string. For example, the heading “Children--Texas.” is constructed from two pieces, “Children” and “Texas”. A record about children in Oklahoma could be represented as “Children--Oklahoma.”.

The analysis I did earlier took each subject exactly as it occurred in the dataset. I was asked what would happen if we normalized the subjects before analyzing them, effectively turning the single string “Children--Texas.” into the two subject pieces “Children” and “Texas”, and then applied the previous analysis to the new data. The normalization consists of stripping trailing periods and then splitting on double hyphens.
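A minimal sketch of that normalization in Python, assuming the separator is the literal two-hyphen string `--` and that only trailing periods are removed. The function name `normalize_subject` is hypothetical and is not taken from the code used for the analysis.

```python
def normalize_subject(subject):
    """Split one subject string into its component pieces."""
    # Strip surrounding whitespace, then any trailing periods.
    cleaned = subject.strip().rstrip(".")
    # Split on the double hyphen and tidy each piece.
    pieces = [piece.strip() for piece in cleaned.split("--")]
    # Drop empty pieces that a stray leading or trailing "--" would produce.
    return [piece for piece in pieces if piece]


print(normalize_subject("Children--Texas."))  # ['Children', 'Texas']
```

One side effect of this simple approach is that a heading ending in an abbreviation (for example one that ends in "Dept.") also loses its final period.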

Note: Because this conversion can introduce quite a bit of duplication among the subjects within a record, I am making the normalized subjects unique within each record before adding them to the index. I also apply the same deduplication to the un-normalized subjects. In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because there were 347 values that were in the subject list more than once. Because of this, the average and total subject counts in the resulting tables should be slightly lower than those in the first three posts.
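To illustrate why the deduplication matters, here is a small sketch of making subjects unique within one record, both before and after splitting. The helper names and the sample record are made up for illustration, and the splitting rule is the same assumed one described above (trailing periods stripped, then split on `--`).

```python
def normalize_subject(subject):
    """Strip trailing periods and split on the double hyphen."""
    cleaned = subject.strip().rstrip(".")
    return [p.strip() for p in cleaned.split("--") if p.strip()]


def unique_normalized(subjects):
    """Normalize every subject in a record and keep each piece once."""
    pieces = []
    for subject in subjects:
        pieces.extend(normalize_subject(subject))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(pieces))


# Hypothetical record with one repeated heading.
record = ["Children--Texas.", "Schools--Texas.", "Children--Texas."]

unique_raw = list(dict.fromkeys(record))
print(unique_raw)                 # ['Children--Texas.', 'Schools--Texas.']
print(unique_normalized(record))  # ['Children', 'Texas', 'Schools']
```

Without the deduplication step, “Texas” would be counted three times for this one record.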

Predictions

My predictions before the analysis are that we will see an increase in the number of unique subjects, a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.

Results

Normalizing the subjects reduced the number of unique subject headings from 1,871,884 to 1,162,491, a reduction of 38%.

In addition to this 38% reduction, the distribution of subjects across the Hubs changed significantly. In one case (subjects shared by 21 Hubs) the count increased by 443%. The table below displays these numbers before and after normalization, as well as the percentage change.

# of Hubs with Subject	# of Subjects	# of Normalized Subjects	% Change
1	1,717,512	1,055,561	-39%
2	114,047	60,981	-47%
3	21,126	20,172	-5%
4	8,013	9,483	18%
5	3,905	5,130	31%
6	2,187	3,094	41%
7	1,330	2,024	52%
8	970	1,481	53%
9	689	1,080	57%
10	494	765	55%
11	405	571	41%
12	302	453	50%
13	245	413	69%
14	199	340	71%
15	152	261	72%
16	117	205	75%
17	63	152	141%
18	62	130	110%
19	32	77	141%
20	20	55	175%
21	7	38	443%
22	7	23	229%
23	0	2	N/A

The two subjects that are shared across 23 of the Hubs once normalized are “Education” and “United States”.

The high-level stats for all 8,012,390 records are available in the following table.
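The sharing counts in the table above can be produced by recording which Hubs use each subject and then counting subjects by the size of that set. The sketch below shows one way to do it, assuming the data is available as (hub, subjects) pairs. The function names and the three sample records are hypothetical.

```python
from collections import Counter, defaultdict


def hubs_per_subject(records):
    """Map each subject to the set of hubs that use it."""
    hubs = defaultdict(set)
    for hub, subjects in records:
        for subject in subjects:
            hubs[subject].add(hub)
    return hubs


def sharing_distribution(records):
    """Count how many subjects occur in exactly 1, 2, 3, ... hubs."""
    return Counter(len(hub_set) for hub_set in hubs_per_subject(records).values())


# Made-up sample data: (hub name, subjects on one record).
records = [
    ("Hub A", ["Education", "Texas"]),
    ("Hub B", ["Education", "Oklahoma"]),
    ("Hub C", ["Education", "Texas"]),
]

dist = sharing_distribution(records)
print(sorted(dist.items()))  # [(1, 1), (2, 1), (3, 1)]
```

In this toy data “Oklahoma” occurs in one hub, “Texas” in two, and “Education” in all three.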

Records	Total Subject Strings Count	Total Normalized Subject String Count	Average Subjects Per Record	Average Normalized Subjects Per Record	Percent Change
8,012,390	23,860,080	28,644,188	2.98	3.57	20.05%

You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.
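The averages and the percent change in the table follow directly from the three totals. As a quick check, using only the figures reported above (the helper name `percent_change` is my own):

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


records = 8_012_390
total_subjects = 23_860_080
total_normalized = 28_644_188

avg_before = round(total_subjects / records, 2)
avg_after = round(total_normalized / records, 2)
change = round(percent_change(total_subjects, total_normalized), 2)

print(avg_before)  # 2.98
print(avg_after)   # 3.57
print(change)      # 20.05
```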

Results by Hub

The table below presents data for each Hub in the DPLA. The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and the percent change after normalization.

Hub	Records	Total Subject String Count	Total Normalized Subject String Count	Average Subjects Per Record	Average Normalized Subjects Per Record	Percent Change
ARTstor	56,342	194,883	202,220	3.46	3.59	3.76
Biodiversity Heritage Library	138,288	453,843	452,007	3.28	3.27	-0.40
David Rumsey	48,132	22,976	22,976	0.48	0.48	0
Digital Commonwealth	124,804	295,778	336,935	2.37	2.7	13.91
Digital Library of Georgia	259,640	1,151,351	1,783,884	4.43	6.87	54.94
Harvard Library	10,568	26,641	36,511	2.52	3.45	37.05
HathiTrust	1,915,159	2,608,567	4,154,244	1.36	2.17	59.25
Internet Archive	208,953	363,634	412,640	1.74	1.97	13.48
J. Paul Getty Trust	92,681	32,949	43,590	0.36	0.47	32.30
Kentucky Digital Library	127,755	26,008	27,561	0.2	0.22	5.97
Minnesota Digital Library	40,533	202,456	211,539	4.99	5.22	4.49
Missouri Hub	41,557	97,111	117,933	2.34	2.84	21.44
Mountain West Digital Library	867,538	2,636,219	3,552,268	3.04	4.09	34.75
National Archives and Records Administration	700,952	231,513	231,513	0.33	0.33	0
North Carolina Digital Heritage Center	260,709	866,697	1,207,488	3.32	4.63	39.32
Smithsonian Institution	897,196	5,689,135	5,686,107	6.34	6.34	-0.05
South Carolina Digital Library	76,001	231,267	355,504	3.04	4.68	53.72
The New York Public Library	1,169,576	1,995,817	2,515,252	1.71	2.15	26.03
The Portal to Texas History	477,639	5,255,588	5,410,963	11	11.33	2.96
United States Government Printing Office (GPO)	148,715	456,363	768,830	3.07	5.17	68.47
University of Illinois at Urbana-Champaign	18,103	67,954	85,263	3.75	4.71	25.47
University of Southern California. Libraries	301,325	859,868	905,465	2.85	3	5.30
University of Virginia Library	30,188	93,378	123,405	3.09	4.09	32.16

The number of unique subjects before and after subject normalization is presented in the table below. The percent of change is also included in the final column.

Hub	Unique Subjects	Unique Normalized Subjects	% Change Unique
ARTstor	9,560	9,546	-0.15
Biodiversity Heritage Library	22,004	22,005	0
David Rumsey	123	123	0
Digital Commonwealth	41,704	39,557	-5.15
Digital Library of Georgia	132,160	88,200	-33.26
Harvard Library	9,257	6,210	-32.92
HathiTrust	685,733	272,340	-60.28
Internet Archive	56,911	49,117	-13.70
J. Paul Getty Trust	2,777	2,560	-7.81
Kentucky Digital Library	1,972	1,831	-7.15
Minnesota Digital Library	24,472	24,325	-0.60
Missouri Hub	6,893	6,757	-1.97
Mountain West Digital Library	227,755	172,663	-24.19
National Archives and Records Administration	7,086	7,086	0
North Carolina Digital Heritage Center	99,258	79,353	-20.05
Smithsonian Institution	348,302	346,096	-0.63
South Carolina Digital Library	23,842	17,516	-26.53
The New York Public Library	69,210	36,709	-46.96
The Portal to Texas History	104,566	97,441	-6.81
United States Government Printing Office (GPO)	174,067	48,537	-72.12
University of Illinois at Urbana-Champaign	6,183	5,724	-7.42
University of Southern California. Libraries	65,958	64,021	-2.94
University of Virginia Library	3,736	3,664	-1.93

The table below presents, for each Hub, the number and percentage of its unique subjects and unique normalized subjects that occur in that Hub only. The final column is the relative change between the two percentage columns.

Hub	Subjects Unique to Hub	Normalized Subjects Unique to Hub	% Subjects Unique to Hub	% Normalized Subjects Unique to Hub	% Change
ARTstor	4,941	4,806	52	50	-4
Biodiversity Heritage Library	9,136	6,929	42	31	-26
David Rumsey	30	28	24	23	-4
Digital Commonwealth	31,094	27,712	75	70	-7
Digital Library of Georgia	114,689	67,768	87	77	-11
Harvard Library	7,204	3,238	78	52	-33
HathiTrust	570,292	200,652	83	74	-11
Internet Archive	28,978	23,387	51	48	-6
J. Paul Getty Trust	1,852	1,337	67	52	-22
Kentucky Digital Library	1,337	1,111	68	61	-10
Minnesota Digital Library	17,545	17,145	72	70	-3
Missouri Hub	4,338	3,783	63	56	-11
Mountain West Digital Library	192,501	134,870	85	78	-8
National Archives and Records Administration	3,589	3,399	51	48	-6
North Carolina Digital Heritage Center	84,203	62,406	85	79	-7
Smithsonian Institution	325,878	322,945	94	93	-1
South Carolina Digital Library	18,110	9,767	76	56	-26
The New York Public Library	52,002	18,075	75	49	-35
The Portal to Texas History	87,076	78,153	83	80	-4
United States Government Printing Office (GPO)	105,389	15,702	61	32	-48
University of Illinois at Urbana-Champaign	3,076	2,322	50	41	-18
University of Southern California. Libraries	51,822	48,889	79	76	-4
University of Virginia Library	2,425	1,134	65	31	-52
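A sketch of how the "unique to Hub" figures could be computed, assuming each Hub's unique subjects are held in a set. The names and the two sample Hubs are hypothetical. The last lines check the % Change column against the table's rounded percentages for the Biodiversity Heritage Library (42 to 31) and the University of Virginia Library (65 to 31).

```python
from collections import Counter


def unique_to_hub(hub_subjects):
    """For each hub, return (count, percent) of its subjects found in no other hub."""
    # Count how many hubs each subject appears in.
    hub_count = Counter(s for subjects in hub_subjects.values() for s in subjects)
    result = {}
    for hub, subjects in hub_subjects.items():
        only_here = {s for s in subjects if hub_count[s] == 1}
        percent = round(100 * len(only_here) / len(subjects)) if subjects else 0
        result[hub] = (len(only_here), percent)
    return result


def relative_change(before_pct, after_pct):
    """Relative change between two percentages, rounded to a whole percent."""
    return round((after_pct - before_pct) / before_pct * 100)


# Made-up sample data: hub name -> set of unique subjects.
hub_subjects = {
    "Hub A": {"Education", "Texas", "Cattle"},
    "Hub B": {"Education", "Oklahoma"},
}

result = unique_to_hub(hub_subjects)
print(result)                   # {'Hub A': (2, 67), 'Hub B': (1, 50)}
print(relative_change(42, 31))  # -26
print(relative_change(65, 31))  # -52
```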

Conclusion

Overall, there was an increase (20%) in the total occurrences of subject strings in the dataset when subject normalization was applied. The total number of unique subjects decreased significantly (38%) after normalization. It is easy to identify Hubs that are heavy users of LCSH for their subjects because the percent change in their number of unique subjects before and after normalization is quite high. Examples include HathiTrust (-60%) and the Government Printing Office (-72%). For many of the Hubs, normalization significantly reduced the number and percentage of subjects that were unique to that Hub.

I hope you found this post interesting. If you want to chat about the topic, hit me up on Twitter.