Category Archives: thinking outloud

Punctuation in DPLA subject strings

For the past few weeks I’ve been curious about the punctuation characters used in the subject strings of the DPLA dataset that I’ve been working with for blog posts over the past few months.

This post is an attempt to find out the range of punctuation characters used in these subject strings and is carried over from last week’s post related to subject string metrics.

What got me started was that in the analysis for last week’s post, I noticed a number of instances of em dashes “—” (528 instances) and en dashes “–” (822 instances) being used in place of double hyphens “--” in subject strings from The Portal to Texas History. These were most likely copied from some other source. Here is a great subject string that contains all three characters listed above.

Real Property — Texas –- Zavala County — Maps

It turns out this isn’t just something that happened in the Portal data; here is an example from the Mountain West Digital Library.

Highway planning--Environmental aspects–Arizona—Periodicals

To get the analysis started, the first thing I needed to do was establish what I’m considering punctuation characters, because that definition can change depending on who you are talking to and what language you are using. For this analysis I’m using the punctuation listed in the Python string module.

>>> import string
>>> print string.punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

So this gives us 32 characters that I’m considering to be punctuation characters for the analysis in this post.

The first thing I wanted to do was to get an idea of which of the 32 characters were present in the subject strings, and how many instances there were.  In the dataset I’m using there are 1,871,877 unique subject strings.  Of those subject strings 1,496,769 or 80% have one or more punctuation characters present.  

Here is the breakdown of the number of subjects that have a specific character present. One thing to note is that during processing, repeated instances of a character within a subject were reduced to a single instance; this doesn’t affect the analysis, it is just something to note.
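
Here is a minimal sketch of how a tally like this could be produced; the file name subjects_uniq.txt and this exact counting approach are my assumptions rather than the actual script used for this post.

# Sketch: tally how many unique subject strings contain each punctuation
# character; repeated characters within one subject count only once.
import string
from collections import Counter

counts = Counter()
with open("subjects_uniq.txt") as subjects:  # hypothetical file of unique subjects
    for line in subjects:
        subject = line.rstrip("\n")
        # set() collapses repeated characters to a single instance per subject
        for character in set(subject) & set(string.punctuation):
            counts[character] += 1

for character, total in sorted(counts.items()):
    print character, total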

Character Subjects with Character
! 72
" 1,066
# 432
$ 57
% 16
& 33,825
' 22,671
( 238,252
) 238,068
* 451
+ 81
, 607,849
- 954,992
. 327,404
/ 3,217
: 10,774
; 5,166
< 1,028
= 1,027
> 1,027
? 7,005
@ 53
[ 9,872
] 9,893
\ 32
^ 1
_ 80
` 99
{ 9
| 72
} 9
~ 4

One thing that I found interesting is that the characters () and [] have different numbers of instances, suggesting that there are unbalanced brackets and parentheses in some subjects.

Another interesting note is that there are 72 instances of subjects that use the pipe character “|”. The pipe is often used by programmers and developers as a delimiter because it “is rarely used in the data values.” This analysis shows that while it is indeed rarely used, it should be kept in mind that it is sometimes used.

Next up was to look at how punctuation was distributed across the various Hubs.

In the table below I’ve pulled out the total number of unique subjects per Hub in the DPLA dataset.  I show the number of subjects without punctuation and the number of subjects with some sort of punctuation and finally display the percentage of subjects with punctuation.

Hub Name Unique Subjects Subjects without Punctuation Subjects with Punctuation Percent with Punctuation
ARTstor 9,560 6,093 3,467 36.3%
Biodiversity_Heritage_Library 22,004 14,936 7,068 32.1%
David_Rumsey 123 106 17 13.8%
Harvard_Library 9,257 553 8,704 94.0%
HathiTrust 685,733 56,950 628,783 91.7%
Internet_Archive 56,910 17,909 39,001 68.5%
J._Paul_Getty_Trust 2,777 375 2,402 86.5%
National_Archives_and_Records_Administration 7,086 2,150 4,936 69.7%
Smithsonian_Institution 348,302 152,850 195,452 56.1%
The_New_York_Public_Library 69,210 9,202 60,008 86.7%
United_States_Government_Printing_Office_(GPO) 174,067 14,525 159,542 91.7%
University_of_Illinois_at_Urbana-Champaign 6,183 2,132 4,051 65.5%
University_of_Southern_California._Libraries 65,958 37,237 28,721 43.5%
University_of_Virginia_Library 3,736 1,099 2,637 70.6%
Digital_Commonwealth 41,704 8,381 33,323 79.9%
Digital_Library_of_Georgia 132,160 9,876 122,284 92.5%
Kentucky_Digital_Library 1,972 579 1,393 70.6%
Minnesota_Digital_Library 24,472 16,555 7,917 32.4%
Missouri_Hub 6,893 2,410 4,483 65.0%
Mountain_West_Digital_Library 227,755 84,452 143,303 62.9%
North_Carolina_Digital_Heritage_Center 99,258 9,253 90,005 90.7%
South_Carolina_Digital_Library 23,842 4,002 19,840 83.2%
The_Portal_to_Texas_History 104,566 40,310 64,256 61.5%

To make it a little easier to see, I made a graph of this same data and divided it into two groups: on the left are the Content-Hubs and on the right are the Service-Hubs.

Percent of Subjects with Punctuation

Just looking at things, I don’t see a huge difference between the two groups in the percentage of subjects with punctuation.

Next I wanted to see, out of the 32 characters that I’m considering in this post, how many are present in a given Hub’s subjects. That data is in the table and graph below.

Hub Name Characters Present
ARTstor 19
Biodiversity_Heritage_Library 20
David_Rumsey 7
Digital_Commonwealth 21
Digital_Library_of_Georgia 22
Harvard_Library 12
HathiTrust 28
Internet_Archive 26
J._Paul_Getty_Trust 11
Kentucky_Digital_Library 11
Minnesota_Digital_Library 16
Missouri_Hub 14
Mountain_West_Digital_Library 30
National_Archives_and_Records_Administration 10
North_Carolina_Digital_Heritage_Center 23
Smithsonian_Institution 26
South_Carolina_Digital_Library 16
The_New_York_Public_Library 18
The_Portal_to_Texas_History 22
United_States_Government_Printing_Office_(GPO) 17
University_of_Illinois_at_Urbana-Champaign 12
University_of_Southern_California._Libraries 25
University_of_Virginia_Library 13

Here is this data in a graph grouped into Content-Hubs and Service-Hubs.

Unique Punctuation Characters Present

Mountain West Digital Library had the most characters covered, with 30 of the 32 possible punctuation characters. On the low end was the David Rumsey collection, with only 7 characters represented in the subject data.

The final thing is to see the usage of all the characters divided by Hub, which the following graphic presents. I tried to do a little coloring of the table to make it a bit easier to read; I don’t know how well I accomplished that.

Punctuation Character Usage

So it looks like the following characters ‘(),-. are present in all of the hubs.  The characters %/?: are present in almost all of the hubs (missing one hub each).

The least used character is the ^ which is only in use by one hub in one record.  The characters ~ and @ are only used in two hubs each.

I’ve found this quick look at the punctuation usage in subjects pretty interesting so far. I know that I unearthed some anomalies in the Portal dataset with this work that we now have on the board to fix; they aren’t huge issues, but they are things that would probably stick around for quite some time in a set of records without specific identification.

For me the next step is to see if there is a way to identify punctuation characters that are used incorrectly and be able to flag those fields and records in some way to report back to metadata creators.

Let me know what you think via Twitter if you have questions or comments.

 

Characteristics of subjects in the DPLA

There are still a few things that I have been wanting to do with the subject data from the DPLA dataset that I’ve been working with for the past few months.

This time I wanted to take a look at some of the characteristics of the subject strings themselves and see if there is any information there that is helpful or useful as an indicator of quality for the metadata record associated with that subject.

I took a look at the following metrics for each subject string: length, percentage integer, number of tokens, length of anagram, anagram complexity, and number of non-alphanumeric characters (punctuation).

In the tables below I present a few of the more interesting selections from the data.

Subject Length

This is calculated by stripping whitespace from the ends of each subject, and then counting the number of characters that are left in the string.

Hub Unique Subjects Minimum Length Median Length Maximum Length Average Length stddev
ARTstor 9,560 3 12.0 201 16.6 14.4
Biodiversity_Heritage_Library 22,004 3 10.5 478 16.4 10.0
David_Rumsey 123 3 18.0 30 11.3 5.2
Digital_Commonwealth 41,704 3 17.5 3490 19.6 26.7
Digital_Library_of_Georgia 132,160 3 18.5 169 27.1 14.1
Harvard_Library 9,257 3 17.0 110 30.2 12.6
HathiTrust 685,733 3 31.0 728 36.8 16.6
Internet_Archive 56,910 3 152.0 1714 38.1 48.4
J._Paul_Getty_Trust 2,777 4 65.0 99 31.6 15.5
Kentucky_Digital_Library 1,972 3 31.5 129 33.9 18.0
Minnesota_Digital_Library 24,472 3 19.5 199 17.4 10.2
Missouri_Hub 6,893 3 182.0 525 30.3 40.4
Mountain_West_Digital_Library 227,755 3 12.0 3148 27.2 25.1
National_Archives_and_Records_Administration 7,086 3 19.0 166 22.7 17.9
North_Carolina_Digital_Heritage_Center 99,258 3 9.5 3192 25.6 20.2
Smithsonian_Institution 348,302 3 14.0 182 24.2 11.9
South_Carolina_Digital_Library 23,842 3 26.5 1182 35.7 25.9
The_New_York_Public_Library 69,210 3 29.0 119 29.4 13.5
The_Portal_to_Texas_History 104,566 3 16.0 152 17.7 9.7
United_States_Government_Printing_Office_(GPO) 174,067 3 39.0 249 43.5 18.1
University_of_Illinois_at_Urbana-Champaign 6,183 3 23.0 141 23.2 14.3
University_of_Southern_California._Libraries 65,958 3 13.5 211 18.4 10.7
University_of_Virginia_Library 3,736 3 40.5 102 31.0 17.7

My takeaway from this is that three characters is just about the shortest subject that one is likely to include; it’s not an absolute rule, but that is the low end for this data.

The average length ranges from 11.3 average characters for the David Rumsey hub to 43.5 characters on average for the United States Government Printing Office (GPO).

Put into a graph, you can see the average subject length across the Hubs a bit more easily.

Average Subject Length

The length of a field can be helpful for finding values that are a bit outside of the norm. For example, you can see that there are five Hubs that have maximum subject lengths of over 1,000 characters. In a quick investigation, these values appear to be abstracts and content descriptions accidentally coded as subjects.

Maximum Subject Length

For The Portal to Texas History, which had a few subjects that came in at up to 152 characters long, it turns out that these are incorrectly formatted subject fields where a user included a number of subjects in one field instead of separating them out into multiple fields.

Percent Integer

For this metric I stripped whitespace characters, and then divided the number of digit characters by the number of total characters in the string to come up with the percentage integer.
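
A minimal sketch of this calculation might look like the following; the function name is mine and the exact whitespace handling is an assumption.

def percent_integer(subject):
    # Strip surrounding whitespace, then divide the number of digit
    # characters by the total number of characters left in the string.
    stripped = subject.strip()
    if not stripped:
        return 0.0
    digits = sum(1 for c in stripped if c.isdigit())
    return 100.0 * digits / len(stripped)

print percent_integer("World War, 1939-1945")  # 8 digits out of 20 characters = 40.0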

Hub Unique Subjects Maximum % Integer Average % Integer stddev
ARTstor 9,560 61.5 1.3 5.2
Biodiversity_Heritage_Library 22,004 92.3 2.2 11.1
David_Rumsey 123 36.4 0.5 4.2
Digital_Commonwealth 41,704 66.7 1.6 6.0
Digital_Library_of_Georgia 132,160 87.5 1.7 6.2
Harvard_Library 9,257 44.4 4.6 9.0
HathiTrust 685,733 100.0 3.5 8.4
Internet_Archive 56,910 100.0 4.1 9.4
J._Paul_Getty_Trust 2,777 50.0 3.6 8.0
Kentucky_Digital_Library 1,972 63.6 5.7 9.9
Minnesota_Digital_Library 24,472 80.0 1.1 5.1
Missouri_Hub 6,893 50.0 2.9 7.5
Mountain_West_Digital_Library 227,755 100.0 1.1 5.5
National_Archives_and_Records_Administration 7,086 42.1 4.7 9.4
North_Carolina_Digital_Heritage_Center 99,258 100.0 1.5 5.9
Smithsonian_Institution 348,302 100.0 1.1 3.6
South_Carolina_Digital_Library 23,842 57.1 2.3 6.5
The_New_York_Public_Library 69,210 100.0 12.0 13.5
The_Portal_to_Texas_History 104,566 100.0 0.4 3.7
United_States_Government_Printing_Office_(GPO) 174,067 80.0 0.4 2.4
University_of_Illinois_at_Urbana-Champaign 6,183 50.0 6.1 10.9
University_of_Southern_California._Libraries 65,958 100.0 1.3 6.4
University_of_Virginia_Library 3,736 72.7 1.8 6.8
Average Percent Integer

If you group these into the Content-Hub and Service-Hub categories you can see things a little better.

Percent Integer Grouped by Hub Type

It appears that the Content-Hubs on the left trend a bit higher than the Service-Hubs on the right. This probably has to do with the use of dates in subject strings, a common practice in bibliographic-catalog-based metadata that isn’t always followed in the metadata created for the more heterogeneous collections of content that we see in the Service-Hubs.

Tokens

For the tokens metric I replaced each punctuation character instance with a single space character and then used the NLTK word_tokenize function to return a list of tokens. I then took the length of that resulting list as the metric.
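
Here is a rough sketch of that token count; it assumes NLTK (and its punkt tokenizer models) is installed, and the punctuation replacement is my approximation of what was described.

import string
from nltk import word_tokenize

def token_count(subject):
    # Replace each punctuation character with a space, then count the
    # tokens that word_tokenize returns for the cleaned string.
    cleaned = "".join(" " if c in string.punctuation else c for c in subject)
    return len(word_tokenize(cleaned))

print token_count("Highway planning--Environmental aspects--Arizona--Periodicals")  # 6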

Hub Unique Subjects Maximum Tokens Average Tokens stddev
ARTstor 9,560 31 2.36 2.12
Biodiversity_Heritage_Library 22,004 66 2.29 1.46
David_Rumsey 123 5 1.63 0.94
Digital_Commonwealth 41,704 469 2.78 3.70
Digital_Library_of_Georgia 132,160 23 3.70 1.72
Harvard_Library 9,257 17 4.07 1.77
HathiTrust 685,733 107 4.75 2.31
Internet_Archive 56,910 244 5.06 6.21
J._Paul_Getty_Trust 2,777 15 4.11 2.14
Kentucky_Digital_Library 1,972 20 4.65 2.50
Minnesota_Digital_Library 24,472 25 2.66 1.54
Missouri_Hub 6,893 68 4.30 5.41
Mountain_West_Digital_Library 227,755 549 3.64 3.51
National_Archives_and_Records_Administration 7,086 26 3.48 2.93
North_Carolina_Digital_Heritage_Center 99,258 493 3.75 2.64
Smithsonian_Institution 348,302 25 3.29 1.56
South_Carolina_Digital_Library 23,842 180 4.87 3.45
The_New_York_Public_Library 69,210 20 4.28 2.14
The_Portal_to_Texas_History 104,566 23 2.69 1.36
United_States_Government_Printing_Office_(GPO) 174,067 41 5.31 2.28
University_of_Illinois_at_Urbana-Champaign 6,183 26 3.35 2.11
University_of_Southern_California._Libraries 65,958 36 2.66 1.51
University_of_Virginia_Library 3,736 15 4.62 2.84
Average number of tokens

Token counts end up being very similar to the overall character length of a subject. If I were to do more processing I would probably divide the length by the number of tokens to get an average word length for the tokens in the subjects. That might be interesting.

Anagram

I’ve always found anagrams of values in metadata to be interesting, sometimes helpful and sometimes completely useless. For this value I folded the subject string to convert letters with diacritics to their ASCII versions and then created an anagram of the resulting letters. I used the length of this anagram for the metric.
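
The post doesn’t show the code for this metric, but given the description and the maximum value of 26 seen below, a sketch like the following seems close: fold diacritics to ASCII, then count the distinct letters that remain. Treat this as an assumption about the actual implementation.

import unicodedata

def anagram_length(subject):
    # Fold diacritics to their ASCII base letters, then count the
    # distinct letters that remain in the subject string.
    folded = unicodedata.normalize("NFKD", subject).encode("ascii", "ignore")
    letters = set(c.lower() for c in folded if c.isalpha())
    return len(letters)

print anagram_length(u"Musical instruments")  # 11 distinct letters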

Hub Unique Subjects Min Anagram Length Median Anagram Length Max Anagram Length Avg Anagram Length stddev
ARTstor 9,560 2 8 23 8.93 3.63
Biodiversity_Heritage_Library 22,004 0 7.5 23 9.33 3.26
David_Rumsey 123 3 12 13 7.93 2.28
Digital_Commonwealth 41,704 0 9 26 9.97 3.01
Digital_Library_of_Georgia 132,160 0 9.5 23 11.74 3.18
Harvard_Library 9,257 3 11 21 12.51 2.92
HathiTrust 685,733 0 14 25 13.56 2.98
Internet_Archive 56,910 0 22 26 12.41 3.96
J._Paul_Getty_Trust 2,777 3 19 21 13.02 3.60
Kentucky_Digital_Library 1,972 2 14.5 22 13.02 3.28
Minnesota_Digital_Library 24,472 0 12 22 9.76 3.00
Missouri_Hub 6,893 0 22 25 11.09 4.06
Mountain_West_Digital_Library 227,755 0 7 26 11.85 3.54
National_Archives_and_Records_Administration 7,086 3 11 22 10.01 3.09
North_Carolina_Digital_Heritage_Center 99,258 0 6 26 11.00 3.54
Smithsonian_Institution 348,302 0 8 23 11.53 3.42
South_Carolina_Digital_Library 23,842 1 12 26 13.08 3.67
The_New_York_Public_Library 69,210 0 10 24 11.45 3.17
The_Portal_to_Texas_History 104,566 0 10.5 23 9.78 2.98
United_States_Government_Printing_Office_(GPO) 174,067 0 14 24 14.56 2.80
University_of_Illinois_at_Urbana-Champaign 6,183 3 7 21 10.42 3.46
University_of_Southern_California._Libraries 65,958 0 9 23 9.81 3.20
University_of_Virginia_Library 3,736 0 9 22 12.76 4.31
Average anagram length

I find it interesting that there are subjects in several of the Hubs (Digital Commonwealth, Internet Archive, Mountain West Digital Library, North Carolina Digital Heritage Center, and South Carolina Digital Library) that have a single subject instance containing all 26 letters. That’s just neat. I didn’t look to see if these are the same subject instances that were themselves 3,000+ characters long.

Punctuation

It can be interesting to see what punctuation was used in a field, so I extracted all non-alphanumeric characters from each string, which left me with the punctuation characters. I took the number of unique punctuation characters as this metric.
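
A sketch of that count, reusing the punctuation definition from the earlier post (my approximation, not the original script):

import string

def unique_punctuation_count(subject):
    # Count the distinct punctuation characters present in the subject.
    return len(set(subject) & set(string.punctuation))

print unique_punctuation_count("Art, Municipal--Illinois--Chicago")  # "," and "-" -> 2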

Hub Name Unique Subjects min median max mean stddev
ARTstor 9,560 0 0 8 0.73 1.22
Biodiversity Heritage Library 22,004 0 0 8 0.59 1.02
David Rumsey 123 0 0 4 0.18 0.53
Digital Commonwealth 41,704 0 1.5 10 1.21 1.10
Digital Library of Georgia 132,160 0 1 7 1.34 0.96
Harvard_Library 9,257 0 0 6 1.65 1.02
HathiTrust 685,733 0 1 9 1.63 1.16
Internet_Archive 56,910 0 2 11 1.47 1.75
J_Paul_Getty_Trust 2,777 0 2 6 1.58 0.99
Kentucky_Digital_Library 1,972 0 1.5 5 1.50 1.38
Minnesota_Digital_Library 24,472 0 0 7 0.42 0.74
Missouri_Hub 6,893 0 3 7 1.24 1.37
Mountain_West_Digital_Library 227,755 0 1 8 0.97 1.04
National_Archives_and_Records_Administration 7,086 0 3 7 1.68 1.61
North_Carolina_Digital_Heritage_Center 99,258 0 0.5 7 1.34 0.93
Smithsonian_Institution 348,302 0 2 7 0.84 0.96
South_Carolina_Digital_Library 23,842 0 3.5 8 1.68 1.41
The_New_York_Public_Library 69,210 0 1 7 1.57 1.12
The_Portal_to_Texas_History 104,566 0 1 7 0.84 0.91
United_States_Government_Printing_Office_(GPO) 174,067 0 2 7 1.38 0.99
University_of_Illinois_at_Urbana-Champaign 6,183 0 2 6 1.31 1.25
University_of_Southern_California_Libraries 65,958 0 0 7 0.75 1.09
University_of_Virginia_Library 3,736 0 5 7 1.67 1.58
63 0 2 5 1.17 1.31
Average Punctuation Characters

Again, on this one I don’t have much to talk about. I do know that I plan to take a look at which punctuation characters are being used by which Hubs. I have a feeling that this could be very useful in identifying problems with mapping from one metadata world to another. For example, I know there are character patterns in the subject values of the DPLA dataset that resemble sub-field indicators from MARC records (‡, |, and —); how many there are is something to look at.

Let me know if there are other pieces that you think might be interesting to look at related to this subject work with the DPLA metadata dataset and I’ll see what I can do.

Let me know what you think via Twitter if you have questions or comments.

Effects of subject normalization on DPLA Hubs

In the previous post I walked through some of the different ways that we could normalize a subject string and took a look at what effects these normalizations had on the subjects in the entire DPLA metadata dataset that I have been using.

In this post I wanted to continue along those lines and take a look at what happens when you apply these normalizations to the subjects in the dataset, but this time focusing on the Hub level instead of working with the whole dataset.

I applied the normalizations mentioned in the previous post to the subjects from each of the Hubs in the DPLA dataset. This included total values, unique but un-normalized values, case folded, lowercased, NACO, Porter stemmed, and fingerprint. I applied each normalization to the output of the previous one as a series; here is what the normalization chain looked like for each, with a small code sketch following the list.

total
total > unique
total > unique > case folded
total > unique > case folded > lowercased
total > unique > case folded > lowercased > NACO
total > unique > case folded > lowercased > NACO > Porter
total > unique > case folded > lowercased > NACO > Porter > fingerprint
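
Here is a small sketch of applying such a chain in series and recording the counts at each step; the individual normalization functions are placeholders for the implementations described in the previous post, not code that ships with it.

def apply_chain(total_subjects, normalizers):
    # Start with the total list of subject values, reduce to unique values,
    # then apply each normalization to the output of the previous one,
    # recording the number of values that remain after each step.
    counts = [("total", len(total_subjects))]
    values = set(total_subjects)
    counts.append(("unique", len(values)))
    for name, normalize in normalizers:
        values = set(normalize(value) for value in values)
        counts.append((name, len(values)))
    return counts

# Hypothetical usage with placeholder normalization functions:
# chain = [("case folded", case_fold), ("lowercased", lambda s: s.lower()),
#          ("naco", naco_normalize), ("porter", porter_stem),
#          ("fingerprint", fingerprint)]
# for name, count in apply_chain(subjects_for_hub, chain):
#     print name, count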

The number of subjects after each normalization is presented in the first table below.

Hub Name Total Subjects Unique Subjects Folded Lowercase NACO Porter Fingerprint
ARTstor 194,883 9,560 9,559 9,514 9,483 8,319 8,278
Biodiversity_Heritage_Library 451,999 22,004 22,003 22,002 21,865 21,482 21,384
David_Rumsey 22,976 123 123 122 121 121 121
Digital_Commonwealth 295,778 41,704 41,694 41,419 40,998 40,095 39,950
Digital_Library_of_Georgia 1,151,351 132,160 132,157 131,656 131,171 130,289 129,724
Harvard_Library 26,641 9,257 9,251 9,248 9,236 9,229 9,059
HathiTrust 2,608,567 685,733 682,188 676,739 671,203 667,025 653,973
Internet_Archive 363,634 56,910 56,815 56,291 55,954 55,401 54,700
J_Paul_Getty_Trust 32,949 2,777 2,774 2,760 2,741 2,710 2,640
Kentucky_Digital_Library 26,008 1,972 1,972 1,959 1,900 1,898 1,892
Minnesota_Digital_Library 202,456 24,472 24,470 23,834 23,680 22,453 22,282
Missouri_Hub 97,111 6,893 6,893 6,850 6,792 6,724 6,696
Mountain_West_Digital_Library 2,636,219 227,755 227,705 223,500 220,784 214,197 210,771
National_Archives_and_Records_Administration 231,513 7,086 7,086 7,085 7,085 7,050 7,045
North_Carolina_Digital_Heritage_Center 866,697 99,258 99,254 99,020 98,486 97,993 97,297
Smithsonian_Institution 5,689,135 348,302 348,043 347,595 346,499 344,018 337,209
South_Carolina_Digital_Library 231,267 23,842 23,838 23,656 23,291 23,101 22,993
The_New_York_Public_Library 1,995,817 69,210 69,185 69,165 69,091 68,767 68,566
The_Portal_to_Texas_History 5,255,588 104,566 104,526 103,208 102,195 98,591 97,589
United_States_Government_Printing_Office_(GPO) 456,363 174,067 174,063 173,554 173,353 172,761 170,103
University_of_Illinois_at_Urbana-Champaign 67,954 6,183 6,182 6,150 6,134 6,026 6,010
University_of_Southern_California_Libraries 859,868 65,958 65,882 65,470 64,714 62,092 61,553
University_of_Virginia_Library 93,378 3,736 3,736 3,672 3,660 3,625 3,618

Here is a table that shows the percentage reduction after each field is normalized with a specific algorithm.  The percent reduction makes it a little easier to interpret.

Hub Name Folded Normalization Lowercase Normalization Naco Normalization Porter Normalization Fingerprint Normalization
ARTstor 0.0% 0.5% 0.8% 13.0% 13.4%
Biodiversity_Heritage_Library 0.0% 0.0% 0.6% 2.4% 2.8%
David_Rumsey 0.0% 0.8% 1.6% 1.6% 1.6%
Digital_Commonwealth 0.0% 0.7% 1.7% 3.9% 4.2%
Digital_Library_of_Georgia 0.0% 0.4% 0.7% 1.4% 1.8%
Harvard_Library 0.1% 0.1% 0.2% 0.3% 2.1%
HathiTrust 0.5% 1.3% 2.1% 2.7% 4.6%
Internet_Archive 0.2% 1.1% 1.7% 2.7% 3.9%
J_Paul_Getty_Trust 0.1% 0.6% 1.3% 2.4% 4.9%
Kentucky_Digital_Library 0.0% 0.7% 3.7% 3.8% 4.1%
Minnesota_Digital_Library 0.0% 2.6% 3.2% 8.3% 8.9%
Missouri_Hub 0.0% 0.6% 1.5% 2.5% 2.9%
Mountain_West_Digital_Library 0.0% 1.9% 3.1% 6.0% 7.5%
National_Archives_and_Records_Administration 0.0% 0.0% 0.0% 0.5% 0.6%
North_Carolina_Digital_Heritage_Center 0.0% 0.2% 0.8% 1.3% 2.0%
Smithsonian_Institution 0.1% 0.2% 0.5% 1.2% 3.2%
South_Carolina_Digital_Library 0.0% 0.8% 2.3% 3.1% 3.6%
The_New_York_Public_Library 0.0% 0.1% 0.2% 0.6% 0.9%
The_Portal_to_Texas_History 0.0% 1.3% 2.3% 5.7% 6.7%
United_States_Government_Printing_Office_(GPO) 0.0% 0.3% 0.4% 0.8% 2.3%
University_of_Illinois_at_Urbana-Champaign 0.0% 0.5% 0.8% 2.5% 2.8%
University_of_Southern_California_Libraries 0.1% 0.7% 1.9% 5.9% 6.7%
University_of_Virginia_Library 0.0% 1.7% 2.0% 3.0% 3.2%

Here is that data presented as a graph, which I think shows the data even better.

Reduction Percent after Normalization

You can see that for many of the Hubs the biggest reduction happens when applying the Porter normalization and the Fingerprint normalization. A Hub of note is ARTstor, which had the highest percentage of reduction of all the Hubs. This was primarily caused by the Porter normalization, which means that a large percentage of subjects stemmed to the same stem; often this is plural vs. singular versions of the same subject. This may be completely valid with how ARTstor chose to create metadata, but it is still interesting.

Another Hub I found interesting was Harvard, where the biggest reduction happened with the Fingerprint normalization. This might suggest that there are a number of values that are the same, just in a different order, for example names that occur in both inverted and non-inverted form.

In the end I’m not sure how helpful this is as an indicator of quality within a field. There are fields that would benefit from this sort of normalization more than others. For example, subject, creator, contributor, and publisher will normalize very differently than a field like title or description.

Let me know what you think via Twitter if you have questions or comments.

Metadata normalization as an indicator of quality?

Metadata quality and assessment is a concept that has been around for decades in the library community.  Recently it has been getting more interest as new aggregations of metadata become available in open and freely reusable ways such as the Digital Public Library of America (DPLA) and Europeana.  Both of these groups make available their metadata so that others can remix and reuse the data in new ways.

I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields. This post will continue along those lines in trying to figure out some of the metrics that we can calculate with the DPLA dataset and use to define “quality”. Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.

This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.

One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry. Oftentimes this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.

OpenRefine Cluster and Edit Screen Capture

OpenRefine has a number of different algorithms that can be used for this work, which are documented in its Clustering in Depth documentation. Depending on one’s data, one approach may perform better than another for this kind of clustering.

Normalization

Case normalization is probably the easiest kind of normalization to understand. If you have two strings, say “Mark” and “marK”, and you convert each of them to lowercase, you end up with a single value of “mark”. Many more complicated normalizations assume this as a start because it reduces the number of subjects without drastically transforming the original string values.

Case folding is another kind of transformation that is fairly common in the world of libraries. This is the process of taking a string like “José” and converting it to “Jose”. While this can introduce issues if a string is meant to have a diacritic that makes the word or phrase different from the one without it, oftentimes it can help to normalize inconsistently notated versions of the same string.
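
A common way to do this kind of folding in Python is with the unicodedata module; this is just a minimal sketch, not necessarily how the folding in the experiment below was implemented.

# -*- coding: utf-8 -*-
import unicodedata

def fold_diacritics(value):
    # Decompose accented characters and drop the combining marks,
    # leaving only their ASCII base letters.
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")

print fold_diacritics(u"José")  # Jose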

In addition to case folding and lowercasing, libraries have been normalizing data for a long time; there have been efforts in the past to formalize algorithms for the normalization of subject strings for use in matching those strings. Often referred to as the NACO normalization rules, these are formally the Authority File Comparison Rules. I’ve always found this work intriguing and have a preference for the simplified algorithm that was developed at OCLC in their NACO Normalization Service. In fact we’ve taken the sample Python implementation there and created a stand-alone repository and project called pynaco on GitHub so that we could add tests and then work to port it to Python 3 in the near future.

Another common type of normalization that is performed on strings in library land is stemming. This is often done within search applications so that if you search for one of the words run, runs, or running you would get documents that contain any of them.
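
NLTK ships with a Porter stemmer that illustrates this nicely (a quick sketch, assuming NLTK is installed):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running"]:
    # All three forms reduce to the same stem, "run".
    print word, "->", stemmer.stem(word)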

What I’ve been playing around with is whether we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.

Here is an example.

If we have the following sets of subjects:

 Musical Instruments
 Musical Instruments.
 Musical instrument
 Musical instruments
 Musical instruments,
 Musical instruments.

If you applied the simplified NACO normalization from pynaco you would end up with the following strings:

musical instruments
musical instruments
musical instrument
musical instruments
musical instruments
musical instruments

If you then applied the porter stemming algorithm to the new set of subjects you would end up with the following:

music instrument
music instrument
music instrument
music instrument
music instrument
music instrument

So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by a normalization with the Porter stemming algorithm.

Experiment

In some past posts I discussed some of the aspects of the subject fields present in the Digital Public Library of America dataset. I dusted that dataset off and extracted all of the subjects so that I could work with them by themselves.

I ended up with a text file 23,858,236 lines long that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like.

d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais

Once I have the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.

Total vs Unique

The first thing I did was to reduce the 23,858,236-line text file to only unique values. I did this with the tried and true method of using the Unix sort and uniq tools.

sort subjects_all.txt | uniq > subjects_uniq.txt

After about eight minutes of waiting I ended up with a new text file subjects_uniq.txt that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.

Case folding

Using a Python script to perform case folding on each of the unique subjects, I was able to see whether that causes a reduction in the number of unique subjects.

I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects.  That is a difference of 4,753 or a 0.25% reduction in the number of unique subjects.  So nothing huge.

Lowercase

The next normalization tested was lowercasing of the values.  I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.

By converting the subject strings to lowercase I reduced the number of unique case folded subjects from 1,867,129 to 1,849,682 which is a reduction of 22,200 or a 1.2% reduction from the original 1,871,882 unique subjects.

NACO Normalization

Next we look at the simple NACO normalization from pynaco.  I applied this to the unique lower cased subjects from the previous step.

With the NACO normalization,  I end up with 1,826,523 unique subject strings from the 1,849,682 that I started with from the lowercased subjects.  This is a difference of 45,359 or a 2.4% reduction from the original 1,871,882 unique subjects.

Porter stemming

Moving along, the next thing I looked at was applying the Porter stemming algorithm to the output of the NACO normalized subjects from the previous step. I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.

With the Porter stemmer applied, I ended up with 1,801,114 unique subject strings from the 1,826,523 that I started with from the NACO normalized subjects. This is a difference of 70,768 or a 3.8% reduction from the original 1,871,882 unique subjects.

Fingerprint

Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature. This helps to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”. I used the output of the previous Porter stemming step as the input for this normalization.
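
The fingerprint idea is roughly: trim and lowercase the string, strip punctuation, split it into tokens, then sort and de-duplicate the tokens. Here is a rough sketch of that approach (not the exact port that was used, which also folds diacritics):

import string

def fingerprint(value):
    # Lowercase, drop punctuation, then sort and de-duplicate the tokens
    # so that word order no longer matters.
    value = value.strip().lower()
    value = "".join(c for c in value if c not in string.punctuation)
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

print fingerprint("phillips mark")   # mark phillips
print fingerprint("mark phillips")   # mark phillips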

With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint normalized subject strings. This is a difference of 105,393 or a 5.6% reduction from the original 1,871,882 unique subjects.

Overview

Normalization Reduction Occurrences Percent Reduction
Unique 0 1,871,882 0%
Case Folded 4,753 1,867,129 0.3%
Lowercase 22,200 1,849,682 1.2%
NACO 45,359 1,826,523 2.4%
Porter 70,768 1,801,114 3.8%
Fingerprint 105,393 1,766,489 5.6%

Conclusion

I think that it might be interesting to apply this analysis to the various Hubs in the whole DPLA dataset to see if there is anything interesting to be seen across the various types of content providers.

I’m also curious if there are other kinds of normalizations that I’m blanking on that would be logical to apply to the subjects. One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts if it has the double hyphen “--” in the string. I wrote about the effect of this on the subjects in the DPLA dataset in a previous post.

As always feel free to contact me via Twitter if you have questions or comments.

Creator and Use Data for the UNT Scholarly Works Repository

I was asked a question last week about the most “used” item in our UNT Scholarly Works Repository, which led to a discussion of the most “used” creator across that same collection. I spent a few minutes going through the process of pulling this data and thought that it would make a good post and allow me to try out writing some step-by-step instructions.

Here are the things that I was interested in.

  1. What creator has the most items where they are an author or co-author in the UNT Scholarly Works Repository?
  2. What is the most used item in the repository?
  3. What author has the highest “average item usage”?
  4. How do these lists compare?

In order to answer these questions there are a number of steps that I had to go through to get the final data. This post will walk through those steps below.

  1. Get a list of the item identifiers in the collection
  2. Grab the stats and metadata for each of the identifiers
  3. Convert metadata and stats into a format that can be processed
  4. Add up uses per item, per author, sort and profit.

So here we go.

Downloading the identifiers.

We have a number of APIs for each collection in our digital library. These are very simple APIs compared to some of those offered by other systems, and in many cases our primary API consists of technologies like OAI-PMH, OpenSearch, and simple text lists or JSON files. Here is the documentation for the APIs available for the UNT Scholarly Works Repository. For this project the API I’m interested in is the identifiers list. If you go to the URL http://digital.library.unt.edu/explore/collections/UNTSW/identifiers/ you can get all of the public identifiers for the collection.

Here is the WGET command that I use to grab this file and to save it as a file called untsw.arks

[vphill]$ wget http://digital.library.unt.edu/explore/collections/UNTSW/identifiers/ -O untsw.arks

Now that we have this file we can quickly get a count for the total number of items we will be working with by using the wc command.

[vphill]$ wc -l untsw.arks
3731 untsw.arks

We can quickly see that there are 3,731 identifiers in this file.

Next up we want to adjust that arks file a bit to get at just the name part of each ark; locally we call these either meta_ids or ids for short. I will use the sed command to get rid of the ark:/67531/ part of each line and then save the resulting lines as a new file. Here is that command:

sed "s/ark:\/67531\///" untsw.arks > untsw.ids

Now we have a file untsw.ids that looks like this:

metadc274983
metadc274993
metadc274992
metadc274991
metadc274998
metadc274984
metadc274980
metadc274999
metadc274985
metadc274995

We will use this file to now grab the metadata and usage stats for each item.

Downloading Stats and Metadata

For this step we will make use of an undocumented API for our system,  internally it is called the “resource_object”.  For a given item http://digital.library.unt.edu/ark:/67531/metadc274983/ if you append resource_object.json you will get the JSON representation of the resource object we use for all of our templating in the system.  http://digital.library.unt.edu/ark:/67531/metadc274983/resource_object.json is the resulting URL.  Depending on the size of the object, this resource object could be quite large because it has a bunch of data inside.

Two pieces of data that are important to us are the usage stats and the metadata for the item itself.   We will make use of wget again to grab this info,  and a quick loop to help automate the process a bit more.  Before we grab all of these files we want to create a folder called “data” to store content in.

[vphill]$ mkdir data
[vphill]$ for i in `cat untsw.ids` ; do wget -nc "http://digital.library.unt.edu/ark:/67531/$i/resource_object.json" -O data/$i.json ; done

What this does: first we create a directory called data with the mkdir command.

Next we loop over all of the lines in the untsw.ids file by using the cat command to read the file. On each iteration of the loop, the variable $i will contain a new meta_id from the file.

For each meta_id we use wget to grab the resource_object.json and save it to a JSON file in the data directory, named using the meta_id with .json appended to the end.

I’ve added the -nc option to wget, which means “no clobber”, so if you have to restart this step it won’t try to re-download items that have already been downloaded.

This step can take a few minutes depending on the size of the collection you are pulling.  I think it took about 15 minutes for my 3,731 items in the UNT Scholarly Works Repository.

Converting the Data

For this next section I have three bits of code that I use to get at the data inside of the JSON files that we downloaded into the “data” folder. I suggest now creating a “code” folder using mkdir again so that we can place the following Python scripts into it. The names for each of these files are as follows: get_creators.py, get_usage.py, and reducer.py.

#get_creators.py

import sys
import json

if len(sys.argv) != 2:
    print "usage: %s <untl resource object json file>" % sys.argv[0]
    exit(-1)

filename = sys.argv[1]
data = json.loads(open(filename).read())

total_usage = data["stats"]["total"]
meta_id = data["meta_id"]

metadata = data["desc_MD"].get("creator", [])

creators = []
for i in metadata:
    # replace any tab characters so they don't break the tab-delimited output
    creators.append(i["content"]["name"].replace("\t", " "))

for creator in creators:
    out = "\t".join([meta_id, creator, str(total_usage)])
    print out.encode('utf-8')

Copy the above text into a file inside your “code” folder called get_creators.py

#get_usage.py
import sys
import json

if len(sys.argv) != 2:
    print "usage: %s <untl resource object json file>" % sys.argv[0]
    exit(-1)

filename = sys.argv[1]
data = json.loads(open(filename).read())

total_usage = data["stats"]["total"]
meta_id = data["meta_id"]
title = data["desc_MD"]["title"][0]["content"].replace("\t", " ")

out = "\t".join([meta_id, str(total_usage), title])
print out.encode("utf-8")

Copy the above text into a file inside your “code” folder called get_usage.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    # current_word - string containing a word (the key)
    # group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()

Copy the above text into a file inside your “code” folder called reducer.py

Now that we have these three scripts,  I want to loop over all of the JSON files in the data directory and pull out information from them.  First we use the get_usage.py script and redirect the output of that script to a file called usage.txt

[vphill]$ for i in data/*.json ; do python code/get_usage.py "$i" ; done > usage.txt

Here is what that file looks like when you look at the first ten lines.

metadc102275 447 Feeling Animal: Pet-Making and Mastery in the Slave's Friend
metadc102276 48 An Extensible Approach to Interoperability Testing: The Use of Special Diagnostic Records in the Context of Z39.50 and Online Library Catalogs
metadc102277 114 Using Assessment to Guide Strategic Planning
metadc102278 323 This Side of the Border: The Mexican Revolution through the Lens of American Photographer Otis A. Aultman
metadc102279 88 Examining MARC Records as Artifacts That Reflect Metadata Utilization Decisions
metadc102280 155 Genetic Manipulation of a "Vacuolar" H+ -PPase: From Salt Tolerance to Yield Enhancement under Phosphorus-Deficient Soils
metadc102281 82 Assessing Interoperability in the Networked Environment: Standards, Evaluation, and Testbeds in the Context of Z39.50
metadc102282 67 Is It Really That Bad? Verifying the extent of full-text linking problems
metadc102283 133 The Hunting Behavior of Black-Shouldered Kites (Elanus Caeruleus Leucurus) in Central Chile
metadc102284 199 Ecological theory and values in the determination of conservation goals: examples from temperate regions of Germany, United States of America, and Chile

It is a tab-delimited file with three fields: the meta_id, the usage count, and finally the title of the item.

The next thing we want to do is create another list of creators and their usage data. We do that in a similar way as in the previous step. The command below should get you where you want to go.

[vphill]$ for i in data/* ; do python code/get_creators.py "$i" ; done > creators.txt

Here is a sample of what this file looks like.

metadc102275 Keralis, Spencer D. C. 447
metadc102276 Moen, William E. 48
metadc102276 Hammer, Sebastian 48
metadc102276 Taylor, Mike 48
metadc102276 Thomale, Jason 48
metadc102276 Yoon, JungWon 48
metadc102277 Avery, Elizabeth Fuseler 114
metadc102278 Carlisle, Tara 323
metadc102279 Moen, William E. 88
metadc102280 Gaxiola, Roberto A. 155

Here again you have a tab delimited file with the meta_id, name and usage for that name in that item.  You can see that there are five entries for the item metadc102276 because there were five creators for that item.

Looking at the Data

The final step (and the thing that we’ve been waiting for) is to actually do some work with this data. This is easy to do with a few standard Unix/Linux command line tools. The work below will make use of the tools wc, sort, uniq, cut, and head.

Most used items

The first thing that we can do with the usage.txt file is to see which items were used the most.   If we use the following command you can get at this data.

[vphill]$ sort -t$'\t' -k 2nr usage.txt | head

We need to sort the usage.txt file by the second column, with the data treated as numeric and in reverse order, from largest to smallest. The sort command above uses the -t option to say that we want to treat the tab character as the delimiter instead of the default, and the -k option says to sort on the second column as a number in reverse order. We pipe this output to the head program, which takes the first ten results and spits them out. We should have something that looks like the following (formatted as a table for easier reading).

meta_id usage title
metadc30374 5,153 Appendices To: The UP/SP Merger: An Assessment of the Impacts on the State of Texas
metadc29400 5,075 Remote Sensing and GIS for Nonpoint Source Pollution Analysis in the City of Dallas’ Eastern Watersheds
metadc33126 4,691 Research Consent Form: Focus Groups and End User Interviews
metadc86949 3,712 The First World War: American Ideals and Wilsonian Idealism in Foreign Policy
metadc33128 3,512 Summary Report of the Needs Assessment
metadc86874 2,986 Synthesis and Characterization of Nickel and Nickel Hydroxide Nanopowders
metadc86872 2,886 Depression in college students: Perceived stress, loneliness, and self-esteem
metadc122179 2,766 Cross-Cultural Training and Success Versus Failure of Expatriates
metadc36277 2,564 What’s My Leadership Color?
metadc29807 2,489 Bishnoi: An Eco-Theological “New Religious Movement” In The Indian Desert

Creators with the most uses

The next thing we want to do is look at the creators that had the most collective uses in the entire dataset.  For this we use the creators.txt file and grab only the name and usage field.  We then sort by the name field so they are all in alphabetical order.  We use the reducer.py script to add up the uses for each name (must be sorted before you do this step) and then we pipe that to the sort program again.  Here is the command.

[vphill]$ cut -f 2,3 creators.txt | sort | python code/reducer.py | sort -t$'\t' -k 2nr | head

Hopefully there are portions of the above command that are recognizable from the previous example (sorting by the second column and head) with some new things thrown in.  Again I’ve converted the output to a table for easier viewing.

Creator Total Aggregated Uses per Creator
Murray, Kathleen R. 24,600
Mihalcea, Rada, 1974- 23,960
Cundari, Thomas R., 1964- 20,903
Phillips, Mark Edward 20,023
Acree, William E. (William Eugene) 18,930
Clower, Terry L. 14,403
Alemneh, Daniel Gelaw 13,069
Weinstein, Bernard L. 13,008
Moen, William E. 12,615
Marshall, James L., 1940- 8,692

Publications Per Creator

Another thing that is helpful is to pull the list of publications per author which we can do easily with our creators.txt list.

Here is the command we will want to use.

[vphill]$ cut -f 2 creators.txt | sort | uniq -c | sort -nr | head

This command should be familiar from the previous examples; the new command is uniq with the -c option to count the instances of each name. I then sort on that count in reverse order (highest to lowest) and take the top ten results.

The output will look something like this

 267 Acree, William E. (William Eugene)
 161 Phillips, Mark Edward
 114 Alemneh, Daniel Gelaw
 112 Cundari, Thomas R., 1964-
 108 Mihalcea, Rada, 1974-
 106 Grigolini, Paolo
  90 Falsetta, Vincent
  87 Moen, William E.
  86 Dixon, R. A.
  85 Spear, Shigeko

To keep up with the formatted tables, here are the top ten most prolific creators in the UNT Scholarly Works Repository.

Creators Items
Acree, William E. (William Eugene) 267
Phillips, Mark Edward 161
Alemneh, Daniel Gelaw 114
Cundari, Thomas R., 1964- 112
Mihalcea, Rada, 1974- 108
Grigolini, Paolo 106
Falsetta, Vincent 90
Moen, William E. 87
Dixon, R. A. 86
Spear, Shigeko 85

Average Use Per Item

A bonus exercise is to combine the creators’ use counts with the number of items they have in the repository to calculate their average item usage. I did that for the top ten creators by overall use, and you can see how that shows some interesting things too.

Name Total Aggregate Uses Items Use Per Item Ratio
Murray, Kathleen R. 24,600 65 378
Mihalcea, Rada, 1974- 23,960 108 222
Cundari, Thomas R., 1964- 20,903 112 187
Phillips, Mark Edward 20,023 161 124
Acree, William E. (William Eugene) 18,930 267 71
Clower, Terry L. 14,403 54 267
Alemneh, Daniel Gelaw 13,069 114 115
Weinstein, Bernard L. 13,008 49 265
Moen, William E. 12,615 87 145
Marshall, James L., 1940- 8,692 71 122

It is interesting to see that Murray, Kathleen R. has both the highest aggregate uses as well as the highest Use Per Item Ratio.  Other authors like Acree, William E. (William Eugene) who have many publications go down a bit in rank if you ordered by Use Per Item Ratio.

Conclusion

Depending on what side of the fence you sit on, this post either demonstrates remarkable flexibility in the way you can get at data in a system, or it will make you want to tear your hair out because there isn’t a pre-built interface for these reports. I’m of the camp that the way we’ve done things is a feature and not a bug, but again many will have a different view.

How do you go about getting this data out of your systems?  Is the process much easier,  much harder or just about the same?

As always feel free to contact me via Twitter if you have questions or comments.

Extended Date Time Format (EDTF) use in the DPLA: Part 3, Date Patterns

 

Date Values

I wanted to take a look at the date values that had made their way into the DPLA dataset from the various Hubs. The first thing that I was curious about was how many unique date strings are present in the dataset; it turns out that there are 280,592 unique date strings.

Here are the top ten date strings, their instance counts, and whether each string is valid EDTF.

Date Value Instances Valid EDTF
[Date Unavailable] 183,825 FALSE
1939-1939 125,792 FALSE
1960-1990 73,696 FALSE
1900 28,645 TRUE
1935 – 1945 27,143 FALSE
1909 26,172 TRUE
1910 26,106 TRUE
1907 25,321 TRUE
1901 25,084 TRUE
1913 24,966 TRUE

It looks like “[Date Unavailable]” is a value used by the New York Public Library to denote that an item does not have an available date. It should be noted that NYPL also has 377,664 items in the DPLA that have no date value present at all, so this isn’t a default behavior for items without a date. Most likely it is the practice within a single division to denote unknown or missing dates this way. The value “1939-1939” is used heavily by the University of Southern California. Libraries and seems to come from a single set of WPA Census Cards in their collection. The value “1960-1990” is used primarily for the items from the J. Paul Getty Trust.

Date Length

I was also curious as to the length of the dates in the dataset. I was sure that I would find large numbers of date strings that were four digits in length (1923), ten digits in length (1923-03-04), and other lengths for common, highly used date formats. I also figured that there would be instances of dates that were either shorter than four digits or longer than one would expect for a date string. Here are some example date strings for both.

Top ten date strings shorter than four characters

Date Value Instances
* 968
昭和3 521
昭和2 447
昭和4 439
昭和5 391
昭和9 388
昭和6 382
昭和7 366
大正4 323
昭和8 322

I’m not sure what “*” means for a date value, but the other values seem to be Japanese versions of four digit dates (this is what Google Translate tells me). There are 14,402 records that have date strings shorter than four characters, with a total of 522 unique date strings present.

Top ten date strings longer than fifty characters.

Date Value Instances
Miniature repainted: 12th century AH/AD 18th (Safavid) 35
Some repainting: 13th century AH/AD 19th century (Safavid 25
11th century AH/AD 17th century-13th century AH/AD 19th century (Safavid (?)) 15
1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939 13
10th century AH/AD 16th century-12th century AH/AD 18th century (Ottoman) 10
late 11th century AH/AD 17th century-early 12th century AH/AD 18th century (Ottoman) 8
5th century AH/AD 11th century-6th century AH/AD 12th century (Abbasid) 7
4th quarter 8th century AH/AD 14th century (Mamluk) 5
L’an III de la République française … [1794-1795] 5
Began with 1st rept. (112th Congress, 1st session, published June 24, 2011) 3

There are 1,033 items with 894 unique values that are over fifty characters in length. The longest is a “date string” of 193 characters, with a value of “chez W. Innys, J. Brotherton, R. Ware, W. Meadows, T. Meighan, J. & P. Knapton, J. Brindley, J. Clarke, S. Birt, D. Browne, T. Dongman, J. Shuckburgh, C. Hitch, J. Hodges, S. Austen, A. Millar,” which appears to be a misplacement of another field’s data.

Here is the distribution of these items with date strings with fifty characters in length or more.

Hub Name Items with Date Strings 50 Characters or Longer
United States Government Printing Office (GPO) 683
HathiTrust 172
ARTstor 112
Mountain West Digital Library 31
Smithsonian Institution 25
University of Illinois at Urbana-Champaign 3
J. Paul Getty Trust 2
Missouri Hub 2
North Carolina Digital Heritage Center 2
Internet Archive 1

It seems that a large portion of these 50+ character date strings are present in the Government Printing Office records.

Date Patterns

Another way of looking at dates that I experimented with for this project was to convert a date string into what I’m calling a “date pattern”. For this I take an input string, say “1940-03-22”, and map it to 0000-00-00. I convert all digits to zero, all letters to the letter a, and leave all non-alphanumeric characters as they are.

Below is the function that I use for this.

def get_date_pattern(date_string):
    pattern = []
    if date_string is None:
        return None
    for c in date_string:
        if c.isalpha():
            pattern.append("a")
        elif c.isdigit():
            pattern.append("0")
        else:
            pattern.append(c)
    return "".join(pattern)

By applying this function to all of the date strings in the dataset I’m able to take a look at what overall date patterns (and also features) are being used throughout the dataset, and ignore the specific values.

There are a total of 74 different date patterns for date strings that are valid EDTF. For the date strings that are not valid EDTF, there are a total of 13,643 different date patterns. I’ve pulled the top ten date patterns for both valid and non-valid EDTF date strings and presented them below.

Valid EDTF Date Patterns

Valid EDTF Date Pattern Instances Example
0000 2,114,166 2004
0000-00-00 1,062,935 2004-10-23
0000-00 107,560 2004-10
0000/0000 55,965 2004/2010
0000? 13,727 2004?
[0000-00-00..0000-00-00] 4,434 [2000-02-03..2001-03-04]
0000-00/0000-00 4,181 2004-10/2004-12
0000~ 3,794 2003~
0000-00-00/0000-00-00 3,666 2003-04-03/2003-04-05
[0000..0000] 3,009 [1922..2000]

You can see that the basic date formats yyyy, yyyy-mm-dd, and yyyy-mm are very popular in the dataset. Following those, intervals are used in the format yyyy/yyyy, and uncertain dates appear as yyyy?.

 Non-Valid EDTF Date Patterns

Non-Valid EDTF Date Pattern Instances Example
0000-0000 1,117,718 2005-2006
00/00/0000 486,485 03/04/2006
[0000] 196,968 [2006]
[aaaa aaaaaaaaaaa] 183,825 [Date Unavailable]
00 aaa 0000 143,423 22 Jan 2006
0000 – 0000 134,408 2000 – 2005
0000-aaa-00 116,026 2003-Dec-23
0 aaa 0000 62,950 3 Jan 2000
0000] 58,459 1933]
aaa 0000 43,676 Jan 2000

Many of the date strings represented by these patterns could be “cleaned up” by simple transforms if that was of interest. I would imagine that converting 0000-0000 to 0000/0000 would be a fairly lossless transform that would suddenly change over a million items so that they are valid EDTF. Converting the format 00/00/0000 to 0000-00-00 is also a straightforward transform if you know whether 00-00 is mm-dd (US) or dd-mm (non-US). Removing the brackets around four digit years [0000] seems to be another easy fix that would convert a large number of dates. Of the top ten non-valid EDTF date patterns, it might be possible to convert nine of them into valid EDTF date strings with simple transformations. This would give the DPLA 2,360,113 additional dates that are valid EDTF date strings. The values for the date pattern [aaaa aaaaaaaaaaa], with a date string value of [Date Unavailable], might benefit from being removed from the dataset altogether in order to reduce some of the noise in the field.
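
As a sketch of what a few of those simple transforms might look like (these regular expressions are my illustration, not a vetted cleanup routine, and the 00/00/0000 case assumes US-style month-first dates):

import re

def simple_date_cleanup(date_string):
    # yyyy-yyyy -> yyyy/yyyy (EDTF interval notation)
    match = re.match(r"^(\d{4})-(\d{4})$", date_string)
    if match:
        return "%s/%s" % match.groups()
    # mm/dd/yyyy -> yyyy-mm-dd (assumes the month comes first)
    match = re.match(r"^(\d{2})/(\d{2})/(\d{4})$", date_string)
    if match:
        month, day, year = match.groups()
        return "%s-%s-%s" % (year, month, day)
    # [yyyy] -> yyyy (drop brackets around a four digit year)
    match = re.match(r"^\[(\d{4})\]$", date_string)
    if match:
        return match.group(1)
    return date_string

print simple_date_cleanup("2005-2006")   # 2005/2006
print simple_date_cleanup("03/04/2006")  # 2006-03-04
print simple_date_cleanup("[2006]")      # 2006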

Common Patterns Per Hub

One last thing that I wanted to do was to see if there are any commonalities between the Hubs when you look at their most frequently used date patterns.  Below I’ve created tables for both valid EDTF date patterns and non-valid EDTF date patterns.

Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000 0000-00 0000? 0000/0000 0000-00-00
Biodiversity Heritage Library 0000 -0000 0000/0000 0000-00 0000?
David Rumsey 0000
Digital Commonwealth 0000-00-00 0000-00 0000 0000-00-00a00:00:00a
Digital Library of Georgia 0000-00-00 0000-00 0000/0000 0000 0000-00-00/0000-00-00
Harvard Library 0000 00aa 000a aaaa
HathiTrust 0000 0000-00 0000? -0000 00aa
Internet Archive 0000 0000-00-00 0000-00 0000? 0000/0000
J. Paul Getty Trust 0000 0000?
Kentucky Digital Library 0000
Minnesota Digital Library 0000 0000-00-00 0000? 0000-00 0000-00-00?
Missouri Hub 0000-00-00 0000 0000-00 0000/0000 0000?
Mountain West Digital Library 0000-00-00 0000 0000-00 0000? 0000-00-00a00:00:00a
National Archives and Records Administration 0000 0000?
North Carolina Digital Heritage Center 0000-00-00 0000 0000-00 0000/0000 0000?
Smithsonian Institution 0000 0000? 0000-00-00 0000-00 00aa
South Carolina Digital Library 0000-00-00 0000 0000-00 0000?
The New York Public Library 0000-00-00 0000-00 0000 -0000 0000-00-00/0000-00-00
The Portal to Texas History 0000-00-00 0000 0000-00 [0000-00-00..0000-00-00] 0000~
United States Government Printing Office (GPO) 0000 0000? aaaa -0000 [0000, 0000]
University of Illinois at Urbana-Champaign 0000 0000-00-00 0000? 0000-00
University of Southern California. Libraries 0000-00-00 0000/0000 0000 0000-00 0000-00/0000-00
University of Virginia Library 0000-00-00 0000 0000-00 0000? 0000?-00

I tried to color code the five most common EDTF date patterns from above in the following image.

Color-coded date patterns per Hub.

I’m not sure whether that makes it clear where the common date patterns fall.

Non Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000-0000 aa. 0000 aaaaaaa 0000a aa. 0000-0000
Biodiversity Heritage Library 0000-0000 0000 – 0000 0000- 0000-00 [0000-0000]
David Rumsey
Digital Commonwealth 0000-0000 aaaaaaa 0000-00-00-0000-00-00 0000-00-0000-00 0000-0-00
Digital Library of Georgia 0000-0000 0000-00-00 0000-00- 00 aaaaa 0000 0000a
Harvard Library 0000a-0000a a. 0000 0000a 0000-0000 0000 – a. 0000
HathiTrust [0000] 0000-0000 0000] [a0000] a0000
Internet Archive 0000-0000 0000-00 0000- [0—] [0000]
J. Paul Getty Trust 0000-0000 a. 0000-0000 a. 0000 [000-] [aa. 0000]
Kentucky Digital Library
Minnesota Digital Library 0000 – 0000 0000-00 – 0000-00 0000-0000 0000-00-00 – 0000-00-00 0000 – 0000?
Missouri Hub a0000 0000-00-00 aaaaaaaa 00, 0000 aaaaaaa 00, 0000 aaaaaaaa 0, 0000
Mountain West Digital Library 0000-0000 aa. 0000-0000 aa. 0000 0000? – 0000? 0000 aa
National Archives and Records Administration 00/00/0000 00/0000 a'aa. 0000'-a'aa. 0000' a'00/0000'-a'00/0000' a'00/00/0000'-a'00/00/0000'
North Carolina Digital Heritage Center 0000-0000 00000000 00000000-00000000 aa. 0000-0000 aa. 0000
Smithsonian Institution 0000-0000 00 aaa 0000 0000-aaa-00 0 aaa 0000 aaa 0000
South Carolina Digital Library 0000-0000 0000 – 0000 0000- 0000-00-00 0000-0-00
The New York Public Library 0000-0000 [aaaa aaaaaaaaaaa] 0000 – 0000 0000-00-00 – 0000-00-00 0000-
The Portal to Texas History a. 0000 [0000] 0000 – 0000 [aaaaaaa 0000 aaa 0000] a.0000 – 0000
United States Government Printing Office (GPO) [0000] 0000-0000 [0000?] aaaaa aaaa 0000 00aa-0000
University of Illinois at Urbana-Champaign 0-00-00 a. 0000 00/00/00 0-0-00 00-00-00
University of Southern California. Libraries 0000-0000 aaaaa 0000/0000 aaaaa 0000-00-00/0000-00-00 0000a aaaaa 0000-0000
University of Virginia Library aaaaaaa aaaa a0000 aaaaaaa 0000 aaa 0000? aaaaaaa 0000 aaa 0000 00–?

With the non-valid EDTF date patterns you can see that some date patterns are much more common across the various Hubs than others.

I hope you have found these posts interesting.  If you’ve worked with metadata, especially aggregated metadata, you will no doubt recognize much of this from your own datasets.  If you are new to this area, or haven’t really worked with the wide range of date values that you can come in contact with in large metadata collections, have no fear: it is getting better.  The EDTF is a very good specification for cultural heritage institutions to adopt for their digital collections.  It helps to provide both a machine and human readable format for encoding and notating the complex dates we have to work with in our field.

If there is another field that you would like me to take a look at in the DPLA dataset,  please let me know.

As always feel free to contact me via Twitter if you have questions or comments.


Extended Date Time Format (EDTF) use in the DPLA: Part 2, EDTF use by Hub

This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America.  For more background on this topic take a look at the first post in this series.

EDTF Use by Hub

In the previous post I looked at the overall usage of the EDTF in the DPLA and found that it didn’t matter that much if a Hub was classified as a Content Hub or a Service Hub when it came to looking at the availability of dates in item records in the system.  Content Hubs had 83% of their records with some date value and Service Hubs had 81% of their records with date values.

Looking overall at the dates that were present,  there were 51% that were valid EDTF strings and 49% that were not valid EDTF strings.

One of the things that should be noted is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF.  For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.

Hub Name Items With Date % of total items with date present Valid EDTF Valid EDTF % Not Valid EDTF Not Valid EDTF %
ARTstor 49,908 88.6% 26,757 53.6% 23,151 46.4%
Biodiversity Heritage Library 29,000 21.0% 22,734 78.4% 6,266 21.6%
David Rumsey 48,132 100.0% 48,132 100.0% 0 0.0%
Digital Commonwealth 118,672 95.1% 14,731 12.4% 103,941 87.6%
Digital Library of Georgia 236,961 91.3% 188,263 79.4% 48,687 20.5%
Harvard Library 6,957 65.8% 6,910 99.3% 47 0.7%
HathiTrust 1,881,588 98.2% 1,295,986 68.9% 585,598 31.1%
Internet Archive 194,454 93.1% 185,328 95.3% 9,126 4.7%
J. Paul Getty Trust 92,494 99.8% 6,319 6.8% 86,175 93.2%
Kentucky Digital Library 87,061 68.1% 87,061 100.0% 0 0.0%
Minnesota Digital Library 39,708 98.0% 33,201 83.6% 6,507 16.4%
Missouri Hub 34,742 83.6% 32,192 92.7% 2,550 7.3%
Mountain West Digital Library 634,571 73.1% 545,663 86.0% 88,908 14.0%
National Archives and Records Administration 553,348 78.9% 10,218 1.8% 543,130 98.2%
North Carolina Digital Heritage Center 214,134 82.1% 163,030 76.1% 51,104 23.9%
Smithsonian Institution 675,648 75.3% 44,860 6.6% 630,788 93.4%
South Carolina Digital Library 52,328 68.9% 42,128 80.5% 10,200 19.5%
The New York Public Library 791,912 67.7% 47,257 6.0% 744,655 94.0%
The Portal to Texas History 424,342 88.8% 416,835 98.2% 7,505 1.8%
United States Government Printing Office (GPO) 148,548 99.9% 17,894 12.0% 130,654 88.0%
University of Illinois at Urbana-Champaign 14,273 78.8% 11,304 79.2% 2,969 20.8%
University of Southern California. Libraries 269,880 89.6% 114,293 42.3% 155,573 57.6%
University of Virginia Library 26,072 86.4% 21,798 83.6% 4,274 16.4%

Turning this into a graph helps things show up a bit better.

EDTF info for each of the DPLA Hubs

There are a number of things that can be teased out of here.  First, there are a few Hubs that have 100% or nearly 100% of their dates conforming to EDTF already, notably David Rumsey’s Hub and the Kentucky Digital Library, both at 100%.  Harvard at 99% and the Portal to Texas History at 98% are also notable.  On the other end of the spectrum we have the National Archives and Records Administration with 98% of their dates being Not Valid, New York Public Library with 94%, and the J. Paul Getty Trust at 93%.

Use of EDTF Level Features

The EDTF has the notion of feature levels: Level 0, Level 1, and Level 2.  Level 0 covers the basic date features such as date, date and time, and intervals.  Level 1 adds features like uncertain/approximate dates, unspecified dates, extended intervals, years exceeding four digits, and seasons.  Level 2 adds partial uncertain/approximate dates, partial unspecified dates, sets, multiple dates, masked precision, and extensions of the extended interval and years exceeding four digits.  Finally, Level 2 lets you qualify seasons.  For a full list of the features please take a look at the draft specification at the Library of Congress.

When I was preparing the dataset I also tested each date to see which feature level it matched.  After starting the analysis I noticed a few bugs in my testing code and added them as issues to the GitHub site for the ExtendedDateTimeFormat Python module available here.  Even with the bugs, which falsely identified one feature as both a Level 0 and a Level 1 feature and another feature as both Level 1 and Level 2, I was able to come up with usable data for further analysis.  Because of these bugs there are a few Hubs in the list below whose number of valid EDTF items differs slightly from the list presented in the first part of this post.

Hub Name valid EDTF items valid-level0 % Level0 valid-level1 % Level1 valid-level2 % Level2
ARTstor 26,757 26,726 99.9% 31 0.1% 0 0.0%
Biodiversity Heritage Library 22,734 22,702 99.9% 32 0.1% 0 0.0%
David Rumsey 48,132 48,132 100.0% 0 0.0% 0 0.0%
Digital Commonwealth 14,731 14,731 100.0% 0 0.0% 0 0.0%
Digital Library of Georgia 188,274 188,274 100.0% 0 0.0% 0 0.0%
Harvard Library 6,910 6,822 98.7% 83 1.2% 5 0.1%
HathiTrust 1,295,990 1,292,079 99.7% 3,662 0.3% 249 0.0%
Internet Archive 185,328 185,115 99.9% 212 0.1% 1 0.0%
J. Paul Getty Trust 6,319 6,308 99.8% 11 0.2% 0 0.0%
Kentucky Digital Library 87,061 87,061 100.0% 0 0.0% 0 0.0%
Minnesota Digital Library 33,201 26,055 78.5% 7,146 21.5% 0 0.0%
Missouri Hub 32,192 32,190 100.0% 2 0.0% 0 0.0%
Mountain West Digital Library 545,663 542,388 99.4% 3,274 0.6% 1 0.0%
National Archives and Records Administration 10,218 10,003 97.9% 215 2.1% 0 0.0%
North Carolina Digital Heritage Center 163,030 162,958 100.0% 72 0.0% 0 0.0%
Smithsonian Institution 44,860 44,642 99.5% 218 0.5% 0 0.0%
South Carolina Digital Library 42,128 42,079 99.9% 49 0.1% 0 0.0%
The New York Public Library 47,257 47,251 100.0% 6 0.0% 0 0.0%
The Portal to Texas History 416,838 402,845 96.6% 6,302 1.5% 7,691 1.8%
United States Government Printing Office (GPO) 17,894 16,165 90.3% 875 4.9% 854 4.8%
University of Illinois at Urbana-Champaign 11,304 11,275 99.7% 29 0.3% 0 0.0%
University of Southern California. Libraries 114,307 114,307 100.0% 0 0.0% 0 0.0%
University of Virginia Library 21,798 21,558 98.9% 236 1.1% 4 0.0%

Looking at the top 25% of the data,  you get the following.

EDTF Level Use by Hub

Obviously the majority of dates in the DPLA that are valid EDTF comply with Level 0, which includes standard dates like years (1900), year and month (1900-03), year, month, and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals using any of those formats (yyyy, yyyy-mm, yyyy-mm-dd), for example 2004-02/2014-03-23.

There are a number of Hubs that are making use of Level 1 and Level 2 features, with the most notable being the Minnesota Digital Library, which makes use of Level 1 features in 21.5% of their item records.  The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of their total) and GPO in 854 of their item records (4.8%).

I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.

Feel free to contact me via Twitter if you have questions or comments.

Extended Date Time Format (EDTF) use in the DPLA: Part 1

I’ve got a new series of posts that I’ve been wanting to do for a while now to try and get a better understanding of the utilization of the Extended Date Time Format (EDTF) in cultural heritage organizations and more specifically if those date formats are making their way into the Digital Public Library of America. Before I get started with the analysis of the over eight million records in the DPLA,  I wanted to give a little background of the EDTF format itself and some of the work that has happened in this area in the past.

A Bitter Harvest

One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library specific to metadata harvesting.  One text that I remember in particular was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers”, which talked about many of the issues that they ran into in trying to deal with dates from a variety of service providers.  This same challenge was also echoed by projects like the National Science Digital Library (NSDL), OAIster, and other aggregations of metadata over the years.

One thing that came out of many of these aggregation projects,  and something that many of us are dealing with today is the fact that “dates are hard”.

Extended Date Time Format

A few years ago an initiative was started to address some of the challenges that we have in representing dates for the types of things that we work with in cultural heritage organizations. This initiative was named the Extended Date Time Format (EDTF) which has a site at the Library of Congress and which resulted in a draft specification for a profile or extension to ISO 8601.

Among other things, the specification documents how to represent date concepts like the following in a machine-readable way.

Commonly Used Dates

Date Feature Example Item Format Example Date
Year Book with publication year YYYY 1902
Month Monthly journal issue YYYY-MM 1893-05
Day Letter YYYY-MM-DD 1924-03-03
Time Born-digital photo YYYY-MM-DDTHH:MM:SS 2003-12-27T11:09:08
Interval Compiled court documents YYYY/YYYY 1887/1889
Season Seasonal magazine issue YYYY-SS 1957-23
Decade WWII poster YYYu 194u
Approximate Map “circa 1886” YYYY~ 1886~

Some Complex Dates

Example Item Kind of Date Format Example Date
Photo taken at some point during an event August 6-9, 1992 One of a Set [YYYY..YYYY] [1992-08-06..1992-08-09]
Hand-carved object, “circa 1870s” Extended Interval (L1) YYYY~/YYYY~ 1870~/1879~
Envelope with a partially-legible postmark Unspecified “u” in place of digit(s) 18uu-08-1u
Map possibly created in 1607 or 1630 One of a Set, Uncertain [YYYY, YYYY] [1607?, 1630?]

The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).

Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time.  As a followup to that presentation Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.

EDTF use in the DPLA

When the DPLA introduced its Metadata Application Profile (MAP) I noticed that there was mention of the EDTF as one of the ways that dates could be expressed.  In the 3.1 profile it was mentioned in both the dpla:SourceResource.date property syntax schema as well as the edm:TimeSpan class for all of its properties.  In the 4.0 profile this changed a bit, with EDTF removed from the dpla:SourceResource.date property as a syntax schema and from the edm:TimeSpan “Original Source Date” property, while it was kept in the edm:TimeSpan “Begin” and “End” properties.

Because of this mention, and the knowledge that the Portal to Texas History, which is a Service-Hub, is contributing records with EDTF dates, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.

  • How many date values in the DPLA are valid EDTF values?
  • How are these valid EDTF values distributed across the Hubs?
  • What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
  • What are the most common date format patterns used in the DPLA?

With these questions in mind I started the analysis.

Preparing the Dataset

I used the same dataset that I had created for some previous work that consisted of a data download from the DPLA of their Bulk Metadata Download in February 2015. This download contained 8,xxx,xxx records that I used for the analysis.

I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on Github) to classify each date present in each record as either valid EDTF or not valid.  Additionally I tested which level of EDTF each value conformed to.  Finally I identified the pattern of each of the date fields by converting all digits in the string to 0, converting all alpha characters to a, and leaving all non-alphanumeric characters unchanged.

This resulted in the following fields being indexed for each date:

Field Value
date 2014-04-04
date_valid_edtf true
date_level0_feature true
date_level1_feature false
date_level2_feature false
date_pattern 0000-00-00
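
As a rough, self-contained sketch of how each date value could be turned into that set of fields (this is not the actual indexing code): the regular expression below is a crude stand-in for the real ExtendedDateTimeFormat validation module and only recognizes a few common Level 0 shapes, and the function reuses the get_date_pattern function shown earlier on this page.

import re

# Crude stand-in for the real EDTF validator: it only recognizes a few
# common Level 0 shapes (yyyy, yyyy-mm, yyyy-mm-dd, and intervals of those).
LEVEL0_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?(/\d{4}(-\d{2}(-\d{2})?)?)?$")

def build_date_fields(date_string):
    looks_like_level0 = bool(LEVEL0_RE.match(date_string))
    return {
        "date": date_string,
        "date_valid_edtf": looks_like_level0,
        "date_level0_feature": looks_like_level0,
        "date_level1_feature": False,  # the stand-in does not check Level 1
        "date_level2_feature": False,  # or Level 2 features
        "date_pattern": get_date_pattern(date_string),
    }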

For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.

After several hours of indexing these values into Solr,  I was able to start answering some of the questions mentioned above.

Date usage in the DPLA

The first thing that I looked at was how many of the records in the DPLA dataset had dates vs the records that were missing dates.  Of the 8,012,390 items in my copy of the DPLA dataset,  6,624,767 (83%) had a date value with 1,387,623 (17%) missing any date information.

I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.

Hub Name Items Items With Date Items With Date % Items Missing Date Items Missing Date %
ARTstor 56,342 49,908 88.6% 6,434 11.4%
Biodiversity Heritage Library 138,288 29,000 21.0% 109,288 79.0%
David Rumsey 48,132 48,132 100.0% 0 0.0%
Digital Commonwealth 124,804 118,672 95.1% 6,132 4.9%
Digital Library of Georgia 259,640 236,961 91.3% 22,679 8.7%
Harvard Library 10,568 6,957 65.8% 3,611 34.2%
HathiTrust 1,915,159 1,881,588 98.2% 33,571 1.8%
Internet Archive 208,953 194,454 93.1% 14,499 6.9%
J. Paul Getty Trust 92,681 92,494 99.8% 187 0.2%
Kentucky Digital Library 127,755 87,061 68.1% 40,694 31.9%
Minnesota Digital Library 40,533 39,708 98.0% 825 2.0%
Missouri Hub 41,557 34,742 83.6% 6,815 16.4%
Mountain West Digital Library 867,538 634,571 73.1% 232,967 26.9%
National Archives and Records Administration 700,952 553,348 78.9% 147,604 21.1%
North Carolina Digital Heritage Center 260,709 214,134 82.1% 46,575 17.9%
Smithsonian Institution 897,196 675,648 75.3% 221,548 24.7%
South Carolina Digital Library 76,001 52,328 68.9% 23,673 31.1%
The New York Public Library 1,169,576 791,912 67.7% 377,664 32.3%
The Portal to Texas History 477,639 424,342 88.8% 53,297 11.2%
United States Government Printing Office (GPO) 148,715 148,548 99.9% 167 0.1%
University of Illinois at Urbana-Champaign 18,103 14,273 78.8% 3,830 21.2%
University of Southern California. Libraries 301,325 269,880 89.6% 31,445 10.4%
University of Virginia Library 30,188 26,072 86.4% 4,116 13.6%

Presence of Dates by Hub Name

I was surprised by the high percentage of dates in records for many of the Hubs in the DPLA; the only Hub that had more records without dates than with dates was the Biodiversity Heritage Library.  There were some Hubs, notably David Rumsey, HathiTrust, J. Paul Getty Trust, and the Government Printing Office, that have dates for more than 98% of their items in the DPLA.  This is most likely because of the kinds of data they are providing, or the fact that dates are required to identify which items can be shared (HathiTrust).

When you look at Content-Hubs vs Service-Hubs you see the following.

Hub Type Items Items With Date Items With Date % Items Missing Date Items Missing Date %
Content-Hub 5,736,178 4,782,214 83.4% 953,964 16.6%
Service-Hub 2,276,176 1,842,519 80.9% 433,657 19.1%

It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.

Valid EDTF Dates

I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid based on the EDTF specification.  It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.

EDTF Valid vs Not Valid

So the split is pretty close.

One of the things that should be mentioned is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF.  For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and also at some of the EDTF features used by Hubs in the DPLA.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 6 – Average Edit Duration by Facet

This is the sixth post in a series of posts related to metadata edit events for the UNT Libraries’ Digital Collections from January 1, 2014 to December 31, 2014.  If you are interested in the previous posts in this series,  they talked about the when, what, who, duration based on time buckets and finally calculating the average edit event time.

In the previous post I was able to come up with what I’m using as the edit event duration ceiling for the rest of this analysis.  This means that the rest of the analysis in this post will ignore the events that took longer than 2,100 seconds; after removing those 2,306 events, this leaves us with 91,916 valid events to analyze (97.6% of the original dataset).

Editors

The table below shows the user stats for our top ten editors once edits over 2,100 seconds are ignored.

username min max edit events duration sum mean stddev
htarver 2 2,083 15,346 1,550,926 101.06 132.59
aseitsinger 3 2,100 9,750 3,920,789 402.13 437.38
twarner 5 2,068 4,627 184,784 39.94 107.54
mjohnston 3 1,909 4,143 562,789 135.84 119.14
atraxinger 3 2,099 3,833 1,192,911 311.22 323.02
sfisher 5 2,084 3,434 468,951 136.56 241.99
cwilliams 4 2,095 3,254 851,369 261.64 340.47
thuang 4 2,099 3,010 770,836 256.09 397.57
mphillips 3 888 2,669 57,043 21.37 41.32
sdillard 3 2,052 2,516 1,599,329 635.66 388.3

You can see that many of these users have very short minimum edit times, and all but one have maximum edit times that approach the duration ceiling.  The average amount of time spent per edit event ranges from 21 seconds to 10 minutes and 35 seconds.

I know that for user mphillips (me) the bulk of the work I tend to do in the edit system is fixing quick mistakes like missing language codes, editing dates that aren’t in Extended Date Time Format (EDTF), or hiding and un-hiding records.  Other users such as sdillard have been working exclusively on a project to create metadata for a collection of Texas Patents that we are describing in the Portal.

Collections

The most edited collections and their statistics are presented below.

Collection Code Collection Name min max edit events duration sum mean stddev
ABCM Abilene Library Consortium 2 2,083 8,418 1,358,606 161.39 240.36
JBPC Jim Bell Texas Architecture Photograph Collection 3 2,100 5,335 2,576,696 482.98 460.03
JJHP John J. Herrera Papers 3 2,095 4,940 1,358,375 274.97 346.46
ODNP Oklahoma Digital Newspaper Program 5 2,084 3,946 563,769 142.87 243.83
OKPCP Oklahoma Publishing Company Photography Collection 4 2,098 5,692 869,276 152.72 280.99
TCO Texas Cultures Online 3 2,095 5,221 1,406,347 269.36 343.87
TDNP Texas Digital Newspaper Program 2 1,989 7,614 1,036,850 136.18 185.41
TLRA Texas Laws and Resolutions Archive 3 2,097 8,600 1,050,034 122.1 172.78
TXPT Texas Patents 2 2,099 6,869 3,740,287 544.52 466.05
TXSAOR Texas State Auditor’s Office: Reports 3 1,814 2,724 428,628 157.35 142.94
UNTETD UNT Theses and Dissertations 5 2,098 4,708 1,603,857 340.67 474.53
UNTPC University Photography Collection 3 2,096 4,408 1,252,947 284.24 340.36

This data is a little easier to see with a graph.

Average edit duration per collection

Here is my interpretation of what I see in these numbers based on personal knowledge of these collections.

The collections with the highest average duration are the TXPT and JBPC collections, followed by the UNTETD, UNTPC, TCO, and JJHP collections.  The first two, Texas Patents (TXPT) and the Jim Bell Texas Architecture Photograph Collection (JBPC), are examples of collections that were having metadata records created for the first time via our online editing system.  These collections generally required more investigation (either by reading the patent or researching the photograph) and therefore took more time on average to create the records.

Two of the others, the UNT Theses and Dissertations Collection (UNTETD) and the University Photography Collection (UNTPC), involved an amount of copy cataloging for the creation of the metadata, either from existing MARC records or local finding aids.  The John J. Herrera Papers (JJHP) involved, I believe, working with an existing finding aid, and I know that there was a two-step process of creating the record and then publishing it as un-hidden in a separate event, therefore lowering the average time considerably.  I don’t know enough about the Texas Cultures Online (TCO) work in 2014 to be able to comment there.

On the other end of the spectrum you have collections like ABCM, ODNP, OKPCP, and TDNP, which were projects that averaged a much shorter amount of time per record.  For these there were many small edits to the records that were typically completed one field at a time.  For some of these it might have just involved fixing a consistent typo, adding the record to a collection, or hiding or un-hiding it from public view.

This raises a question for me: is it possible to detect the “kind” of edits that are being made based on their average edit times?  That’s something to look at.

Partner Institutions

And now the ten partner institutions that had the most metadata edit events.

Partner Code Partner Name min max edit events duration sum mean stddev
UNTGD UNT Libraries Government Documents Department 2 2,099 21,342 5,385,000 252.32 356.43
OKHS Oklahoma Historical Society 4 2,098 10,167 1,590,498 156.44 279.95
UNTA UNT Libraries Special Collections 3 2,099 9,235 2,664,036 288.47 362.34
UNT UNT Libraries 2 2,098 6,755 2,051,851 303.75 458.03
PCJB Private Collection of Jim Bell 3 2,100 5,335 2,576,696 482.98 460.03
HMRC Houston Metropolitan Research Center at Houston Public Library 3 2,095 5,127 1,397,368 272.55 345.62
HPUL Howard Payne University Library 2 1,860 4,528 544,420 120.23 113.97
UNTCVA UNT College of Visual Arts + Design 4 2,098 4,169 1,015,882 243.68 364.92
HSUL Hardin-Simmons University Library 3 2,020 2,706 658,600 243.39 361.66
HIGPL Higgins Public Library 2 1,596 1,935 131,867 68.15 118.5

Again presented as a simple chart.

Average edit duration per partner.

It is easy to see the difference between the Private Collection of Jim Bell (PCJB), with an average of 482 seconds or roughly 8 minutes per edit, and the Higgins Public Library (HIGPL), which had an average of 68 seconds, or just over one minute.  In the first case, the Private Collection of Jim Bell (PCJB), we were actively creating records for these items for the first time, and the average of eight minutes seems to track with what one would imagine it takes to create a metadata record for a photograph.  The Higgins Public Library (HIGPL) collection is a newspaper collection that had a single change in the physical description made to all of the items in that partner’s collection.  Other partners fall between these two extremes and have similar characteristics, with the lower edit averages happening for content that is either being edited in a small way or being hidden or un-hidden from view.

Resource Type

The final way we will slice the data for this post is by looking at the stats for the top ten resource types.

resource type min max count sum mean stddev
image_photo 2 2,100 30,954 7,840,071 253.28 356.43
text_newspaper 2 2,084 11,546 1,600,474 138.62 207.3
text_leg 3 2,097 8,604 1,050,103 122.05 172.75
text_patent 2 2,099 6,955 3,747,631 538.84 466.25
physical-object 2 2,098 5,479 1,102,678 201.26 326.21
text_etd 5 2,098 4,713 1,603,938 340.32 474.4
text 3 2,099 4,196 1,086,765 259 349.67
text_letter 4 2,095 4,106 1,118,568 272.42 326.09
image_map 3 2,034 3,480 673,707 193.59 354.19
text_report 3 1,814 3,339 465,168 139.31 145.96

Average edit duration for the top ten resource types

The resource type that really stands out in this graph is text_patent at 538 seconds per record.  These items belong to the Texas Patent Collection; they were loaded into the system with very minimal records and we have been working to add new metadata to these resources.  The roughly nine minutes per record seems to be very standard for the amount of work that is being done with these records.

The text_leg resource type is one that I wanted to take another quick look at.

If we calculate the statistics for the users that edited records in this collection we get the following data.

username min max count sum mean stddev
bmonterroso 3 1,825 890 85,254 95.79 163.25
htarver 9 23 5 82 16.4 5.64
mjohnston 3 1,909 3,309 329,585 99.6 62.08
mphillips 5 33 30 485 16.17 7.68
rsittel 3 1,436 654 22,168 33.9 88.71
tharden 3 2,097 1,143 213,817 187.07 241.2
thuang 4 1,812 2,573 398,712 154.96 227.7

Again you really see it with the graph.

Average edit duration for users who edited records that were the text_leg resource type

In this you see that there were a few users (htarver, mphillips, rsittel) who brought down the average duration because they had very quick edits, while the rest of the editors averaged either right around 100 seconds or around two minutes per edit.

I think that there is more to do with these numbers; calculating the average total duration for a given metadata record in the system as edits are performed on it will be something of interest for a later post.  So check back for the next post in this series.

As always feel free to contact me via Twitter if you have questions or comments.

Metadata Edit Events: Part 5 – Identifying an average metadata editing time.

This is the fifth post in a series of posts related to metadata edit events for the UNT Libraries’ Digital Collections from January 1, 2014 to December 31, 2014.  If you are interested in the previous posts in this series,  they talked about the when, what, who, and first steps of duration.

In this post we are going to try and come up with the “average” amount of time spent on metadata edits in the dataset.

The first thing I wanted to do was to figure out which of the values mentioned in the previous post about duration buckets I could ignore as noise in the dataset.

As a reminder, the duration data for a metadata edit event starts when a user opens a metadata record in the edit system and finishes when they submit the record back to the system as a publish event.  The duration is the difference in seconds between those two timestamps.
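
In code the calculation itself is trivial; with a pair of made-up timestamps:

from datetime import datetime

opened    = datetime(2014, 6, 3, 14, 2, 11)   # record opened in the edit system
published = datetime(2014, 6, 3, 14, 7, 46)   # record published back to the system

duration = (published - opened).total_seconds()  # 335.0 seconds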

There are a number of factors that can cause the duration data to vary wildly: a user can have a number of tabs open at the same time while only working on one of them.  They may open a record and then walk off without editing that record.  They could also be using a browser automation tool like Selenium that automates the metadata edits and therefore pushes the edit time down considerably.

In doing some tests of my own editing skills it isn’t unreasonable to have edits that are four or five seconds in duration if you are going in to change a known value from a simple dropdown. For example adding a language code to a photograph that you know should be “no-language” doesn’t take much time at all.

My gut feeling based on the data in the previous post was to say that edits that have a duration of over one hour should be considered outliers.  This would remove 844 events from the total 94,222 edit events, leaving me with 93,378 (99%) of the events.  This seemed like a logical first step but I was curious if there were other ways of approaching this.

I had a chat with the UNT Libraries’ Director of Research & Assessment Jesse Hamner and he suggested a few methods for me to look at.

IQR for calculating outliers

I took a stab at using the Interquartile Range of the dataset as the basis for identifying the outliers.  With a little bit of R I was able to find the following information about the duration dataset.

 Min.   :     2.0  
 1st Qu.:    29.0  
 Median :    97.0  
 Mean   :   363.8  
 3rd Qu.:   300.0  
 Max.   :431644.0  

With that I have a Q1 of 29 and a Q3 of 300, which gives me an IQR of 271.

So the outlier fences are Q1 - 1.5 × IQR on the low end and Q3 + 1.5 × IQR on the high end.

Plugging in the numbers, that says that values under -377.5 or over 706.5 should be considered outliers.
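
The same fences can be computed in Python rather than R; here is a minimal sketch, assuming durations is a list or array of the edit durations in seconds:

import numpy as np

def iqr_fences(durations):
    # first and third quartiles of the edit durations in seconds
    q1, q3 = np.percentile(durations, [25, 75])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is an outlier
    high_fence = q3 + 1.5 * iqr   # anything above this is an outlier
    return low_fence, high_fence

With the quartiles reported above (Q1 = 29, Q3 = 300) the fences work out to -377.5 and 706.5.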

Note: I’m pretty sure there are some different ways of dealing with IQR and datasets that end at zero, so that’s something to investigate.

For me the key here is that I’ve come up with 706.5 seconds as the ceiling for a valid event duration based on this method.  That’s 11 minutes and 47 seconds.  If I limit the dataset to edit events that are under 707 seconds, I am left with 83,239 records.  That is now just 88% of the dataset, with 12% being considered outliers.  This seemed like too many records to ignore, so after talking with my resident expert in the library I had a new method.

Two Standard Deviations

I took a look at what the timings would look like if I based my outliers on the standard deviations.  Edit events that are under 1,300 seconds (21 min 40 sec) in duration amount to 89,547, which is 95% of the values in the dataset.  I also wanted to see what 2.5% of the dataset would look like.  Edit durations under 2,100 seconds (35 minutes) result in 91,916 usable edit events for calculations, which is right at 97.6%.
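
Another way to arrive at ceilings like these is to ask for the percentiles directly; a small sketch, again assuming durations holds the edit durations in seconds:

import numpy as np

def duration_ceilings(durations, shares=(95, 97.5)):
    # returns the duration values below which roughly 95% and 97.5% of the
    # edit events fall (about 1,300 and 2,100 seconds for this dataset)
    return np.percentile(durations, list(shares))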

Comparing the methods

The following table takes the four duration ceilings that I tried (IQR, 95%, 97.5%, and the gut-feeling one hour) and makes them a bit more readable.  The total number of duration events in the dataset before limiting is 94,222.

Duration Ceiling Events Remaining Events Removed % remaining
707 83,239 10,983 88%
1,300 89,547 4,675 95%
2,100 91,916 2,306 97.6%
3,600 93,378 844 99%

Just for kicks I calculated the average time spent on editing records across the datasets that remained for the various cutoffs to get an idea how the ceilings changed things.

Duration Ceiling Events Included Events Ignored Mean Stddev Sum Average Edit Duration Total Edit Hours
707 83,239 10,983 140.03 160.31 11,656,340 2:20 3,238
1,300 89,547 4,675 196.47 260.44 17,593,387 3:16 4,887
2,100 91,916 2,306 233.54 345.48 21,466,240 3:54 5,963
3,600 93,378 844 272.44 464.25 25,440,348 4:32 7,067
431,644 94,222 0 363.76 2311.13 34,274,434 6:04 9,521

In the table above you can see what the different duration ceilings do to the data analyzed.  I calculated the mean of the various datasets and their standard deviations (really Solr’s StatsComponent did that).  I converted those means into minutes and seconds in the “Average Edit Duration” column, and the final column is the number of person hours that were spent editing metadata in 2014 based on the various datasets.
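
For what it’s worth, per-field statistics like these (min, max, sum, mean, stddev) can be pulled from Solr’s StatsComponent with a single query that returns no documents.  A sketch along those lines, where the core name and field name are made up and the exact response layout may differ between Solr versions:

import requests

# "edit_events" and "duration" are hypothetical core/field names;
# stats=true and stats.field are the standard StatsComponent parameters.
params = {
    "q": "*:*",          # match every edit event
    "rows": 0,           # we only want the statistics, not the documents
    "stats": "true",
    "stats.field": "duration",
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/edit_events/select", params=params)
print(response.json()["stats"]["stats_fields"]["duration"])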

Going forward I will be using 2,100 seconds as my duration ceiling and ignoring the edit events that took longer than that period of time.  I’m going to do a little work in figuring out the costs associated with metadata creation in our collections for the last year.  So check back for the next post in this series.

As always feel free to contact me via Twitter if you have questions or comments.