Extended Date Time Format (EDTF) use in the DPLA: Part 2, EDTF use by Hub

This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America.  For more background on this topic take a look at the first post in this series.

EDTF Use by Hub

In the previous post I looked at the overall usage of the EDTF in the DPLA and found that it didn’t matter that much if a Hub was classified as a Content Hub or a Service Hub when it came to looking at the availability of dates in item records in the system.  Content Hubs had 83% of their records with some date value and Service Hubs had 81% of their records with date values.

Of the dates that were present, 51% were valid EDTF strings and 49% were not.

One thing to note is that many common date formats we use throughout our work are valid EDTF date strings even though they may not have been created with the idea of supporting EDTF.  For example, 1999 and 2000-04-03 are both valid (and heavily used) date patterns that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.

| Hub Name | Items With Date | % of total items with date present | Valid EDTF | Valid EDTF % | Not Valid EDTF | Not Valid EDTF % |
|---|---|---|---|---|---|---|
| ARTstor | 49,908 | 88.6% | 26,757 | 53.6% | 23,151 | 46.4% |
| Biodiversity Heritage Library | 29,000 | 21.0% | 22,734 | 78.4% | 6,266 | 21.6% |
| David Rumsey | 48,132 | 100.0% | 48,132 | 100.0% | 0 | 0.0% |
| Digital Commonwealth | 118,672 | 95.1% | 14,731 | 12.4% | 103,941 | 87.6% |
| Digital Library of Georgia | 236,961 | 91.3% | 188,263 | 79.4% | 48,687 | 20.5% |
| Harvard Library | 6,957 | 65.8% | 6,910 | 99.3% | 47 | 0.7% |
| HathiTrust | 1,881,588 | 98.2% | 1,295,986 | 68.9% | 585,598 | 31.1% |
| Internet Archive | 194,454 | 93.1% | 185,328 | 95.3% | 9,126 | 4.7% |
| J. Paul Getty Trust | 92,494 | 99.8% | 6,319 | 6.8% | 86,175 | 93.2% |
| Kentucky Digital Library | 87,061 | 68.1% | 87,061 | 100.0% | 0 | 0.0% |
| Minnesota Digital Library | 39,708 | 98.0% | 33,201 | 83.6% | 6,507 | 16.4% |
| Missouri Hub | 34,742 | 83.6% | 32,192 | 92.7% | 2,550 | 7.3% |
| Mountain West Digital Library | 634,571 | 73.1% | 545,663 | 86.0% | 88,908 | 14.0% |
| National Archives and Records Administration | 553,348 | 78.9% | 10,218 | 1.8% | 543,130 | 98.2% |
| North Carolina Digital Heritage Center | 214,134 | 82.1% | 163,030 | 76.1% | 51,104 | 23.9% |
| Smithsonian Institution | 675,648 | 75.3% | 44,860 | 6.6% | 630,788 | 93.4% |
| South Carolina Digital Library | 52,328 | 68.9% | 42,128 | 80.5% | 10,200 | 19.5% |
| The New York Public Library | 791,912 | 67.7% | 47,257 | 6.0% | 744,655 | 94.0% |
| The Portal to Texas History | 424,342 | 88.8% | 416,835 | 98.2% | 7,505 | 1.8% |
| United States Government Printing Office (GPO) | 148,548 | 99.9% | 17,894 | 12.0% | 130,654 | 88.0% |
| University of Illinois at Urbana-Champaign | 14,273 | 78.8% | 11,304 | 79.2% | 2,969 | 20.8% |
| University of Southern California. Libraries | 269,880 | 89.6% | 114,293 | 42.3% | 155,573 | 57.6% |
| University of Virginia Library | 26,072 | 86.4% | 21,798 | 83.6% | 4,274 | 16.4% |

Turning this into a graph helps things show up a bit better.

EDTF info for each of the DPLA Hubs


A number of things can be teased out of this.  First, a few Hubs already have 100% or nearly 100% of their dates conforming to EDTF, notably David Rumsey’s Hub and the Kentucky Digital Library, both at 100%.  Harvard at 99% and the Portal to Texas History at 98% are also notable.  At the other end of the spectrum, the National Archives and Records Administration has 98% of its dates Not Valid, followed by the New York Public Library at 94%, and the Smithsonian Institution and the J. Paul Getty Trust at 93% each.
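For anyone who wants to check the percentages, here is a small sketch of the arithmetic behind the "Valid EDTF %" column for the Hubs mentioned above. The counts are copied from the table; the names `hubs`, `valid_share`, and `ranked` are just illustrative and are not part of the code used to build the dataset.

```python
# (items_with_date, valid_edtf) copied from the table for a handful of Hubs.
hubs = {
    "David Rumsey": (48_132, 48_132),
    "Kentucky Digital Library": (87_061, 87_061),
    "Harvard Library": (6_957, 6_910),
    "The Portal to Texas History": (424_342, 416_835),
    "J. Paul Getty Trust": (92_494, 6_319),
    "The New York Public Library": (791_912, 47_257),
    "National Archives and Records Administration": (553_348, 10_218),
}


def valid_share(items_with_date, valid_edtf):
    """Return the percentage of dated items whose date is valid EDTF."""
    return round(100.0 * valid_edtf / items_with_date, 1)


# Rank the Hubs from most to least EDTF-conformant.
ranked = sorted(
    ((name, valid_share(*counts)) for name, counts in hubs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, pct in ranked:
    print(f"{name}: {pct}% valid, {round(100 - pct, 1)}% not valid")
```

The same calculation applies to every row in the table, so the full ranking can be reproduced by adding the remaining Hubs to the dictionary.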

Use of EDTF Level Features

The EDTF has the notion of feature levels: Level 0, Level 1, and Level 2.  Level 0 covers the basic date features such as date, date and time, and intervals.  Level 1 adds features like uncertain/approximate dates, unspecified dates, extended intervals, years exceeding four digits, and seasons.  Level 2 adds partial uncertain/approximate dates, partial unspecified dates, sets, multiple dates, masked precision, and extensions of the extended interval and years exceeding four digits.  Finally, Level 2 lets you qualify seasons.  For a full list of the features please take a look at the draft specification at the Library of Congress.
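To make the Level 1 features a bit more concrete, here is a rough sketch that recognizes a few of them with regular expressions. The patterns are assumptions based on a reading of the examples in the draft specification (for instance 1984?, 1984~, 199u, 2001-21, and y170000002). They cover only a small subset of Level 1, skip extended intervals entirely, and do not check calendar validity. This is not how the ExtendedDateTimeFormat module works, and the names `LEVEL1_FEATURES` and `level1_feature` are made up for illustration.

```python
import re

# Illustrative patterns only: a small, assumed subset of Level 1 features.
# A plain year, year-month, or year-month-day (digits only, no range checks).
DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"

LEVEL1_FEATURES = {
    # A date followed by "?", "~", or "?~", e.g. 1984? or 2004-06~
    "uncertain/approximate": re.compile(rf"^{DATE}(?:\?~|\?|~)$"),
    # "u" standing in for unknown digits, e.g. 199u, 19uu, 1999-uu
    "unspecified": re.compile(
        r"^(?:\d{3}u|\d{2}uu|\d{4}-uu|\d{4}-\d{2}-uu|\d{4}-uu-uu)$"
    ),
    # Seasons written as "months" 21 through 24, e.g. 2001-21
    "season": re.compile(r"^\d{4}-2[1-4]$"),
    # Years with more than four digits, prefixed with "y"
    "year exceeding four digits": re.compile(r"^y-?\d{5,}$"),
}


def level1_feature(value):
    """Return the name of the first matching Level 1 feature, or None."""
    for name, pattern in LEVEL1_FEATURES.items():
        if pattern.match(value):
            return name
    return None


for example in ["1984?", "1984~", "199u", "2001-21", "y170000002", "1999"]:
    print(example, "->", level1_feature(example))
```

A plain year such as 1999 matches none of these patterns, which is the expected result since it is a Level 0 date.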

When I was preparing the dataset I also tested the dates to see which feature level they matched.  After starting the analysis I noticed a few bugs in my testing code and added them as issues to the GitHub site for the ExtendedDateTimeFormat Python module available here.  Even with the bugs, which falsely identified one feature as both a Level 0 and Level 1 feature, and another feature as both Level 1 and Level 2, I was able to come up with usable data for further analysis.  Because of these bugs, a few Hubs in the table below differ slightly in their number of valid EDTF items from the table presented in the first part of this post.

| Hub Name | Valid EDTF items | Valid Level 0 | % Level 0 | Valid Level 1 | % Level 1 | Valid Level 2 | % Level 2 |
|---|---|---|---|---|---|---|---|
| ARTstor | 26,757 | 26,726 | 99.9% | 31 | 0.1% | 0 | 0.0% |
| Biodiversity Heritage Library | 22,734 | 22,702 | 99.9% | 32 | 0.1% | 0 | 0.0% |
| David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Commonwealth | 14,731 | 14,731 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Library of Georgia | 188,274 | 188,274 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Harvard Library | 6,910 | 6,822 | 98.7% | 83 | 1.2% | 5 | 0.1% |
| HathiTrust | 1,295,990 | 1,292,079 | 99.7% | 3,662 | 0.3% | 249 | 0.0% |
| Internet Archive | 185,328 | 185,115 | 99.9% | 212 | 0.1% | 1 | 0.0% |
| J. Paul Getty Trust | 6,319 | 6,308 | 99.8% | 11 | 0.2% | 0 | 0.0% |
| Kentucky Digital Library | 87,061 | 87,061 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Minnesota Digital Library | 33,201 | 26,055 | 78.5% | 7,146 | 21.5% | 0 | 0.0% |
| Missouri Hub | 32,192 | 32,190 | 100.0% | 2 | 0.0% | 0 | 0.0% |
| Mountain West Digital Library | 545,663 | 542,388 | 99.4% | 3,274 | 0.6% | 1 | 0.0% |
| National Archives and Records Administration | 10,218 | 10,003 | 97.9% | 215 | 2.1% | 0 | 0.0% |
| North Carolina Digital Heritage Center | 163,030 | 162,958 | 100.0% | 72 | 0.0% | 0 | 0.0% |
| Smithsonian Institution | 44,860 | 44,642 | 99.5% | 218 | 0.5% | 0 | 0.0% |
| South Carolina Digital Library | 42,128 | 42,079 | 99.9% | 49 | 0.1% | 0 | 0.0% |
| The New York Public Library | 47,257 | 47,251 | 100.0% | 6 | 0.0% | 0 | 0.0% |
| The Portal to Texas History | 416,838 | 402,845 | 96.6% | 6,302 | 1.5% | 7,691 | 1.8% |
| United States Government Printing Office (GPO) | 17,894 | 16,165 | 90.3% | 875 | 4.9% | 854 | 4.8% |
| University of Illinois at Urbana-Champaign | 11,304 | 11,275 | 99.7% | 29 | 0.3% | 0 | 0.0% |
| University of Southern California. Libraries | 114,307 | 114,307 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| University of Virginia Library | 21,798 | 21,558 | 98.9% | 236 | 1.1% | 4 | 0.0% |

Looking at the top 25% of the data, you get the following.

EDTF Level Use by Hub


Obviously the majority of dates in the DPLA that are valid EDTF comply with Level 0, which includes standard dates like year (1900), year and month (1900-03), year, month and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals using any of the date forms (yyyy, yyyy-mm, yyyy-mm-dd), written in the format 2004-02/2014-03-23.
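As a rough illustration of the Level 0 forms listed above, here is a minimal sketch of a checker for them. It is an approximation for illustration only, not the ExtendedDateTimeFormat module: it ignores time zone suffixes on date and time values, and it leans on Python's `datetime`, which rejects the year 0000. The function names `is_level0` and `_matches` are hypothetical.

```python
import re
from datetime import datetime

# Each Level 0 date form pairs a strict digit pattern with a strptime format.
# The pattern enforces zero-padding; strptime checks calendar validity.
_DATE_FORMS = [
    (re.compile(r"\d{4}"), "%Y"),
    (re.compile(r"\d{4}-\d{2}"), "%Y-%m"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
]
_DATETIME_FORMS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"), "%Y-%m-%dT%H:%M:%S"),
]


def _matches(value, forms):
    """Return True if value fits one of the forms and is a real date."""
    for pattern, fmt in forms:
        if pattern.fullmatch(value):
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                return False
            return True
    return False


def is_level0(value):
    """Check a string against the Level 0 forms described in the post."""
    if "/" in value:
        # Intervals: exactly two dates (not date-times) separated by "/".
        parts = value.split("/")
        return len(parts) == 2 and all(_matches(p, _DATE_FORMS) for p in parts)
    return _matches(value, _DATE_FORMS) or _matches(value, _DATETIME_FORMS)


for example in ["1900", "1900-03", "1900-03-03", "2014-03-03T13:23:50",
                "2004-02/2014-03-23", "1900-13", "March 1900"]:
    print(example, "->", is_level0(example))
```

Strings like "March 1900" or "1900-13" fail this check, which is the kind of value that ends up in the Not Valid EDTF column.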

A number of Hubs are making use of Level 1 and Level 2 features, the most notable being the Minnesota Digital Library, which uses Level 1 features in 21.5% of its item records with valid EDTF dates.  The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of its valid EDTF dates) and GPO in 854 item records (4.8% of its valid EDTF dates).

I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.

Feel free to contact me via Twitter if you have questions or comments.