This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America. For more background on this topic take a look at the first post in this series.
EDTF Use by Hub
In the previous post I looked at the overall usage of EDTF in the DPLA and found that whether a Hub was classified as a Content Hub or a Service Hub made little difference to the availability of dates in item records. Content Hubs had some date value in 83% of their records and Service Hubs in 81%.
Of the dates that were present, 51% were valid EDTF strings and 49% were not.
It should be noted that many common date formats we use throughout our work are valid EDTF date strings even though they may not have been created with EDTF in mind. For example, 1999 and 2000-04-03 are both valid (and heavily used date_patterns) and are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.
I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.
Hub Name | Items With Date | % of total items with date present | Valid EDTF | Valid EDTF % | Not Valid EDTF | Not Valid EDTF % |
ARTstor | 49,908 | 88.6% | 26,757 | 53.6% | 23,151 | 46.4% |
Biodiversity Heritage Library | 29,000 | 21.0% | 22,734 | 78.4% | 6,266 | 21.6% |
David Rumsey | 48,132 | 100.0% | 48,132 | 100.0% | 0 | 0.0% |
Digital Commonwealth | 118,672 | 95.1% | 14,731 | 12.4% | 103,941 | 87.6% |
Digital Library of Georgia | 236,961 | 91.3% | 188,263 | 79.4% | 48,687 | 20.5% |
Harvard Library | 6,957 | 65.8% | 6,910 | 99.3% | 47 | 0.7% |
HathiTrust | 1,881,588 | 98.2% | 1,295,986 | 68.9% | 585,598 | 31.1% |
Internet Archive | 194,454 | 93.1% | 185,328 | 95.3% | 9,126 | 4.7% |
J. Paul Getty Trust | 92,494 | 99.8% | 6,319 | 6.8% | 86,175 | 93.2% |
Kentucky Digital Library | 87,061 | 68.1% | 87,061 | 100.0% | 0 | 0.0% |
Minnesota Digital Library | 39,708 | 98.0% | 33,201 | 83.6% | 6,507 | 16.4% |
Missouri Hub | 34,742 | 83.6% | 32,192 | 92.7% | 2,550 | 7.3% |
Mountain West Digital Library | 634,571 | 73.1% | 545,663 | 86.0% | 88,908 | 14.0% |
National Archives and Records Administration | 553,348 | 78.9% | 10,218 | 1.8% | 543,130 | 98.2% |
North Carolina Digital Heritage Center | 214,134 | 82.1% | 163,030 | 76.1% | 51,104 | 23.9% |
Smithsonian Institution | 675,648 | 75.3% | 44,860 | 6.6% | 630,788 | 93.4% |
South Carolina Digital Library | 52,328 | 68.9% | 42,128 | 80.5% | 10,200 | 19.5% |
The New York Public Library | 791,912 | 67.7% | 47,257 | 6.0% | 744,655 | 94.0% |
The Portal to Texas History | 424,342 | 88.8% | 416,835 | 98.2% | 7,505 | 1.8% |
United States Government Printing Office (GPO) | 148,548 | 99.9% | 17,894 | 12.0% | 130,654 | 88.0% |
University of Illinois at Urbana-Champaign | 14,273 | 78.8% | 11,304 | 79.2% | 2,969 | 20.8% |
University of Southern California. Libraries | 269,880 | 89.6% | 114,293 | 42.3% | 155,573 | 57.6% |
University of Virginia Library | 26,072 | 86.4% | 21,798 | 83.6% | 4,274 | 16.4% |
Turning this into a graph helps things show up a bit better.
A number of things can be teased out of this. First, a few Hubs already have 100% or nearly 100% of their dates conforming to EDTF, notably David Rumsey’s Hub and the Kentucky Digital Library, both at 100%. Harvard at 99% and the Portal to Texas History at 98% are also notable. On the other end of the spectrum we have the National Archives and Records Administration with 98% of their dates being Not Valid, the New York Public Library with 94%, and the J. Paul Getty Trust with 93%.
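The percentages quoted above follow directly from the counts in the table. As a small worked example, here is a sketch that recomputes them for three Hubs. The names `hub_counts` and `percent_valid` are made up for this illustration and are not part of the code used to build the dataset.

```python
# Counts copied from the table above: (items with a date, valid EDTF dates).
hub_counts = {
    "David Rumsey": (48_132, 48_132),
    "Digital Commonwealth": (118_672, 14_731),
    "National Archives and Records Administration": (553_348, 10_218),
}


def percent_valid(items_with_date, valid_edtf):
    """Share of a Hub's dates that are valid EDTF, as a percentage with one decimal."""
    return round(100 * valid_edtf / items_with_date, 1)


for hub, (items_with_date, valid_edtf) in hub_counts.items():
    valid = percent_valid(items_with_date, valid_edtf)
    not_valid = round(100 - valid, 1)
    print(f"{hub}: {valid}% valid, {not_valid}% not valid")
```

The denominator here is the number of items that have a date at all, not the total number of items a Hub contributed.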
Use of EDTF Level Features
The EDTF has the notion of feature levels: Level 0, Level 1, and Level 2. Level 0 covers the basic date features such as date, date and time, and intervals. Level 1 adds features like uncertain/approximate dates, unspecified, extended intervals, years exceeding four digits, and seasons to the mix. Level 2 adds to the feature set with partial uncertain/approximate dates, partial unspecified, sets, multiple dates, masked precision, and extensions of the extended interval and years exceeding four digits. Finally, Level 2 lets you qualify seasons. For a full list of the features please take a look at the draft specification at the Library of Congress.
When I was preparing the dataset I also tested the dates to see which feature level they matched. After starting the analysis I noticed a few bugs in my testing code and added them as issues to the GitHub site for the ExtendedDateTimeFormat Python module available here. Even with the bugs, which falsely identified one feature as both a Level0 and a Level1 feature and another feature as both Level1 and Level2, I was able to come up with usable data for further analysis. Because of these bugs, a few Hubs in the list below have a slightly different number of valid EDTF items than in the list presented in the first part of this post.
Hub Name | valid EDTF items | valid-level0 | % Level0 | valid-level1 | % Level1 | valid-level2 | % Level2 |
ARTstor | 26,757 | 26,726 | 99.9% | 31 | 0.1% | 0 | 0.0% |
Biodiversity Heritage Library | 22,734 | 22,702 | 99.9% | 32 | 0.1% | 0 | 0.0% |
David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0% | 0 | 0.0% |
Digital Commonwealth | 14,731 | 14,731 | 100.0% | 0 | 0.0% | 0 | 0.0% |
Digital Library of Georgia | 188,274 | 188,274 | 100.0% | 0 | 0.0% | 0 | 0.0% |
Harvard Library | 6,910 | 6,822 | 98.7% | 83 | 1.2% | 5 | 0.1% |
HathiTrust | 1,295,990 | 1,292,079 | 99.7% | 3,662 | 0.3% | 249 | 0.0% |
Internet Archive | 185,328 | 185,115 | 99.9% | 212 | 0.1% | 1 | 0.0% |
J. Paul Getty Trust | 6,319 | 6,308 | 99.8% | 11 | 0.2% | 0 | 0.0% |
Kentucky Digital Library | 87,061 | 87,061 | 100.0% | 0 | 0.0% | 0 | 0.0% |
Minnesota Digital Library | 33,201 | 26,055 | 78.5% | 7,146 | 21.5% | 0 | 0.0% |
Missouri Hub | 32,192 | 32,190 | 100.0% | 2 | 0.0% | 0 | 0.0% |
Mountain West Digital Library | 545,663 | 542,388 | 99.4% | 3,274 | 0.6% | 1 | 0.0% |
National Archives and Records Administration | 10,218 | 10,003 | 97.9% | 215 | 2.1% | 0 | 0.0% |
North Carolina Digital Heritage Center | 163,030 | 162,958 | 100.0% | 72 | 0.0% | 0 | 0.0% |
Smithsonian Institution | 44,860 | 44,642 | 99.5% | 218 | 0.5% | 0 | 0.0% |
South Carolina Digital Library | 42,128 | 42,079 | 99.9% | 49 | 0.1% | 0 | 0.0% |
The New York Public Library | 47,257 | 47,251 | 100.0% | 6 | 0.0% | 0 | 0.0% |
The Portal to Texas History | 416,838 | 402,845 | 96.6% | 6,302 | 1.5% | 7,691 | 1.8% |
United States Government Printing Office (GPO) | 17,894 | 16,165 | 90.3% | 875 | 4.9% | 854 | 4.8% |
University of Illinois at Urbana-Champaign | 11,304 | 11,275 | 99.7% | 29 | 0.3% | 0 | 0.0% |
University of Southern California. Libraries | 114,307 | 114,307 | 100.0% | 0 | 0.0% | 0 | 0.0% |
University of Virginia Library | 21,798 | 21,558 | 98.9% | 236 | 1.1% | 4 | 0.0% |
Looking at the top 25% of the data, you get the following.
Obviously the majority of dates in the DPLA that are valid EDTF comply with Level0, which includes standard dates like year (1900), year and month (1900-03), year, month and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals with any of the dates (yyyy, yyyy-mm, yyyy-mm-dd) in the format of 2004-02/2014-03-23.
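To make the Level0 forms listed above concrete, here is a minimal sketch of a checker for just those patterns. It is an assumption-laden illustration and not the ExtendedDateTimeFormat module's API: the function names are hypothetical, time zone suffixes on date-and-time values are ignored, and edge cases in the draft specification are not covered.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration only; not the ExtendedDateTimeFormat API.
_DATE_FORMATS = {4: "%Y", 7: "%Y-%m", 10: "%Y-%m-%d"}
_DATE_RE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
_DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def _parses(value, fmt):
    """Return True if strptime accepts the value (checks month, day, time ranges)."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def is_simple_date(value):
    """yyyy, yyyy-mm, or yyyy-mm-dd with a real calendar month and day."""
    if not _DATE_RE.fullmatch(value):
        return False
    return _parses(value, _DATE_FORMATS[len(value)])


def is_simple_datetime(value):
    """yyyy-mm-ddThh:mm:ss; time zone suffixes are deliberately not handled."""
    if not _DATETIME_RE.fullmatch(value):
        return False
    return _parses(value, "%Y-%m-%dT%H:%M:%S")


def looks_like_level0(value):
    """Rough check for the Level0 forms named in the text above."""
    if "/" in value:
        parts = value.split("/")
        return len(parts) == 2 and all(is_simple_date(p) for p in parts)
    return is_simple_date(value) or is_simple_datetime(value)


for candidate in ["1900", "1900-03", "2004-02/2014-03-23", "circa 1900"]:
    print(candidate, looks_like_level0(candidate))
```

A check like this also shows why so many dates count as valid EDTF without anyone intending it: a plain four-digit year passes.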
A number of Hubs are making use of Level 1 and Level 2 features, the most notable being the Minnesota Digital Library, which uses Level 1 features in 21.5% of its valid EDTF item records. The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of its valid EDTF items) and GPO in 854 item records (4.8%).
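As noted before the level table, the testing bugs mean a few Hubs show slightly different valid EDTF counts in the two tables. The sketch below compares the counts for a handful of Hubs, copied from both tables. The dictionary names are made up for this illustration.

```python
# Valid EDTF counts copied from the first table in this post.
first_table = {
    "ARTstor": 26_757,
    "Digital Library of Georgia": 188_263,
    "HathiTrust": 1_295_986,
    "Kentucky Digital Library": 87_061,
    "The Portal to Texas History": 416_835,
    "University of Southern California. Libraries": 114_293,
}

# Valid EDTF counts for the same Hubs copied from the feature level table.
level_table = {
    "ARTstor": 26_757,
    "Digital Library of Georgia": 188_274,
    "HathiTrust": 1_295_990,
    "Kentucky Digital Library": 87_061,
    "The Portal to Texas History": 416_838,
    "University of Southern California. Libraries": 114_307,
}

# Keep only the Hubs whose counts differ, with the size of the difference.
differences = {
    hub: level_table[hub] - first_table[hub]
    for hub in first_table
    if level_table[hub] != first_table[hub]
}

for hub, delta in differences.items():
    print(f"{hub}: {delta:+d} items in the level table")
```

The differences are tiny compared with the size of each Hub's collection, which is why the level data is still usable for analysis.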
I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.
Feel free to contact me via Twitter if you have questions or comments.