Extended Date Time Format (EDTF) use in the DPLA: Part 1

I’ve been wanting to write a new series of posts for a while now, to get a better understanding of how the Extended Date Time Format (EDTF) is used in cultural heritage organizations and, more specifically, whether those date formats are making their way into the Digital Public Library of America (DPLA). Before I get started with the analysis of the more than eight million records in the DPLA, I want to give a little background on the EDTF itself and on some of the previous work in this area.

A Bitter Harvest

One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library on metadata harvesting. One text that I remember in particular was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers”, which discussed many of the issues that they ran into in trying to deal with dates from a variety of service providers. The same challenge was echoed by projects like the National Science Digital Library (NSDL), OAIster, and other aggregations of metadata over the years.

One thing that came out of many of these aggregation projects, and something that many of us are still dealing with today, is the fact that “dates are hard”.

Extended Date Time Format

A few years ago an initiative was started to address some of the challenges of representing dates for the kinds of things that we work with in cultural heritage organizations. This initiative, the Extended Date Time Format (EDTF), has a site at the Library of Congress and resulted in a draft specification for a profile of, or extension to, ISO 8601.

For example, the specification documents how to represent date concepts like the following in a machine-readable way.

Commonly Used Dates

Date Feature | Example Item | Format | Example Date
Year | Book with publication year | YYYY | 1902
Month | Monthly journal issue | YYYY-MM | 1893-05
Day | Letter | YYYY-MM-DD | 1924-03-03
Time | Born-digital photo | YYYY-MM-DDTHH:MM:SS | 2003-12-27T11:09:08
Interval | Compiled court documents | YYYY/YYYY | 1887/1889
Season | Seasonal magazine issue | YYYY-SS | 1957-23
Decade | WWII poster | YYYu | 194u
Approximate | Map “circa 1886” | YYYY~ | 1886~

Some Complex Dates

Example Item | Kind of Date | Format | Example Date
Photo taken at some point during an event August 6-9, 1992 | One of a Set | [YYYY..YYYY] | [1992-08-06..1992-08-09]
Hand-carved object, “circa 1870s” | Extended Interval (L1) | YYYY~/YYYY~ | 1870~/1879~
Envelope with a partially-legible postmark | Unspecified | “u” in place of digit(s) | 18uu-08-1u
Map possibly created in 1607 or 1630 | One of a Set, Uncertain | [YYYY, YYYY] | [1607?, 1630?]
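
To make a few of these formats concrete, here is a minimal Python sketch that recognizes some of the simpler rows from the Commonly Used Dates table with regular expressions. The names FEATURE_PATTERNS and classify are mine, made up for illustration. This covers only a small slice of the draft specification and is not how the UNT validation module works internally.

```python
import re

# Illustrative patterns for a handful of rows in the Commonly Used Dates table.
# This is NOT the full EDTF grammar; for instance, the Day pattern does not
# check that the day actually exists in the given month.
FEATURE_PATTERNS = [
    ("Year", re.compile(r"\d{4}")),
    ("Month", re.compile(r"\d{4}-(0[1-9]|1[0-2])")),
    ("Day", re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")),
    ("Season", re.compile(r"\d{4}-2[1-4]")),   # season codes 21-24
    ("Interval", re.compile(r"\d{4}/\d{4}")),
    ("Decade", re.compile(r"\d{3}u")),
    ("Approximate", re.compile(r"\d{4}~")),
]


def classify(value):
    """Return the first matching feature name, or None if nothing matches."""
    for name, pattern in FEATURE_PATTERNS:
        if pattern.fullmatch(value):
            return name
    return None


if __name__ == "__main__":
    for value in ["1902", "1893-05", "1957-23", "194u", "1886~", "circa 1886"]:
        print(value, "->", classify(value))
```

A free-text value like “circa 1886” matches none of the patterns, which is exactly the kind of date string that has to be cleaned up before it can be expressed in the EDTF.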

The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).

Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time. As a follow-up to that presentation, Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.

EDTF use in the DPLA

When the DPLA introduced its Metadata Application Profile (MAP) I noticed that the EDTF was mentioned as one of the ways that dates could be expressed. In the 3.1 profile it was mentioned both as a syntax schema for the dpla:SourceResource.date property and for all of the properties of the edm:TimeSpan class. In the 4.0 profile this changed a bit: the EDTF was removed as a syntax schema from the dpla:SourceResource.date property and from the edm:TimeSpan “Original Source Date” property, but kept for the edm:TimeSpan “Begin” and “End” properties.

Because of this mention, and knowing that the Portal to Texas History, which is a Service-Hub, is contributing records in the EDTF, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.

  • How many date values in the DPLA are valid EDTF values?
  • How are these valid EDTF values distributed across the Hubs?
  • What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
  • What are the most common date format patterns used in the DPLA?

With these questions in mind I started the analysis.

Preparing the Dataset

I used the same dataset that I had created for some previous work: a copy of the DPLA’s Bulk Metadata Download from February 2015. This download contained 8,xxx,xxx records that I used for the analysis.

I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on GitHub) to classify each date present in each record as either valid EDTF or not valid. Additionally, I tested which level of the EDTF each value conformed to. Finally, I identified the pattern of each date value by converting every digit in the string to 0 and every alphabetic character to x, and leaving all non-alphanumeric characters unchanged.

This resulted in the following fields being indexed for each date:

Field | Value
date | 2014-04-04
date_valid_edtf | true
date_level0_feature | true
date_level1_feature | false
date_level2_feature | false
date_pattern | 0000-00-00
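
The date_pattern transformation is simple enough to sketch in a few lines of Python. This is a rough reconstruction of the idea rather than the exact code I ran. The function names are made up for illustration, and the validator is passed in as a plain callable so that the sketch does not assume anything about the validation module's actual API. The level feature fields are left out for the same reason.

```python
def date_pattern(value):
    """Replace every digit with 0 and every letter with x; keep everything else."""
    return "".join(
        "0" if ch.isdigit() else "x" if ch.isalpha() else ch
        for ch in value
    )


def build_fields(value, is_valid_edtf):
    """Build some of the indexed fields for one date value.

    is_valid_edtf is a hypothetical callable (str -> bool) standing in for
    whatever EDTF validator is used.
    """
    return {
        "date": value,
        "date_valid_edtf": is_valid_edtf(value),
        "date_pattern": date_pattern(value),
    }


if __name__ == "__main__":
    # A stand-in validator for demonstration only; it is not a real EDTF check.
    fake_validator = lambda v: v == "2014-04-04"
    print(build_fields("2014-04-04", fake_validator))
    print(build_fields("circa 1886", fake_validator))
```

Collapsing values this way means that dates like 1999 and 2001 both end up as 0000, so a facet on date_pattern shows which shapes of date strings are most common without caring about the specific dates.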

For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.

After several hours of indexing these values into Solr, I was able to start answering some of the questions mentioned above.

Date usage in the DPLA

The first thing that I looked at was how many of the records in the DPLA dataset had dates and how many were missing them. Of the 8,012,390 items in my copy of the DPLA dataset, 6,624,767 (83%) had a date value and 1,387,623 (17%) were missing any date information.
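
For anyone who wants to check the arithmetic, the percentages come straight from the three counts above. The helper name share is just something I made up for this sketch.

```python
def share(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total


total_items = 8_012_390
with_date = 6_624_767
missing_date = 1_387_623

# The two groups should account for every item in the dataset.
assert with_date + missing_date == total_items

print(f"{share(with_date, total_items):.1f}% with dates")       # 82.7% with dates
print(f"{share(missing_date, total_items):.1f}% missing dates")  # 17.3% missing dates
```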

I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.

Hub Name | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date %
ARTstor | 56,342 | 49,908 | 88.6% | 6,434 | 11.4%
Biodiversity Heritage Library | 138,288 | 29,000 | 21.0% | 109,288 | 79.0%
David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0%
Digital Commonwealth | 124,804 | 118,672 | 95.1% | 6,132 | 4.9%
Digital Library of Georgia | 259,640 | 236,961 | 91.3% | 22,679 | 8.7%
Harvard Library | 10,568 | 6,957 | 65.8% | 3,611 | 34.2%
HathiTrust | 1,915,159 | 1,881,588 | 98.2% | 33,571 | 1.8%
Internet Archive | 208,953 | 194,454 | 93.1% | 14,499 | 6.9%
J. Paul Getty Trust | 92,681 | 92,494 | 99.8% | 187 | 0.2%
Kentucky Digital Library | 127,755 | 87,061 | 68.1% | 40,694 | 31.9%
Minnesota Digital Library | 40,533 | 39,708 | 98.0% | 825 | 2.0%
Missouri Hub | 41,557 | 34,742 | 83.6% | 6,815 | 16.4%
Mountain West Digital Library | 867,538 | 634,571 | 73.1% | 232,967 | 26.9%
National Archives and Records Administration | 700,952 | 553,348 | 78.9% | 147,604 | 21.1%
North Carolina Digital Heritage Center | 260,709 | 214,134 | 82.1% | 46,575 | 17.9%
Smithsonian Institution | 897,196 | 675,648 | 75.3% | 221,548 | 24.7%
South Carolina Digital Library | 76,001 | 52,328 | 68.9% | 23,673 | 31.1%
The New York Public Library | 1,169,576 | 791,912 | 67.7% | 377,664 | 32.3%
The Portal to Texas History | 477,639 | 424,342 | 88.8% | 53,297 | 11.2%
United States Government Printing Office (GPO) | 148,715 | 148,548 | 99.9% | 167 | 0.1%
University of Illinois at Urbana-Champaign | 18,103 | 14,273 | 78.8% | 3,830 | 21.2%
University of Southern California. Libraries | 301,325 | 269,880 | 89.6% | 31,445 | 10.4%
University of Virginia Library | 30,188 | 26,072 | 86.4% | 4,116 | 13.6%
Presence of Dates by Hub Name

I was surprised by the high percentage of records with dates for many of the Hubs in the DPLA. The only Hub that had more records without dates than with dates was the Biodiversity Heritage Library. Some Hubs, notably David Rumsey, HathiTrust, the J. Paul Getty Trust, and the Government Printing Office, have dates for more than 98% of their items in the DPLA. This is most likely because of the kinds of data they are providing, or because dates are required to identify which items can be shared (HathiTrust).

When you look at Content-Hubs vs Service-Hubs you see the following.

Hub Type | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date %
Content-Hub | 5,736,178 | 4,782,214 | 83.4% | 953,964 | 16.6%
Service-Hub | 2,276,176 | 1,842,519 | 80.9% | 433,657 | 19.1%

It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.

Valid EDTF Dates

I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid according to the EDTF specification. It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.

EDTF Valid vs Not Valid

So the split is pretty close.

One thing that should be mentioned is that many of the common date formats that we use throughout our work are valid EDTF date strings, even though they may not have been created with the idea of supporting the EDTF. For example, 1999 and 2000-04-03 are both valid EDTF, and both follow heavily used date patterns that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.

In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and also at some of the EDTF features used by Hubs in the DPLA.

As always feel free to contact me via Twitter if you have questions or comments.