For a while now I’ve been wanting to write a series of posts to get a better understanding of how the Extended Date Time Format (EDTF) is used in cultural heritage organizations and, more specifically, whether those date formats are making their way into the Digital Public Library of America (DPLA). Before I get started with the analysis of the over eight million records in the DPLA, I want to give a little background on the EDTF itself and some of the past work in this area.
A Bitter Harvest
One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library on metadata harvesting. One text that I remember in particular was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers”, which talked about many of the issues that they ran into in trying to deal with dates from a variety of service providers. The same challenge was echoed by projects like the National Science Digital Library (NSDL), OAIster, and other aggregations of metadata over the years.
One thing that came out of many of these aggregation projects, and something that many of us are still dealing with today, is the fact that “dates are hard”.
Extended Date Time Format
A few years ago an initiative was started to address some of the challenges that we have in representing dates for the types of things that we work with in cultural heritage organizations. This initiative, named the Extended Date Time Format (EDTF), has a site at the Library of Congress and resulted in a draft specification for a profile of, or extension to, ISO 8601.
Among other things, the specification documents how to represent date concepts like the following in a machine-readable way.
Commonly Used Dates
|Date Feature||Example Item||Format||Example Date|
|Year||Book with publication year||YYYY||1902|
|Month||Monthly journal issue||YYYY-MM||1893-05|
|Interval||Compiled court documents||YYYY/YYYY||1887/1889|
|Season||Seasonal magazine issue||YYYY-SS||1957-23|
|Approximate||Map “circa 1886”||YYYY~||1886~|
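To make the formats in the table above a little more concrete, here is a minimal sketch that recognizes only those five shapes with regular expressions. It is not the UNT validator and not a full EDTF implementation; the names `FEATURES` and `classify` are hypothetical, and the season codes 21 through 24 are the ones I am assuming from the draft specification.

```python
import re

# Simplified building blocks; real EDTF has many more rules than this.
YEAR = r"\d{4}"
MONTH = r"(?:0[1-9]|1[0-2])"
SEASON = r"2[1-4]"  # assumed season codes: 21 through 24

# Hypothetical lookup covering only the five rows in the table above.
FEATURES = [
    ("year", re.compile(YEAR)),
    ("month", re.compile(YEAR + "-" + MONTH)),
    ("season", re.compile(YEAR + "-" + SEASON)),
    ("approximate", re.compile(YEAR + "~")),
    ("interval", re.compile(YEAR + "/" + YEAR)),
]


def classify(value):
    """Return the name of the first feature whose pattern matches the whole string."""
    for name, pattern in FEATURES:
        if pattern.fullmatch(value):
            return name
    return None


for example in ["1902", "1893-05", "1887/1889", "1957-23", "1886~", "circa 1886"]:
    print(example, "->", classify(example))
```

A free-text value such as “circa 1886” falls through and returns `None`, which is the kind of string that has to be rewritten as `1886~` before it counts as EDTF.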
Some Complex Dates
|Example Item||Kind of Date||Format||Example Date|
|Photo taken at some point during an event August 6-9, 1992||One of a Set||[YYYY..YYYY]||[1992-08-06..1992-08-09]|
|Hand-carved object, “circa 1870s”||Extended Interval (L1)||YYYY~/YYYY~||1870~/1879~|
|Envelope with a partially-legible postmark||Unspecified||“u” in place of digit(s)||18uu-08-1u|
|Map possibly created in 1607 or 1630||One of a Set, Uncertain||[YYYY, YYYY]||[1607?, 1630?]|
The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).
Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time. As a follow-up to that presentation, Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.
EDTF use in the DPLA
When the DPLA introduced its Metadata Application Profile (MAP) I noticed that the EDTF was mentioned as one of the ways that dates could be expressed. In the 3.1 profile it was mentioned both as a syntax schema for the dpla:SourceResource.date property and for all of the properties of the edm:TimeSpan class. The 4.0 profile changed things a bit: the EDTF was removed as a syntax schema from the dpla:SourceResource.date property and from the edm:TimeSpan “Original Source Date” property, but kept for the edm:TimeSpan “Begin” and “End” properties.
Because of this mention, and the knowledge that the Portal to Texas History, which is a Service-Hub, is contributing records in the EDTF, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.
- How many date values in the DPLA are valid EDTF values?
- How are these valid EDTF values distributed across the Hubs?
- What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
- What are the most common date format patterns used in the DPLA?
With these questions in mind I started the analysis.
Preparing the Dataset
I used the same dataset that I had created for some previous work that consisted of a data download from the DPLA of their Bulk Metadata Download in February 2015. This download contained 8,xxx,xxx records that I used for the analysis.
I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on GitHub) to classify each date present in each record as either valid EDTF or not valid. Additionally, I tested which level of EDTF each value conformed to. Finally, I identified the pattern of each of the date fields by converting all digits in the string to 0 and all alphabetic characters to x, and leaving all non-alphanumeric characters unchanged.
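The pattern step described above is simple enough to sketch in a few lines. This is only an illustration of the idea, not the code I ran for the analysis; the function name `date_pattern` and the sample values are hypothetical.

```python
from collections import Counter


def date_pattern(value):
    """Reduce a date string to its shape: digits become 0, letters become x."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("0")
        elif ch.isalpha():
            shape.append("x")
        else:
            # Punctuation and whitespace are kept as-is.
            shape.append(ch)
    return "".join(shape)


# Hypothetical sample of date strings, just to show how patterns group values.
sample = ["1999", "1902", "2000-04-03", "18uu-08-1u", "circa 1886", "1886~"]
pattern_counts = Counter(date_pattern(value) for value in sample)

for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

Collapsing values this way means that `1999` and `1902` both land on the pattern `0000`, so a handful of patterns can summarize millions of individual date strings.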
This resulted in the following fields being indexed for each date
For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.
After several hours of indexing these values into Solr, I was able to start answering some of the questions mentioned above.
Date usage in the DPLA
The first thing that I looked at was how many of the records in the DPLA dataset had dates vs the records that were missing dates. Of the 8,012,390 items in my copy of the DPLA dataset, 6,624,767 (83%) had a date value with 1,387,623 (17%) missing any date information.
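For anyone who wants to check the arithmetic, the percentages above come straight from the two counts. A small sketch, using only the totals quoted in this post (the `percent` helper is a hypothetical name):

```python
total_items = 8_012_390
with_date = 6_624_767
missing_date = total_items - with_date


def percent(part, whole):
    """Share of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(missing_date)
print(percent(with_date, total_items))
print(percent(missing_date, total_items))
```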
I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.
|Hub Name||Items||Items With Date||Items With Date %||Items Missing Date||Items Missing Date %|
|Biodiversity Heritage Library||138,288||29,000||21.0%||109,288||79.0%|
|Digital Library of Georgia||259,640||236,961||91.3%||22,679||8.7%|
|J. Paul Getty Trust||92,681||92,494||99.8%||187||0.2%|
|Kentucky Digital Library||127,755||87,061||68.1%||40,694||31.9%|
|Minnesota Digital Library||40,533||39,708||98.0%||825||2.0%|
|Mountain West Digital Library||867,538||634,571||73.1%||232,967||26.9%|
|National Archives and Records Administration||700,952||553,348||78.9%||147,604||21.1%|
|North Carolina Digital Heritage Center||260,709||214,134||82.1%||46,575||17.9%|
|South Carolina Digital Library||76,001||52,328||68.9%||23,673||31.1%|
|The New York Public Library||1,169,576||791,912||67.7%||377,664||32.3%|
|The Portal to Texas History||477,639||424,342||88.8%||53,297||11.2%|
|United States Government Printing Office (GPO)||148,715||148,548||99.9%||167||0.1%|
|University of Illinois at Urbana-Champaign||18,103||14,273||78.8%||3,830||21.2%|
|University of Southern California. Libraries||301,325||269,880||89.6%||31,445||10.4%|
|University of Virginia Library||30,188||26,072||86.4%||4,116||13.6%|
I was surprised by the high percentage of dates in records for many of the Hubs in the DPLA. The only Hub that had more records without dates than with dates was the Biodiversity Heritage Library. There were some Hubs, notably David Rumsey, HathiTrust, J. Paul Getty Trust, and the Government Printing Office, that have dates for more than 98% of their items in the DPLA. This is most likely because of the kinds of data they are providing or the fact that dates are required to identify which items can be shared (HathiTrust).
When you look at Content-Hubs vs Service-Hubs you see the following.
|Hub Type||Items||Items With Date||Items With Date %||Items Missing Date||Items Missing Date %|
It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.
Valid EDTF Dates
I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid based on the EDTF specification. It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.
So the split is pretty close.
One thing that should be mentioned is that many of the common date formats we use throughout our work are valid EDTF date strings, even though they may not have been created with the idea of supporting the EDTF. For example, 1999 and 2000-04-03 are both valid EDTF (and follow heavily used date patterns) and are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.
In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and also at some of the EDTF features used by Hubs in the DPLA.
As always, feel free to contact me via Twitter if you have questions or comments.