I wanted to take a look at the date values that had made their way into the DPLA dataset from the various Hubs. The first thing that I was curious about was how many unique date strings are present in the dataset. It turns out that there are 280,592 unique date strings.
Here are the top ten date strings, the number of instances of each, and whether the string is valid EDTF.
|Date Value||Instances||Valid EDTF|
|1935 – 1945||27,143||FALSE|
It looks like “[Date Unavailable]” is a value used by the New York Public Library to denote that an item does not have an available date. It should be noted that NYPL also has 377,664 items in the DPLA that have no date value present at all, so this isn’t a default behavior for items without a date. Most likely it is a practice within a single division that denotes unknown or missing dates this way. The value “1939-1939” is used heavily by the University of Southern California. Libraries and seems to come from a single set of WPA Census Cards in their collection. The value “1960-1990” is used primarily for the items in the J. Paul Getty Trust.
I was also curious as to the length of the dates in the dataset. I was sure that I would find large numbers of date strings that were four characters in length (1923), ten characters in length (1923-03-04), and other lengths that match common, heavily used date formats. I also figured that there would be date strings that were either shorter than four characters or longer than one would expect for a date string. Here are some example date strings for both.
Top ten date strings shorter than four characters
I’m not sure what “*” means for a date value, but the other values seem to be Japanese versions of four-digit dates (this is what Google Translate tells me). There are 14,402 records that have date strings shorter than three characters and a total of 522 unique date strings present.
Top ten date strings longer than fifty characters.
|Miniature repainted: 12th century AH/AD 18th (Safavid)||35|
|Some repainting: 13th century AH/AD 19th century (Safavid||25|
|11th century AH/AD 17th century-13th century AH/AD 19th century (Safavid (?))||15|
|1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939||13|
|10th century AH/AD 16th century-12th century AH/AD 18th century (Ottoman)||10|
|late 11th century AH/AD 17th century-early 12th century AH/AD 18th century (Ottoman)||8|
|5th century AH/AD 11th century-6th century AH/AD 12th century (Abbasid)||7|
|4th quarter 8th century AH/AD 14th century (Mamluk)||5|
|L’an III de la République française … [1794-1795]||5|
|Began with 1st rept. (112th Congress, 1st session, published June 24, 2011)||3|
There are 1,033 items with 894 unique values that are over fifty characters in length. The longest “date string” is 193 characters, with a value of “chez W. Innys, J. Brotherton, R. Ware, W. Meadows, T. Meighan, J. & P. Knapton, J. Brindley, J. Clarke, S. Birt, D. Browne, T. Dongman, J. Shuckburgh, C. Hitch, J. Hodges, S. Austen, A. Millar,” which appears to be a misplacement of another field’s data.
Here is the distribution across Hubs of the items whose date strings are fifty characters or longer.
|Hub Name||Items with Date Strings 50 Characters or Longer|
|United States Government Printing Office (GPO)||683|
|Mountain West Digital Library||31|
|University of Illinois at Urbana-Champaign||3|
|J. Paul Getty Trust||2|
|North Carolina Digital Heritage Center||2|
It seems that a large portion of these 50+ character date strings are present in the Government Printing Office records.
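The length checks described above can be sketched roughly as follows. This is only an illustration, not the code used for the numbers in this post: the function name `length_outliers`, the `(hub_name, date_string)` record shape, and the sample records are all assumptions made for the example.

```python
from collections import Counter


def length_outliers(records, short_below=4, long_at_least=50):
    """Tally unusually short and unusually long date strings.

    `records` is assumed to be an iterable of (hub_name, date_string)
    pairs; the real DPLA records are shaped differently.
    """
    short = Counter()        # short date string -> instances
    long_ = Counter()        # long date string -> instances
    long_by_hub = Counter()  # hub name -> items with a long date string
    for hub, date in records:
        if date is None:
            continue
        if len(date) < short_below:
            short[date] += 1
        elif len(date) >= long_at_least:
            long_[date] += 1
            long_by_hub[hub] += 1
    return short, long_, long_by_hub


# Made-up sample records, for illustration only.
sample = [
    ("Hub A", "*"),
    ("Hub A", "1923"),
    ("Hub B", "1923-03-04"),
    ("Hub B", "1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939"),
]
short, long_, long_by_hub = length_outliers(sample)
print(short.most_common(10))
print(long_by_hub.most_common())
```

Calling `most_common(10)` on either counter gives a top ten list like the ones shown above.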
Another way of looking at dates that I experimented with for this project was to convert a date string into what I’m calling a “date pattern”. For this I take an input string, say “1940-03-22”, and map it to 0000-00-00. I convert all digits to zero and all letters to the letter a, and leave every character that is not alphanumeric unchanged.
Below is the function that I use for this.
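def get_date_pattern(date_string):
    """Reduce a date string to its pattern, e.g. "1940-03-22" -> "0000-00-00"."""
    pattern = []
    if date_string is None:
        return None
    for c in date_string:
        if c.isalpha():
            pattern.append("a")  # every letter becomes "a"
        elif c.isdigit():
            pattern.append("0")  # every digit becomes "0"
        else:
            pattern.append(c)  # punctuation and whitespace are kept as-is
    return "".join(pattern)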
By applying this function to all of the date strings in the dataset I’m able to take a look at what overall date patterns (and also features) are being used throughout the dataset, and ignore the specific values.
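As a rough sketch of how that tally could work, the snippet below counts patterns with `collections.Counter` and keeps one example string per pattern. The helper name `count_patterns` and the short list of sample dates are assumptions for illustration, and the pattern function is restated in a compact form so the example runs on its own.

```python
from collections import Counter


def get_date_pattern(date_string):
    """Compact restatement of the pattern function, for a standalone example."""
    if date_string is None:
        return None
    return "".join(
        "a" if c.isalpha() else "0" if c.isdigit() else c
        for c in date_string
    )


def count_patterns(date_strings):
    """Return (pattern -> instances, pattern -> first example string seen)."""
    counts = Counter()
    examples = {}
    for value in date_strings:
        pattern = get_date_pattern(value)
        if pattern is None:
            continue  # skip records with no date value
        counts[pattern] += 1
        examples.setdefault(pattern, value)
    return counts, examples


# A handful of sample values; the real input would be every date string.
dates = ["1940-03-22", "1923-03-04", "1935", "22 Jan 2006", "[Date Unavailable]", None]
counts, examples = count_patterns(dates)
for pattern, instances in counts.most_common(10):
    print(pattern, instances, examples[pattern])
```

Keeping an example alongside each count makes it much easier to read patterns such as `[aaaa aaaaaaaaaaa]`, which are hard to interpret on their own.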
There are a total of 74 different date patterns for date strings that are valid EDTF. For those date strings that are not valid EDTF, there are a total of 13,643 different date patterns. I’ve pulled the top ten date patterns for both valid EDTF and not valid EDTF date strings and presented them below.
Valid EDTF Date Patterns
|Valid EDTF Date Pattern||Instances||Example|
You can see that the basic date formats yyyy, yyyy-mm-dd, and yyyy-mm are very popular in the dataset. After those come intervals in the format yyyy/yyyy and uncertain dates in the format yyyy?.
Non-Valid EDTF Date Patterns
|Non-Valid EDTF Date Pattern||Instances||Example|
|[aaaa aaaaaaaaaaa]||183,825||[Date Unavailable]|
|00 aaa 0000||143,423||22 Jan 2006|
|0000 – 0000||134,408||2000 – 2005|
|0 aaa 0000||62,950||3 Jan 2000|
|aaa 0000||43,676||Jan 2000|
Many of the date strings represented by these patterns could be “cleaned up” with simple transforms, if that was of interest. I would imagine that converting 0000-0000 to 0000/0000 would be a fairly lossless transform that would suddenly make over a million items valid EDTF. Converting the format 00/00/0000 to 0000-00-00 is also a straightforward transform, provided you know whether the leading 00/00 is mm/dd (US) or dd/mm (non-US). Removing the brackets around four-digit years seems to be another easy fix that would convert a large number of dates. Of the top ten non-valid EDTF date patterns, it might be possible to convert nine of them to valid EDTF date strings with simple transformations. This would give the DPLA 2,360,113 additional dates that are valid EDTF date strings. The values for the date pattern [aaaa aaaaaaaaaaa], with a date string value of [Date Unavailable], might benefit from being removed from the dataset altogether in order to reduce some of the noise in the field.
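To make those transforms a little more concrete, here is a minimal sketch of three of them using regular expressions. Nothing like this has actually been run against the DPLA data for this post. The function name `clean_date` and the `day_first` flag are hypothetical, and the code does not check that the result is a real calendar date or valid EDTF, so a proper EDTF validator would still be needed afterwards.

```python
import re

# Year ranges such as "1939-1939" or "1935 – 1945" (hyphen or en dash).
YEAR_RANGE = re.compile(r"^([0-9]{4})\s*[-–]\s*([0-9]{4})$")
# Bracketed four-digit years such as "[1900]".
BRACKETED_YEAR = re.compile(r"^\[([0-9]{4})\]$")
# Slash dates such as "03/04/1923".
SLASH_DATE = re.compile(r"^([0-9]{2})/([0-9]{2})/([0-9]{4})$")


def clean_date(value, day_first=False):
    """Apply simple rewrites; return the input unchanged if none match.

    `day_first` is an assumption the caller must supply: False treats
    00/00/0000 as mm/dd/yyyy (US), True treats it as dd/mm/yyyy.
    """
    value = value.strip()

    match = YEAR_RANGE.match(value)
    if match:
        return f"{match.group(1)}/{match.group(2)}"

    match = BRACKETED_YEAR.match(value)
    if match:
        return match.group(1)

    match = SLASH_DATE.match(value)
    if match:
        first, second, year = match.groups()
        month, day = (second, first) if day_first else (first, second)
        return f"{year}-{month}-{day}"

    return value


for raw in ["1935 – 1945", "1939-1939", "[1900]", "03/04/1923", "[Date Unavailable]"]:
    print(repr(raw), "->", repr(clean_date(raw)))
```

Because the mm/dd versus dd/mm question cannot be answered from the string alone, the sketch leaves that decision to the caller, who would most likely set it per Hub.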
Common Patterns Per Hub
One last thing that I wanted to do was to see if there are any commonalities between the Hubs when you look at their most frequently used date patterns. Below I’ve created tables for both valid EDTF date patterns and non-valid EDTF date patterns.
Valid EDTF Patterns
|Hub Name||Pattern 1||Pattern 2||Pattern 3||Pattern 4||Pattern 5|
|Biodiversity Heritage Library||0000||-0000||0000/0000||0000-00||0000?|
|Digital Library of Georgia||0000-00-00||0000-00||0000/0000||0000||0000-00-00/0000-00-00|
|J. Paul Getty Trust||0000||0000?|
|Kentucky Digital Library||0000|
|Minnesota Digital Library||0000||0000-00-00||0000?||0000-00||0000-00-00?|
|Mountain West Digital Library||0000-00-00||0000||0000-00||0000?||0000-00-00a00:00:00a|
|National Archives and Records Administration||0000||0000?|
|North Carolina Digital Heritage Center||0000-00-00||0000||0000-00||0000/0000||0000?|
|South Carolina Digital Library||0000-00-00||0000||0000-00||0000?|
|The New York Public Library||0000-00-00||0000-00||0000||-0000||0000-00-00/0000-00-00|
|The Portal to Texas History||0000-00-00||0000||0000-00||[0000-00-00..0000-00-00]||0000~|
|United States Government Printing Office (GPO)||0000||0000?||aaaa||-0000||[0000, 0000]|
|University of Illinois at Urbana-Champaign||0000||0000-00-00||0000?||0000-00|
|University of Southern California. Libraries||0000-00-00||0000/0000||0000||0000-00||0000-00/0000-00|
|University of Virginia Library||0000-00-00||0000||0000-00||0000?||0000?-00|
I tried to color code the five most common EDTF date patterns from above in the following image.
I’m not sure whether that makes it any clearer where the common date patterns fall.
Non-Valid EDTF Patterns
|Hub Name||Pattern 1||Pattern 2||Pattern 3||Pattern 4||Pattern 5|
|ARTstor||0000-0000||aa. 0000||aaaaaaa||0000a||aa. 0000-0000|
|Biodiversity Heritage Library||0000-0000||0000 – 0000||0000-||0000-00||[0000-0000]|
|Digital Library of Georgia||0000-0000||0000-00-00||0000-00- 00||aaaaa 0000||0000a|
|Harvard Library||0000a-0000a||a. 0000||0000a||0000-0000||0000 – a. 0000|
|J. Paul Getty Trust||0000-0000||a. 0000-0000||a. 0000||[000-]||[aa. 0000]|
|Kentucky Digital Library|
|Minnesota Digital Library||0000 – 0000||0000-00 – 0000-00||0000-0000||0000-00-00 – 0000-00-00||0000 – 0000?|
|Missouri Hub||a0000||0000-00-00||aaaaaaaa 00, 0000||aaaaaaa 00, 0000||aaaaaaaa 0, 0000|
|Mountain West Digital Library||0000-0000||aa. 0000-0000||aa. 0000||0000? – 0000?||0000 aa|
|National Archives and Records Administration||00/00/0000||00/0000||a’aa. 0000′-a’aa. 0000′||a’00/0000′-a’00/0000′||a’00/00/0000′-a’00/00/0000′|
|North Carolina Digital Heritage Center||0000-0000||00000000||00000000-00000000||aa. 0000-0000||aa. 0000|
|Smithsonian Institution||0000-0000||00 aaa 0000||0000-aaa-00||0 aaa 0000||aaa 0000|
|South Carolina Digital Library||0000-0000||0000 – 0000||0000-||0000-00-00||0000-0-00|
|The New York Public Library||0000-0000||[aaaa aaaaaaaaaaa]||0000 – 0000||0000-00-00 – 0000-00-00||0000-|
|The Portal to Texas History||a. 0000||||0000 – 0000||[aaaaaaa 0000 aaa 0000]||a.0000 – 0000|
|United States Government Printing Office (GPO)||||0000-0000||[0000?]||aaaaa aaaa 0000||00aa-0000|
|University of Illinois at Urbana-Champaign||0-00-00||a. 0000||00/00/00||0-0-00||00-00-00|
|University of Southern California. Libraries||0000-0000||aaaaa 0000/0000||aaaaa 0000-00-00/0000-00-00||0000a||aaaaa 0000-0000|
|University of Virginia Library||aaaaaaa aaaa||a0000||aaaaaaa 0000 aaa 0000?||aaaaaaa 0000 aaa 0000||00–?|
With the non-valid EDTF Date Patterns you can see where some of the date patterns are much more common across the various Hubs than others.
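A per-Hub ranking like the ones in these tables can be sketched as below. The `(hub_name, date_string)` record shape, the function name `top_patterns_per_hub`, and the sample records are assumptions for the example. The split into valid and non-valid EDTF is left out here because it depends on whichever EDTF validator is in use.

```python
from collections import Counter, defaultdict


def get_date_pattern(date_string):
    """Compact restatement of the pattern function, for a standalone example."""
    if date_string is None:
        return None
    return "".join(
        "a" if c.isalpha() else "0" if c.isdigit() else c
        for c in date_string
    )


def top_patterns_per_hub(records, n=5):
    """Map each hub name to its n most frequent date patterns."""
    by_hub = defaultdict(Counter)
    for hub, date in records:
        pattern = get_date_pattern(date)
        if pattern is not None:
            by_hub[hub][pattern] += 1
    return {
        hub: [pattern for pattern, _ in counts.most_common(n)]
        for hub, counts in by_hub.items()
    }


# Made-up records, for illustration only.
records = [
    ("Hub A", "1923-03-04"),
    ("Hub A", "1940-03-22"),
    ("Hub A", "1935"),
    ("Hub B", "1900-1910"),
    ("Hub B", "1920-1930"),
    ("Hub B", "ca. 1900"),
    ("Hub B", None),
]
for hub, patterns in top_patterns_per_hub(records).items():
    print(hub, patterns)
```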
I hope you have found these posts interesting. If you’ve worked with metadata, especially aggregated metadata, you will no doubt recognize much of this from your own datasets. If you are new to this area or haven’t really worked with the wide range of date values that you can come in contact with in large metadata collections, have no fear: it is getting better. The EDTF is a very good specification for cultural heritage institutions to adopt for their digital collections. It helps to provide both a machine and human readable format for encoding and notating the complex dates we have to work with in our field.
If there is another field that you would like me to take a look at in the DPLA dataset, please let me know.
As always, feel free to contact me via Twitter if you have questions or comments.