March | 2016 | mark e. phillips journal

Last year I took a look at the subject field and the date fields in the Digital Public Library of America (DPLA). This time around I wanted to begin looking at the description field and see what I could see.

Before diving into the analysis, I think it is important to take a look at a few things. First off, when you reference the DPLA Metadata Application Profile v4, you may notice that the description field is not a required field, in fact the field doesn’t show up in APPENDIX B: REQUIRED, REQUIRED IF AVAILABLE, AND RECOMMENDED PROPERTIES. From that you can assume that this field is very optional. Also, the description field when present is often used to communicate a variety of information to the user. The DPLA data has examples that are clearly rights statements, notes, physical descriptions of the item, content descriptions of the item, and in some instances a place to store identifiers or names. Of all of the fields that one will come into contact in the DPLA dataset, I would image that the description field is probably one of the ones with the highest variability of content. So with that giant caveat, let’s get started.

So on to the data.

The DPLA makes available a data dump of the metadata in their system. Last year I was analyzing just over 8 million records, this year the collection has grown to more than 11 million records ( 11,654,800 in the dataset I’m using).

The first thing that I had to accomplish was to pull out just the descriptions from the full json dataset that I downloaded. I was interested in three values for each record, specifically the Provider or “Hub”, the DPLA identifier for the item and finally the description fields. I finally took the time to look at jq, which made this pretty easy.

For those that are interested here is what I came up with to extract the data I wanted.

zcat all.json.gz | jq -nc --stream --compact-output '. | fromstream(1|truncate_stream(inputs)) | {'provider': (._source.provider["@id"]), 'id': (._source.id), 'descriptions': ._source.sourceResource.description?}'

This results in an output that look like this.

{"provider":"http://dp.la/api/contributor/cdl","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}
{"provider":"http://dp.la/api/contributor/cdl","id":"bca3f20535ed74edb20df6c738184a84","descriptions":["Lang: Maire, graveur."]}
{"provider":"http://dp.la/api/contributor/cdl","id":"76ceb3f9105098f69809b47aacd4e4e0","descriptions":null}
{"provider":"http://dp.la/api/contributor/cdl","id":"88c69f6d29b5dd37e912f7f0660c67c6","descriptions":null}

From there my plan was to write some short python scripts that can read a line, convert it from json into a python object and then do programmy stuff with it.

Who has what?

After parsing the data a bit I wanted to remind myself of the spread of the data in the DPLA collection. There is a page on the DPLA’s site http://dp.la/partners/ that shows you how many records have been contributed by which Hub in the network. This is helpful but I wanted to draw a bar graph to give a visual representation of this data.

DPLA Partner Records

As has been the case since it was added, Hathitrust is the biggest provider of records to the DPLA with other 2.4 million records. Pretty amazing!

There are three other Hubs/Providers that contribute over 1 million records each, The Smithsonian, New York Public Library, and the University of Southern California Libraries. Down from there there are three more that contribute over half a million records, Mountain West Digital Library, National Archives and Records Administration (NARA) and The Portal to Texas History.

There were 11,410 records (coded as undefined_provider) that are not currently associated with a Hub/Provider, probably a data conversion error somewhere during the record ingest pipeline.

Which have descriptions

After the reminder about the size and shape of the Hubs/Providers in the DPLA dataset, we can dive right into the data and see quickly how well represented in the data the description field is.

We can start off with another graph.

Percent of Hubs/Providers with and without descriptions

You can see that some of the Hubs/Providers have very few records (< 2%) with descriptions (Kentucky Digital Library, NARA) while others had a very high percentage (> 95%) of records with description fields present (David Rumsey, Digital Commonwealth, Digital Library of Georgia, J. Paul Getty Trust, Government Publishing Office, The Portal to Texas History, Tennessee Digital Library, and the University of Illinois at Urbana-Champaign).

Below is a full breakdown for each Hub/Provider showing how many and what percentage of the records have zero descriptions, or one or more descriptions.

Provider	Records	0 Descriptions	1+ Descriptions	0 Descriptions %	1+ Descriptions %
artstor	107,665	40,851	66,814	37.94%	62.06%
bhl	123,472	64,928	58,544	52.59%	47.41%
cdl	312,573	80,450	232,123	25.74%	74.26%
david_rumsey	65,244	168	65,076	0.26%	99.74%
digital-commonwealth	222,102	8,932	213,170	4.02%	95.98%
digitalnc	281,087	70,583	210,504	25.11%	74.89%
esdn	197,396	48,660	148,736	24.65%	75.35%
georgia	373,083	9,344	363,739	2.50%	97.50%
getty	95,908	229	95,679	0.24%	99.76%
gpo	158,228	207	158,021	0.13%	99.87%
harvard	14,112	3,106	11,006	22.01%	77.99%
hathitrust	2,474,530	1,068,159	1,406,371	43.17%	56.83%
indiana	62,695	18,819	43,876	30.02%	69.98%
internet_archive	212,902	40,877	172,025	19.20%	80.80%
kdl	144,202	142,268	1,934	98.66%	1.34%
mdl	483,086	44,989	438,097	9.31%	90.69%
missouri-hub	144,424	17,808	126,616	12.33%	87.67%
mwdl	932,808	57,899	874,909	6.21%	93.79%
nara	700,948	692,759	8,189	98.83%	1.17%
nypl	1,170,436	775,361	395,075	66.25%	33.75%
scdl	159,092	33,036	126,056	20.77%	79.23%
smithsonian	1,250,705	68,871	1,181,834	5.51%	94.49%
the_portal_to_texas_history	649,276	125	649,151	0.02%	99.98%
tn	151,334	2,463	148,871	1.63%	98.37%
uiuc	18,231	127	18,104	0.70%	99.30%
undefined_provider	11,422	11,410	12	99.89%	0.11%
usc	1,065,641	852,076	213,565	79.96%	20.04%
virginia	30,174	21,081	9,093	69.86%	30.14%
washington	42,024	8,838	33,186	21.03%	78.97%

With so many of the Hub/Providers having a high percentage of records with descriptions, I was curious about the overall records in the DPLA. Below is a pie chart that shows you what I found.

DPLA records with and without descriptions

Almost 2/3 of the records in the DPLA have at least one description field, this is more than I would have expected for an un-required, un-recommended field, but I think this is probably a good thing.

Descriptions per record

The final thing I wanted to look at in this post was the average number of description fields for each of the Hubs/Providers. This time we will start off with the data table below.

Provider	Providers	min	median	max	mean	stddev
artstor	107,665	0	1	5	0.82	0.84
bhl	123,472	0	0	1	0.47	0.50
cdl	312,573	0	1	10	1.55	1.46
david_rumsey	65,244	0	3	4	2.55	0.80
digital-commonwealth	222,102	0	2	17	2.01	1.15
digitalnc	281,087	0	1	19	0.86	0.67
esdn	197,396	0	1	1	0.75	0.43
georgia	373,083	0	2	98	2.32	1.56
getty	95,908	0	2	25	2.75	2.59
gpo	158,228	0	4	65	4.37	2.53
harvard	14,112	0	1	11	1.46	1.24
hathitrust	2,474,530	0	1	77	1.22	1.57
indiana	62,695	0	1	98	0.91	1.21
internet_archive	212,902	0	2	35	2.27	2.29
kdl	144,202	0	0	1	0.01	0.12
mdl	483,086	0	1	1	0.91	0.29
missouri-hub	144,424	0	1	16	1.05	0.70
mwdl	932,808	0	1	15	1.22	0.86
nara	700,948	0	0	1	0.01	0.11
nypl	1,170,436	0	0	2	0.34	0.47
scdl	159,092	0	1	16	0.80	0.41
smithsonian	1,250,705	0	2	179	2.19	1.94
the_portal_to_texas_history	649,276	0	2	3	1.96	0.20
tn	151,334	0	1	1	0.98	0.13
uiuc	18,231	0	3	25	3.47	2.13
undefined_provider	11,422	0	0	4	0.00	0.08
usc	1,065,641	0	0	6	0.21	0.43
virginia	30,174	0	0	1	0.30	0.46
washington	42,024	0	1	1	0.79	0.41

This time with an image

Average number of descriptions per record

You can see that there are several Hubs/Providers a have multiple descriptions per record, with the Government Publishing Office coming in at 4.37 descriptions per record.

I found it interesting that when you exclude the two Hubs/Providers that don’t really do descriptions (KDL and NARA) you see two that have a very low standard deviation from their mean (average) Tennessee Digital Library at 0.13 and The Portal to Texas History at 0.20 don’t drift much from their almost one description-per-record for Tennessee and almost two descriptions-per-record for Texas. It makes me think that this is probably a set of records that each of those Hubs/Providers would like to have identified so they could go in and add a few descriptions.

Closing

Well that wraps up this post that I hope is the first in a series of posts about the description field in the DPLA dataset. In subsequent posts we will move away from record level analysis of description fields and get down to the field level to do some analysis of the descriptions themselves. I have a number of predictions but I will hold onto those for now.

If you have questions or comments about this post, please let me know via Twitter.

mark e. phillips journal

Monthly Archives: March 2016

Beginning to look at the description field in the DPLA

So on to the data.

Who has what?

Which have descriptions

Descriptions per record

Closing