DPLA Descriptive Metadata Lengths: By Provider/Hub

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs that are represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum record length, max length, the number of descriptions (remember values can be multi-valued so there are more descriptions than records for a provider/hub),  sum (all of the lengths added together), the mean of the length and then finally the standard deviation.

provider min max count sum mean stddev
artstor 0 6,868 128,922 9,413,898 73.02 178.31
bhl 0 100 123,472 775,600 6.28 8.48
cdl 0 6,714 563,964 65,221,428 115.65 211.47
david_rumsey 0 5,269 166,313 74,401,401 447.36 861.92
digital-commonwealth 0 23,455 455,387 40,724,507 89.43 214.09
digitalnc 1 9,785 241,275 45,759,118 189.66 262.89
esdn 0 9,136 197,396 23,620,299 119.66 170.67
georgia 0 12,546 875,158 135,691,768 155.05 210.85
getty 0 2,699 264,268 80,243,547 303.64 273.36
gpo 0 1,969 690,353 33,007,265 47.81 58.20
harvard 0 2,277 23,646 2,424,583 102.54 194.02
hathitrust 0 7,276 4,080,049 174,039,559 42.66 88.03
indiana 0 4,477 73,385 6,893,350 93.93 189.30
internet_archive 0 7,685 523,530 41,713,913 79.68 174.94
kdl 0 974 144,202 390,829 2.71 24.95
mdl 0 40,598 483,086 105,858,580 219.13 345.47
missouri-hub 0 130,592 169,378 35,593,253 210.14 2325.08
mwdl 0 126,427 1,195,928 174,126,243 145.60 905.51
nara 0 2,000 700,948 1,425,165 2.03 28.13
nypl 0 2,633 1,170,357 48,750,103 41.65 161.88
scdl 0 3,362 159,681 18,422,935 115.37 164.74
smithsonian 0 6,076 2,808,334 139,062,761 49.52 137.37
the_portal_to_texas_history 0 5,066 1,271,503 132,235,329 104.00 95.95
tn 0 46,312 151,334 30,513,013 201.63 248.79
uiuc 0 4,942 63,412 3,782,743 59.65 172.44
undefined_provider 0 469 11,436 2,373 0.21 6.09
usc 0 29,861 1,076,031 60,538,490 56.26 193.20
virginia 0 268 30,174 301,042 9.98 17.91
washington 0 1,000 42,024 5,258,527 125.13 177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.

Average Description Length

The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

Average Description Length by Hub

For me I see that there are several bars that are very small on this graph, specifically for the providers bhl, kdl, nara, unidentified_provider, and virginia.  I also noticed that david_rumsey has the highest average description length of 450 characters.  Following david_rumsey is getty at 300 and then mmdl, missouri, and tn who are at about 200 characters for the average length.

One thing to keep in mind from the previous post is that the average length for the whole DPLA was 83.32 characters in length, so many of the hubs were over that and some significantly over that number.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to take a look at the standard deviation in addition to just the average,  that way you are able to get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

Description Length Mean and Stddev by Hub

There are a few providers/hubs that I think stand out from the others by looking at the chart. First david_rumsey has a stddev just short of double its average length.  The mwdl and the missouri-hub have a very high stddev compared to their average. For this dataset, it appears that these partners have a huge range in their lengths of descriptions compared to others.

There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

Longest Description by Hub

Longest Description by Hub

Remember the chart before this one that showed the average and the stddev next to each other for the Provider/Hub,  there we said a pretty large stddev for missouri_hub and mwdl? You may see why that is with the chart above.  Both of these hubs have descriptions of over 120,000 characters.

There are six Providers/Hubs that have some seriously long descriptions,  digital-commonwealth, mdl, missouri_hub, mwdl, tn, and usc.  I could be wrong but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full-text of the resource making its way into the metadata record.  We should remember,  “metadata is data about data”… not the actual data.

Total Description Length of Descriptions by Provider/Hub

Total Description Length of All Descriptions by Hub

Total Description Length of All Descriptions by Hub

Just for fun I was curious about how the total lengths of the description fields per provider/hub would look on a graph, those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust which has the most records in the DPLA doesn’t contribute the most description content. In fact the most is contributed by mwdl.  If you look into the sourcing of these records you will have an understanding of why with the majority of the records in the hathitrust set coming from MARC records which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have. The provider/hub mwdl is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try and determine on a provider/hub basis what might be a notion of “too long” for a given provider based on some methods of outlier detection,  I’ve done the work for this but don’t know enough about the mathy parts to know if it is relevant to this dataset or not.

I have about a dozen more metrics that I want to look at for these records so I’m going to have to figure out a way to move through them a bit quicker otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post,  please let me know via Twitter.