
DPLA Descriptive Metadata Lengths: By Provider/Hub

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum description length, the maximum length, the number of descriptions (remember the field can be multi-valued, so there are more descriptions than records for a provider/hub), the sum (all of the lengths added together), the mean of the lengths, and finally the standard deviation.

provider min max count sum mean stddev
artstor 0 6,868 128,922 9,413,898 73.02 178.31
bhl 0 100 123,472 775,600 6.28 8.48
cdl 0 6,714 563,964 65,221,428 115.65 211.47
david_rumsey 0 5,269 166,313 74,401,401 447.36 861.92
digital-commonwealth 0 23,455 455,387 40,724,507 89.43 214.09
digitalnc 1 9,785 241,275 45,759,118 189.66 262.89
esdn 0 9,136 197,396 23,620,299 119.66 170.67
georgia 0 12,546 875,158 135,691,768 155.05 210.85
getty 0 2,699 264,268 80,243,547 303.64 273.36
gpo 0 1,969 690,353 33,007,265 47.81 58.20
harvard 0 2,277 23,646 2,424,583 102.54 194.02
hathitrust 0 7,276 4,080,049 174,039,559 42.66 88.03
indiana 0 4,477 73,385 6,893,350 93.93 189.30
internet_archive 0 7,685 523,530 41,713,913 79.68 174.94
kdl 0 974 144,202 390,829 2.71 24.95
mdl 0 40,598 483,086 105,858,580 219.13 345.47
missouri-hub 0 130,592 169,378 35,593,253 210.14 2325.08
mwdl 0 126,427 1,195,928 174,126,243 145.60 905.51
nara 0 2,000 700,948 1,425,165 2.03 28.13
nypl 0 2,633 1,170,357 48,750,103 41.65 161.88
scdl 0 3,362 159,681 18,422,935 115.37 164.74
smithsonian 0 6,076 2,808,334 139,062,761 49.52 137.37
the_portal_to_texas_history 0 5,066 1,271,503 132,235,329 104.00 95.95
tn 0 46,312 151,334 30,513,013 201.63 248.79
uiuc 0 4,942 63,412 3,782,743 59.65 172.44
undefined_provider 0 469 11,436 2,373 0.21 6.09
usc 0 29,861 1,076,031 60,538,490 56.26 193.20
virginia 0 268 30,174 301,042 9.98 17.91
washington 0 1,000 42,024 5,258,527 125.13 177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.
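
As an aside for anyone curious how numbers like these come out of Solr: a minimal sketch is below, assuming each description was indexed with its character count in an integer field called desc_length_i (as in the previous post) and that the provider/hub lives in a field I’m calling provider_s here (that field name, the core name, and the URL are placeholders).  The stats.facet parameter asks the Solr StatsComponent to break the statistics down per provider; the exact response layout can vary by Solr version.

```python
import requests

SELECT_URL = "http://localhost:8983/solr/dpla/select"  # placeholder core name/URL

params = {
    "q": "*:*",
    "rows": 0,
    "stats": "true",
    "stats.field": "desc_length_i",   # per-description length in characters
    "stats.facet": "provider_s",      # assumed name of the provider/hub field
    "wt": "json",
}

resp = requests.get(SELECT_URL, params=params).json()
field_stats = resp["stats"]["stats_fields"]["desc_length_i"]

# Per-provider sub-statistics show up under the 'facets' key of the field stats.
for provider, s in sorted(field_stats["facets"]["provider_s"].items()):
    print(provider, s["min"], s["max"], s["count"], s["sum"], s["mean"], s["stddev"])
```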

Average Description Length

The first graph just looks at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

For me, several bars stand out as being very small on this graph, specifically those for the providers bhl, kdl, nara, undefined_provider, and virginia.  I also noticed that david_rumsey has the highest average description length, at just under 450 characters.  Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn at roughly 200 characters each.

One thing to keep in mind from the previous post is that the average description length for the DPLA as a whole was 83.32 characters, so many of the hubs are over that mark, and some are significantly over it.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to look at the standard deviation in addition to the average; that way you get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

A few providers/hubs stand out from the others in this chart.  First, david_rumsey has a stddev just short of double its average length.  Both mwdl and missouri-hub have a very high stddev compared to their averages.  For this dataset, it appears that these partners have a huge range in their description lengths compared to the others.

There are a few that have a relatively small stddev compared to the average length.  Only two partners actually have a stddev lower than their average: the_portal_to_texas_history and getty.

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

Longest Description by Hub

Remember the earlier chart that showed the average and the stddev next to each other for each Provider/Hub, where we saw a pretty large stddev for missouri-hub and mwdl?  The chart above may show why.  Both of these hubs have descriptions of over 120,000 characters.

There are six Providers/Hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc.  I could be wrong, but I have a feeling that descriptions this long probably aren’t that helpful for users and are most likely the full text of the resource making its way into the metadata record.  We should remember that “metadata is data about data”… not the actual data.

Total Description Length by Provider/Hub

Total Description Length of All Descriptions by Hub

Just for fun, I was curious how the total lengths of the description fields per provider/hub would look on a graph; those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content; mwdl does.  Looking at the sourcing of these records helps explain why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have.  The provider/hub mwdl, by contrast, is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history, which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try to determine, on a provider/hub basis, what might count as “too long” for a given provider using some method of outlier detection.  I’ve done the work for this but don’t know enough about the mathy parts to know whether it is relevant to this dataset or not.
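
For what it’s worth, the outlier idea is basically the Tukey fence approach: compute the quartiles of the description lengths for each provider/hub and flag anything above Q3 + 1.5 × IQR as “too long” for that provider.  Here is a rough sketch, assuming the lengths have already been pulled out of Solr into per-provider lists (the data structure is made up for illustration):

```python
from statistics import quantiles  # Python 3.8+

def upper_fence(lengths):
    """Upper Tukey fence (Q3 + 1.5 * IQR) for a list of description lengths."""
    q1, _median, q3 = quantiles(lengths, n=4)
    return q3 + 1.5 * (q3 - q1)

def flag_long_descriptions(lengths_by_provider):
    """lengths_by_provider: e.g. {"mwdl": [115, 87, 12003, ...], "cdl": [...], ...}"""
    report = {}
    for provider, lengths in lengths_by_provider.items():
        cutoff = upper_fence(lengths)
        report[provider] = {
            "too_long_cutoff": cutoff,
            "too_long_count": sum(1 for length in lengths if length > cutoff),
        }
    return report
```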

I have about a dozen more metrics that I want to look at for these records, so I’m going to have to figure out a way to move through them a bit quicker; otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Field Analysis: Yes there really are 44 “page” long description fields.

In my previous post I mentioned that I was starting to take a look at the descriptive metadata fields in the metadata collected and hosted by the Digital Public Library of America.  That post focused on records: how many records had description fields present, and how many were missing.  I also broke those numbers into the Provider/Hub groupings present in the DPLA dataset to see if there were any patterns.

Moving on, the next thing I wanted to look at was data related to each instance of the description field.  I parsed each description, calculated a variety of statistics for it, and then loaded those into my current data analysis tool, Solr, which acts as both my data store and my full-text index.

After about seven hours of processing I ended up with 17,884,946 description fields from the 11,654,800 records in the dataset.  You will notice that we have more descriptions than we do records; this is because a record can have more than one instance of a description field.

Let’s take a look at a few of the high-level metrics.

Cardinality

I first wanted to find out the cardinality of the lengths of the description fields.  When I indexed each of the descriptions, I counted the number of characters in the description and saved that as an integer in a field called desc_length_i in the Solr index.  Once it was indexed, it was easy to retrieve the number of unique values for length that were present.  There are 5,287 unique description lengths in the 17,884,946 descriptions that we are analyzing.  This isn’t too surprising or meaningful by itself, just a bit of description of the dataset.
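
The indexing step itself is nothing fancy.  A stripped-down sketch of what it might look like is below; it posts one Solr document per description instance to the JSON update handler.  The desc_length_i field name is the one used in this post, but the core name, URL, and the shape of the incoming records ("id", "description") are stand-ins for whatever the real processing code uses.

```python
import json
import requests

UPDATE_URL = "http://localhost:8983/solr/dpla/update?commit=true"  # placeholder core

def index_description_lengths(records):
    """records: iterable of dicts with an 'id' and a possibly multi-valued 'description'."""
    docs = []
    for record in records:
        descriptions = record.get("description") or []
        if isinstance(descriptions, str):
            descriptions = [descriptions]
        if not descriptions:
            descriptions = [""]  # a record with no description counts as a zero-length one
        for i, desc in enumerate(descriptions):
            docs.append({
                "id": "{}_{}".format(record["id"], i),  # one Solr doc per description instance
                "desc_length_i": len(desc),             # character count for this description
            })
    requests.post(UPDATE_URL, data=json.dumps(docs),
                  headers={"Content-Type": "application/json"})
```

Getting the number of unique lengths back out is then just a matter of faceting (or a distinct count) on desc_length_i.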

I tried to make a few graphs to show the lengths and how many descriptions had what length.  Here is what I came up with.

Length of Descriptions in dataset

You can barely see the blue line; the problem is that there are over 4 million zero-length descriptions while the longest lengths occur only once each.

Here is a second try using a log scale for the x axis.

Length of Descriptions in dataset (x axis log)

This reads a little better, I think; you can see that there is a dive down from the zero lengths and then a spike back up at about 10 characters.

One more graph to see what we can see,  this time a log-log plot of the data.

Length of Descriptions in dataset (log-log)
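
These plots are just length versus number-of-descriptions with different axis scales.  Something along these lines (a sketch, assuming the per-length counts have already been pulled out of Solr into a dict) reproduces the log-log version:

```python
import matplotlib.pyplot as plt

def plot_length_distribution(counts_by_length):
    """counts_by_length: e.g. {0: 4113841, 1: 777887, ..., 130592: 1}"""
    lengths = sorted(counts_by_length)
    counts = [counts_by_length[length] for length in lengths]

    plt.plot(lengths, counts)
    plt.xscale("log")  # note: the length-0 bucket can't be drawn on a log x axis
    plt.yscale("log")
    plt.xlabel("Description length (characters)")
    plt.ylabel("Number of descriptions")
    plt.title("Length of Descriptions in dataset (log-log)")
    plt.show()
```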

Average Description Lengths

Now that we are finished with the cardinality of the lengths, next up is figuring out the average description length for the entire dataset.  This time the Solr StatsComponent is used, which makes getting these statistics a breeze.  Here is a small table showing the output from Solr.

min max count missing sum sumOfSquares mean stddev
0 130,592 17,884,946 0 1,490,191,622 2,621,904,732,670 83.32 373.71

Here we see that the minimum length for a description is zero characters (a record without a description present has a length of zero for that field in this model).  The longest description in the dataset is 130,592 characters.  The total number of characters present in the dataset is nearly one and a half billion.  Finally, the number we were after, the average length of a description, turns out to be 83.32 characters.
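
For reference, the StatsComponent request behind a table like that is a one-liner against the length field; a sketch (URL and core name are placeholders):

```python
import requests

params = {
    "q": "*:*",
    "rows": 0,
    "stats": "true",
    "stats.field": "desc_length_i",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/dpla/select", params=params).json()

# Contains min, max, count, missing, sum, sumOfSquares, mean, and stddev.
print(resp["stats"]["stats_fields"]["desc_length_i"])
```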

For those that might be curious what 84 characters (I rounded up instead of down) of description looks like,  here is an example.

Aerial photograph of area near Los Angeles Memorial Coliseum, Los Angeles, CA, 1963.

So not a horrible looking length for a description.  It feels like it is just about one sentence long with 13 “words” in this sentence.

Long descriptions

Jumping back a bit to the length of the longest description field: that description is 130,592 characters long.  If you assume that the average single-spaced page is 3,000 characters, this description field is about 43.5 pages long.  Readers who have spent time with aggregated metadata will probably say “looks like someone put the full text of the item into the record”.  If you’ve spent some serious (or maybe not that serious) time in the metadata mines (trenches?) you would probably mumble something like “ContentDM grumble grumble”, and you would be right on both accounts.  Here is the record on the DPLA site with the 130,592 character long description – http://dp.la/item/40a4f5069e6bf02c3faa5a445656ea61

The next thing I was curious about was the number of descriptions that are “long”.  To answer this I’m going to allow myself a little back-of-the-envelope freedom in deciding what “long” means for a description field in a metadata record.  (In future blog posts I might be able to answer this with different analysis of the data, but this will hopefully do for today.)  For now I’m going to arbitrarily decide that anything over 325 characters is “too long”.

Descriptions: Too Long and Not Too Long

Looking at that pie chart, 5.8% of the descriptions are “too long” based on my ad-hoc metric from above.  That 5.8% of the descriptions accounts for 708,050,671 characters, or 48% of the 1,490,191,622 characters in the entire dataset.  I bet if you looked a little harder you would find that the description field gets very close to the 80/20 rule, with 20% of the descriptions accounting for 80% of the overall description length.
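
Both of those numbers (the share of descriptions over 325 characters, and the share of the total character count they account for) can be pulled from the same stats field by adding a range filter; roughly like this, with the URL and core name again as placeholders:

```python
import requests

SELECT_URL = "http://localhost:8983/solr/dpla/select"  # placeholder core name/URL

def length_stats(extra_params=None):
    params = {"q": "*:*", "rows": 0, "stats": "true",
              "stats.field": "desc_length_i", "wt": "json"}
    params.update(extra_params or {})
    resp = requests.get(SELECT_URL, params=params).json()
    return resp["stats"]["stats_fields"]["desc_length_i"]

all_stats = length_stats()
long_stats = length_stats({"fq": "desc_length_i:[326 TO *]"})  # "too long" = over 325 characters

print("share of descriptions: %.1f%%" % (100.0 * long_stats["count"] / all_stats["count"]))
print("share of characters:   %.1f%%" % (100.0 * long_stats["sum"] / all_stats["sum"]))
```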

Short descriptions

Now that we’ve worked with long descriptions, the next thing we should look at is the number of descriptions that are “short”.

There are 4,113,841 records that don’t have a description in the DPLA dataset.  This means that for this analysis 4,113,841 (23%) of the descriptions have a length of 0.  There are 2,041,527 (11%) descriptions that are between 1 and 10 characters long.  Below is the breakdown of these ten counts; you can see that there is a surprisingly large number (777,887) of descriptions that have a single character as their descriptive contribution to the dataset.

Descriptions 10 characters or less

There is also an interesting spike at ten characters in length where suddenly we jump to over 500,000 descriptions in the DPLA.
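
The breakdown above is just a range facet over the length field, something like the sketch below (with the default json.nl setting the counts come back as a flat alternating list of values and counts; URL and core name are placeholders):

```python
import requests

params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.range": "desc_length_i",
    "facet.range.start": 0,
    "facet.range.end": 11,   # one-character-wide buckets for lengths 0 through 10
    "facet.range.gap": 1,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/dpla/select", params=params).json()

counts = resp["facet_counts"]["facet_ranges"]["desc_length_i"]["counts"]
for length, count in zip(counts[::2], counts[1::2]):
    print(length, count)
```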

So what?

Now that we have the average length of a description in the DPLA dataset, the number of descriptions that we consider “long”, and the number that we consider “short”, I think the very next question that gets asked is “so what?”

I think there are four big reasons that I’m working on this kind of project with the DPLA data.

One is that the DPLA is the largest aggregation of descriptive metadata in the US for digital resources in cultural heritage institutions. This is important because you get to take a look at a wide variety of data input rules, practices, and conversions from local systems to an aggregated metadata system.

Secondly, this data is available under a CC0 license and in a bulk data format, so it is easy to grab the data and start working with it.

Thirdly, there haven’t been many studies like this on descriptive metadata that I’m aware of. OCLC publishes analysis of its MARC catalog data from time to time, and the IMLS-funded metadata research that was happening at GSLIS at UIUC isn’t going on anymore (great work to look at, by the way), so there really aren’t many discussions about using large-scale aggregations of metadata to understand the practices in place in cultural heritage institutions across the US.  I am pretty sure that there is work being carried out across the Atlantic with the Europeana datasets that are available.

Finally I think that this work can lead to metadata quality assurance practices and indicators for metadata creators and aggregators about what may be wrong with their metadata (a message saying “your description is over a page long, what’s up with that?”).

I don’t think there are many answers so far in this work, but I feel these explorations are moving us in the direction of a better understanding of our descriptive metadata world in the context of these large aggregations of metadata.

If you have questions or comments about this post,  please let me know via Twitter.