Monthly Archives: February 2015

DPLA Metadata Analysis: Part 3 – Where to go from here.

This is the last of three posts about working with the Digital Public Library of America’s (DPLA) metadata to demonstrate some of the analysis that can be done using Solr and a little bit of time and patience. Here are links to the first and second post in the series.

What I want to talk about in this post is how we can use this data to help improve access to our digital resources in the DPLA, and also how we can measure that we have in fact improved things when we go out and spend resources, both time and money, on metadata work.

The first thing we need to do is make an assumption to frame this conversation. For now, let's say that the presence of subjects in a metadata record is a positive indicator of quality, and that for the most part a record with three or more subjects (controlled, keywords, whatever) improves access to resources in metadata aggregation systems like the DPLA, which doesn't have the benefit of full text for searching.

So out of the numbers we've looked at so far, which ones should we pay the most attention to?

Zero Subjects

For me, it is the number of records already online that have zero subject headings. Going from 0 to 1 subject headings is much more of an improvement for access than going from 1 to 2, 2 to 3, 3 to 4, 4 to 8, or 8 to 15 subjects per record. So once all records have at least one subject we can move on. We can measure this directly with the metric for how many records have zero subjects that I introduced in the last post.

There are currently 1,827,276 records in the DPLA that have no subjects or keywords. This accounts for 23% of the DPLA dataset analyzed for these blog posts. I think this is a pretty straightforward area of metadata improvement to work on.
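If your subjects are indexed in Solr the way described in the first post of this series, this count is a single query away. Below is a minimal sketch; the core name (dpla) and the field names (subject, hub_name) are stand-ins for whatever your own index uses.

```python
import requests

# A minimal sketch: assumes a local Solr core named "dpla" with a
# multivalued "subject" field and a "hub_name" field -- adjust the
# URL and field names to match your own index.
SOLR_SELECT = "http://localhost:8983/solr/dpla/select"

def zero_subject_count(hub=None):
    """Count records that have no value at all in the subject field."""
    params = {
        "q": "*:* -subject:[* TO *]",  # all docs minus those with any subject
        "rows": 0,                     # we only need the hit count
        "wt": "json",
    }
    if hub:
        params["fq"] = 'hub_name:"%s"' % hub
    response = requests.get(SOLR_SELECT, params=params).json()
    return response["response"]["numFound"]

print(zero_subject_count())                               # whole dataset
print(zero_subject_count("The Portal to Texas History"))  # a single Hub
```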

Dead end subjects

One area we could work to improve is subjects that are used only once in the DPLA as a whole, or only once within a single Hub. Reducing this number would open up more avenues for navigating between records by connecting them via subject. There isn't anything bad about a subject heading that is unique within a community, but if a record doesn't give you a way to get to like records (assuming there are like records within a collection) then it isn't as useful as one that connects you to more, similar items. There are of course many legitimate reasons that a subject occurs only once in a dataset, and I don't think we should strive to remove these subjects completely, but reducing the number overall would be an indicator of improvement in my book.
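Counting these dead-end subjects is straightforward once the per-record subject lists have been extracted (as in the parsing step from the first post). A rough sketch, assuming records are available as (hub, subject list) tuples:

```python
from collections import Counter

def dead_end_subjects(records):
    """records: iterable of (hub_name, subject_list) tuples.

    Returns the set of subjects that occur on exactly one record in the
    whole dataset -- the "dead ends" that never connect one record to
    another.
    """
    counts = Counter()
    for _hub, subjects in records:
        counts.update(set(subjects))  # count each subject once per record
    return {subject for subject, n in counts.items() if n == 1}

# Toy example
sample = [
    ("Hub A", ["Texas", "Maps"]),
    ("Hub B", ["Texas"]),
    ("Hub B", ["Railroads"]),
]
print(dead_end_subjects(sample))  # {'Maps', 'Railroads'}
```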

In the last post I included a table with the number of unique subjects per Hub and the number of subjects that were unique to a single Hub. I was curious what percentage of each Hub's unique subjects occur only in that Hub, so here is that table with the percentage added.

| Hub Name | Records | Unique Subjects | # of Subjects Unique to Hub | % of Subjects Unique to Hub |
|---|---|---|---|---|
| ARTstor | 56,342 | 9,560 | 4,941 | 52% |
| Biodiversity Heritage Library | 138,288 | 22,004 | 9,136 | 42% |
| David Rumsey | 48,132 | 123 | 30 | 24% |
| Digital Commonwealth | 124,804 | 41,704 | 31,094 | 75% |
| Digital Library of Georgia | 259,640 | 132,160 | 114,689 | 87% |
| Harvard Library | 10,568 | 9,257 | 7,204 | 78% |
| HathiTrust | 1,915,159 | 685,733 | 570,292 | 83% |
| Internet Archive | 208,953 | 56,911 | 28,978 | 51% |
| J. Paul Getty Trust | 92,681 | 2,777 | 1,852 | 67% |
| Kentucky Digital Library | 127,755 | 1,972 | 1,337 | 68% |
| Minnesota Digital Library | 40,533 | 24,472 | 17,545 | 72% |
| Missouri Hub | 41,557 | 6,893 | 4,338 | 63% |
| Mountain West Digital Library | 867,538 | 227,755 | 192,501 | 85% |
| National Archives and Records Administration | 700,952 | 7,086 | 3,589 | 51% |
| North Carolina Digital Heritage Center | 260,709 | 99,258 | 84,203 | 85% |
| Smithsonian Institution | 897,196 | 348,302 | 325,878 | 94% |
| South Carolina Digital Library | 76,001 | 23,842 | 18,110 | 76% |
| The New York Public Library | 1,169,576 | 69,210 | 52,002 | 75% |
| The Portal to Texas History | 477,639 | 104,566 | 87,076 | 83% |
| United States Government Printing Office (GPO) | 148,715 | 174,067 | 105,389 | 61% |
| University of Illinois at Urbana-Champaign | 18,103 | 6,183 | 3,076 | 50% |
| University of Southern California. Libraries | 301,325 | 65,958 | 51,822 | 79% |
| University of Virginia Library | 30,188 | 3,736 | 2,425 | 65% |

Here is the breakdown when grouped by type of Hub, either Service-Hub or Content-Hub.

| Hub Type | Records | Unique Subjects | Subjects Unique to Hub Type | % of Subjects Unique to Hub Type |
|---|---|---|---|---|
| Content Hubs | 5,736,178 | 1,311,830 | 1,253,769 | 96% |
| Service Hubs | 2,276,176 | 618,081 | 560,049 | 91% |

Or another way to look at how the subjects are shared between the different types of Hubs is the following graph.

[Figure: Subjects unique to and shared between Hub Types.]

It appears that only a small number (3%) of subjects are shared between Hub types. Would increasing this number improve users' ability to discover resources from multiple Hubs?
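The percentages in the table above boil down to set arithmetic. Here is a small sketch of that calculation over (hub type, subject list) tuples; it illustrates the idea rather than reproducing the exact script used for these posts.

```python
def hub_type_overlap(records):
    """records: iterable of (hub_type, subject_list) tuples, where
    hub_type is either "content" or "service".

    Returns how many subjects appear only in Content-Hubs, only in
    Service-Hubs, and in both.
    """
    content, service = set(), set()
    for hub_type, subjects in records:
        (content if hub_type == "content" else service).update(subjects)
    shared = content & service
    total = len(content | service) or 1
    return {
        "content_only": len(content - service),
        "service_only": len(service - content),
        "shared": len(shared),
        "shared_pct": 100.0 * len(shared) / total,
    }
```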

More, More, More

Once we've worked on the areas mentioned above, I think we should work to raise the number of subjects per record within a given Hub. I don't think there is a magic number for everyone, but at UNT we try to have three subjects for each record whenever possible, so that's what we are shooting for. We can easily see improvement by looking at the mean and checking whether it goes up (even ever so slightly).

Next Steps

I think there is some work we could do to identify which records need specific kinds of subject work based on more involved processing of the input records, but I'm going to leave that for another post and probably another flight somewhere.

Hope you enjoyed these three posts and that they resonate at least a bit with you.

Feel free to send me a note on Twitter if you have questions, comments, or ideas for me about this.

DPLA Metadata Analysis: Part 2 – Beyond basic stats

More stats for subjects

In my previous post I displayed some of the statistics that are readily available from Solr as part of its StatsComponent functionality (if you haven’t used this part of Solr yet you really should). There are a few other things that we could collect to get a more complete picture of a metadata field.

So far we have min, max, number of records, total number of subjects, sum of squares, mean, and standard deviation. The other values I think we should take a look at are the following:

Records without Subjects – Number of records that have no subjects.

Percent of records without Subjects – Percentage of the Hub's records that don't have subjects.

Mode – The subjects-per-record count that is most common for a specific Hub.

Unique Subjects – Number of unique subject strings present for a specific Hub.

Hub Unique Subjects – Number of subjects that occur only in that Hub.

Entropy – A measure of the uncertainty in the metadata field; for our purposes it is a good way to understand how subjects are distributed across a Hub's records.

Below is a table that contains the fields listed above,  plus some relevant fields from the previous post. Each Hub has a row in this table.

| Hub Name | Records | Records Without Subjects | % Without Subjects | Avg. Subjects per Record | Subject Count Mode | Unique Subjects | # of Subjects Unique to Hub | Entropy |
|---|---|---|---|---|---|---|---|---|
| ARTstor | 56,342 | 6,586 | 11.7 | 3.5 | 3 | 9,560 | 4,941 | 0.73 |
| Biodiversity Heritage Library | 138,288 | 10,326 | 7.5 | 3.3 | 2 | 22,004 | 9,136 | 0.65 |
| David Rumsey | 48,132 | 30,167 | 62.7 | 0.5 | 0 | 123 | 30 | 0.76 |
| Digital Commonwealth | 124,804 | 6,040 | 4.8 | 2.4 | 1 | 41,704 | 31,094 | 0.77 |
| Digital Library of Georgia | 259,640 | 3,216 | 1.2 | 4.4 | 2 | 132,160 | 114,689 | 0.67 |
| Harvard Library | 10,568 | 167 | 1.6 | 2.5 | 2 | 9,257 | 7,204 | 0.76 |
| HathiTrust | 1,915,159 | 525,874 | 27.5 | 1.4 | 1 | 685,733 | 570,292 | 0.88 |
| Internet Archive | 208,953 | 44,872 | 21.5 | 1.8 | 1 | 56,911 | 28,978 | 0.80 |
| J. Paul Getty Trust | 92,681 | 73,978 | 79.8 | 0.4 | 0 | 2,777 | 1,852 | 0.60 |
| Kentucky Digital Library | 127,755 | 117,790 | 92.2 | 0.2 | 0 | 1,972 | 1,337 | 0.62 |
| Minnesota Digital Library | 40,533 | 0 | 0 | 5 | 4 | 24,472 | 17,545 | 0.74 |
| Missouri Hub | 41,557 | 11,451 | 27.6 | 2.3 | 0 | 6,893 | 4,338 | 0.69 |
| Mountain West Digital Library | 867,538 | 49,473 | 5.7 | 3 | 1 | 227,755 | 192,501 | 0.68 |
| National Archives and Records Administration | 700,952 | 619,212 | 88.3 | 0.3 | 0 | 7,086 | 3,589 | 0.63 |
| North Carolina Digital Heritage Center | 260,709 | 41,323 | 15.9 | 3.3 | 2 | 99,258 | 84,203 | 0.66 |
| Smithsonian Institution | 897,196 | 29,452 | 3.3 | 6.4 | 7 | 348,302 | 325,878 | 0.62 |
| South Carolina Digital Library | 76,001 | 7,460 | 9.8 | 3 | 2 | 23,842 | 18,110 | 0.72 |
| The New York Public Library | 1,169,576 | 208,472 | 17.8 | 1.7 | 1 | 69,210 | 52,002 | 0.62 |
| The Portal to Texas History | 477,639 | 58 | 0 | 11 | 10 | 104,566 | 87,076 | 0.49 |
| United States Government Printing Office (GPO) | 148,715 | 1,794 | 1.2 | 3.1 | 2 | 174,067 | 105,389 | 0.92 |
| University of Illinois at Urbana-Champaign | 18,103 | 4,221 | 23.3 | 3.8 | 0 | 6,183 | 3,076 | 0.63 |
| University of Southern California. Libraries | 301,325 | 35,106 | 11.7 | 2.9 | 2 | 65,958 | 51,822 | 0.59 |
| University of Virginia Library | 30,188 | 229 | 0.8 | 3.2 | 1 | 3,736 | 2,425 | 0.60 |

Looking at the row for The Portal to Texas History, we can see that of the 477,639 records in the dataset, 58 do not have any subjects, which is a very small percentage (about 0.012%). From there we can go to the average of 11 subjects per record with a mode of 10; nothing earth-shaking here, just more info. There are 104,566 unique subjects in the Portal's dataset, with 87,076 of those being unique to the Portal. Finally, the entropy for the Portal's subject field is 0.49; compared to GPO's, which is 0.92, you can interpret this to mean that the subject values are more "clumpy" for the Portal (a smaller number of subjects are used across a larger number of records) than for GPO (a larger number of subjects are spread across its records).

The following two tables further illustrate the entropy values for the Portal's and GPO's subjects. The first table shows the top ten subjects, and the number of records carrying each, from GPO's dataset.

| Subject | Records |
|---|---|
| National security–United States | 1,138 |
| United States. Congress. House–Rules and practice | 748 |
| Terrorism–United States–Prevention | 718 |
| United States. Department of Defense–Appropriations and expenditures | 631 |
| United States | 536 |
| Social security–United States–Periodicals | 487 |
| Emergency management–United States | 485 |
| Medicare | 441 |
| Consumer protection–United States | 417 |
| Wisconsin–Maps | 406 |

Now take a look at the top ten subjects and their counts for the Portal.

| Subject | Records |
|---|---|
| Places | 310,404 |
| United States | 306,597 |
| Texas | 305,551 |
| Business, Economics and Finance | 248,455 |
| Communications | 223,783 |
| Newspapers | 221,422 |
| Advertising | 218,527 |
| Journalism | 217,737 |
| Landscape and Nature | 76,308 |
| Geography and Maps | 70,742 |

So with the entropy value, you can read a lower number as being more like the Portal's subjects and a higher number as being more like GPO's. At the extremes, a value of 1.0 would mean that every subject is used by exactly one record, and a value of 0 would mean that there is only one subject, used by all of the records.
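For those who want to reproduce this, a normalized Shannon entropy (entropy divided by its maximum possible value, so it falls between 0 and 1) matches the interpretation above. Here is a small sketch of that calculation; the exact implementation behind the numbers in the table may differ slightly.

```python
import math
from collections import Counter

def normalized_subject_entropy(subject_occurrences):
    """subject_occurrences: a flat list of subject strings, one entry for
    every time a subject appears on a record within a Hub.

    Returns the Shannon entropy of the subject distribution divided by
    its maximum possible value (log of the number of unique subjects),
    so 0 means one subject carries everything and 1 means every subject
    is used equally often.
    """
    counts = Counter(subject_occurrences)
    if len(counts) < 2:
        return 0.0  # one (or zero) subjects means no uncertainty at all
    total = float(sum(counts.values()))
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))

print(normalized_subject_entropy(["Texas"] * 9 + ["Maps"]))  # ~0.47, "clumpy"
print(normalized_subject_entropy(["a", "b", "c", "d"]))      # 1.0, perfectly even
```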

Shared Subjects

In creating the table above I had to work out the number of subjects that each Hub has uniquely. While I was at it, I calculated this number for the whole dataset to find out how much subject overlap occurs.

The table below displays the breakdown of how subjects are distributed across Hub collections. For example, if two Hubs have the subject "Laws of Texas" then it is said to be shared by two Hubs. The breakdown for the metadata in the DPLA is as follows.

| # of Hubs with Subject | Count |
|---|---|
| 1 | 1,717,512 |
| 2 | 114,047 |
| 3 | 21,126 |
| 4 | 8,013 |
| 5 | 3,905 |
| 6 | 2,187 |
| 7 | 1,330 |
| 8 | 970 |
| 9 | 689 |
| 10 | 494 |
| 11 | 405 |
| 12 | 302 |
| 13 | 245 |
| 14 | 199 |
| 15 | 152 |
| 16 | 117 |
| 17 | 63 |
| 18 | 62 |
| 19 | 32 |
| 20 | 20 |
| 21 | 7 |
| 22 | 7 |

Most of the subjects, 1,717,512 to be exact, occur in only one Hub's collection.
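This breakdown is easy to recompute from the same (hub, subject list) tuples used earlier; here is a short sketch of the idea:

```python
from collections import Counter, defaultdict

def hubs_per_subject_distribution(records):
    """records: iterable of (hub_name, subject_list) tuples.

    Returns a Counter mapping "number of Hubs a subject appears in" to
    "how many subjects appear in exactly that many Hubs" -- the same
    breakdown shown in the table above.
    """
    hubs_for_subject = defaultdict(set)
    for hub, subjects in records:
        for subject in subjects:
            hubs_for_subject[subject].add(hub)
    return Counter(len(hubs) for hubs in hubs_for_subject.values())
```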

There are seven different subjects that are common across 22 of the 23 Hubs in the DPLA metadata dataset.

There should be one final post in this series where I can hopefully suggest what we should do with this data.

Again, if you want to chat about this post, hit me up on Twitter.

DPLA Metadata Analysis: Part 1 – Basic stats on subjects

On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.

I was interested in finding out the following pieces of information.

  1. What is the average number and standard deviation of subjects-per-record in the DPLA?
  2. How does this number compare across the partners?
  3. Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?

Building the Dataset

The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.

With a little work I was able to parse all of the metadata records and extract some information I was interested in working with, specifically the subjects for records.

So, after parsing through the records to get a list of subjects per record, along with the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could reuse for a few different projects.
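For anyone who wants to try the same thing, the extraction step looks roughly like the sketch below. The layout of the DPLA bulk file has changed over time, so treat the field paths here (_source, provider, sourceResource/subject) as assumptions to check against the copy you download.

```python
import gzip
import json

def iter_hub_subjects(path):
    """Yield (hub_name, subject_list) for each record in a DPLA bulk
    metadata dump.

    Assumes one JSON document per line with the DPLA MAP record under
    "_source" -- adjust the field paths to whatever your copy of the
    bulk file actually contains.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            doc = json.loads(line).get("_source", {})
            hub = doc.get("provider", {}).get("name", "unknown")
            subjects = [
                s.get("name", "").strip()
                for s in doc.get("sourceResource", {}).get("subject", [])
            ]
            yield hub, [s for s in subjects if s]
```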

Loading the records into the Solr index took quite a while (at roughly 1,000 documents per second).

After a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr's built-in StatsComponent functionality. For a description of this feature, see the documentation on Solr's site.

Answering the questions

The average number of subjects per record in the DPLA is 2.99, with a standard deviation of 3.90. There are 1,827,276 records with zero subjects, and records with as many as 1,476 subjects.
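Getting those two numbers out of Solr is a single StatsComponent request. Here is a minimal sketch, assuming a local core named dpla with an integer subject_count field on each record; both names are conventions of this sketch, not anything the DPLA provides.

```python
import requests

# Assumes a local Solr core named "dpla" where each record has an
# integer "subject_count" field holding its number of subjects.
SOLR_SELECT = "http://localhost:8983/solr/dpla/select"

params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "stats": "true",
    "stats.field": "subject_count",
}
response = requests.get(SOLR_SELECT, params=params).json()
field_stats = response["stats"]["stats_fields"]["subject_count"]

# min, max, count, sum, sumOfSquares, mean, stddev -- the same values
# that show up in the per-Hub table below.
print(field_stats["mean"], field_stats["stddev"])
```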

Answering question number two involved a small script to create a table for us; you will find that table below.

| Hub Name | min | max | count | sum | sumOfSquares | mean | stddev |
|---|---|---|---|---|---|---|---|
| ARTstor | 0 | 71 | 56,342 | 194,948 | 1351826 | 3.460083064 | 3.467168662 |
| Biodiversity Heritage Library | 0 | 118 | 138,288 | 454,624 | 3100134 | 3.287515909 | 3.407385646 |
| David Rumsey | 0 | 4 | 48,132 | 22,976 | 33822 | 0.477353943 | 0.689083212 |
| Digital Commonwealth | 0 | 199 | 124,804 | 295,778 | 1767426 | 2.369940066 | 2.923194479 |
| Digital Library of Georgia | 0 | 161 | 259,640 | 1,151,369 | 8621935 | 4.43448236 | 3.680038874 |
| Harvard Library | 0 | 17 | 10,568 | 26,641 | 88155 | 2.520912188 | 1.409567895 |
| HathiTrust | 0 | 92 | 1,915,159 | 2,614,199 | 6951217 | 1.365003637 | 1.329038361 |
| Internet Archive | 0 | 68 | 208,953 | 385,732 | 1520200 | 1.84602279 | 1.966605872 |
| J. Paul Getty Trust | 0 | 36 | 92,681 | 32,999 | 146491 | 0.356049244 | 1.20575216 |
| Kentucky Digital Library | 0 | 13 | 127,755 | 26,009 | 82269 | 0.203584987 | 0.776219692 |
| Minnesota Digital Library | 1 | 78 | 40,533 | 202,484 | 1298712 | 4.995534503 | 2.661891328 |
| Missouri Hub | 0 | 139 | 41,557 | 97,115 | 606761 | 2.336910749 | 3.023203782 |
| Mountain West Digital Library | 0 | 129 | 867,538 | 2,641,065 | 17734515 | 3.044321978 | 3.34282307 |
| National Archives and Records Administration | 0 | 103 | 700,952 | 231,513 | 1143343 | 0.330283671 | 1.233711342 |
| North Carolina Digital Heritage Center | 0 | 1,476 | 260,709 | 869,203 | 8394791 | 3.333996908 | 4.591774892 |
| Smithsonian Institution | 0 | 548 | 897,196 | 5,763,459 | 56446687 | 6.423857217 | 4.652809633 |
| South Carolina Digital Library | 0 | 40 | 76,001 | 231,270 | 1125030 | 3.042986277 | 2.354387181 |
| The New York Public Library | 0 | 31 | 1,169,576 | 1,996,483 | 6585169 | 1.707014337 | 1.648179106 |
| The Portal to Texas History | 0 | 1,035 | 477,639 | 5,257,702 | 69662410 | 11.00768991 | 4.96771802 |
| United States Government Printing Office (GPO) | 0 | 30 | 148,715 | 457,097 | 1860297 | 3.073644219 | 1.749820977 |
| University of Illinois at Urbana-Champaign | 0 | 22 | 18,103 | 67,955 | 404383 | 3.753797713 | 2.871821391 |
| University of Southern California. Libraries | 0 | 119 | 301,325 | 863,535 | 4626989 | 2.865792749 | 2.672589058 |
| University of Virginia Library | 0 | 15 | 30,188 | 95,328 | 465286 | 3.157811051 | 2.332671249 |

The columns: min is the minimum number of subjects per record for a given Hub; Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. The column max shows the highest number of subjects on a single record; two groups, The Portal to Texas History and the North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. The column count is the number of records that each Hub had when the analysis was performed. The column sum is the total number of subject values for a given Hub; note this is not the number of unique subjects, which is not present in this dataset. The column mean shows the average number of subjects per record for the Hub, and stddev is the standard deviation from that number. The Portal to Texas History is at the top end of the averages with 11.01 subjects per record, and the Kentucky Digital Library is at the low end with 0.20 subjects per record.

The final question was whether there were differences between the Service-Hubs and the Content-Hubs; that breakdown is in the table below.

| Hub Type | min | max | count | sum | sumOfSquares | mean | stddev |
|---|---|---|---|---|---|---|---|
| Content-Hub | 0 | 548 | 5,736,178 | 13,207,489 | 84723999 | 2.302489393 | 3.077118385 |
| Service-Hub | 0 | 1,476 | 2,276,176 | 10,771,995 | 109293849 | 4.73249652 | 5.061612337 |

It appears that there is a higher number of subjects per record for the Service-Hubs than for the Content-Hubs, more than 2x, with 4.73 for Service-Hubs and 2.30 for Content-Hubs.

Another interesting number: 1,590,456 records contributed by Content-Hubs, or 28% of that collection, do not have subjects, compared to 236,811 records, or 10%, contributed by Service-Hubs.

I think individually we can come up with reasons that these numbers differ the way they do. There are explanations for all of this: where did the records come from? Were they generated as digital resource metadata records initially, or using an existing set of practices such as AACR2 in the MARC format, and how does that change the numbers? Are there things that the DPLA is doing to the subjects when they normalize them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, which inflates our numbers a bit in the tables above. I'd be interested to chat with anyone who has some "here's why" explanations for the numbers above.

Hit me up on Twitter if you want to chat about this.

How we assign unique identifiers

The UNT Libraries has made use of the ARK identifier specification for a number of years and has used these identifiers throughout our infrastructure on a number of levels. This post gives a little background about where, when, why, and a little about how we assign our ARK identifiers.

Terminology

The first thing we need to do is get some terminology out of the way so that we can talk about the parts consistently. This diagram is taken from the ARK documentation:

   http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
   \________________/ \__/ \___/ \______/ \____________/
     (replaceable)     |     |      |       Qualifier
          |       ARK Label  |      |    (NMA-supported)
          |                  |      |
Name Mapping Authority       |    Name (NAA-assigned)
         (NMA)               |
                  Name Assigning Authority Number (NAAN)

The ARK syntax can be summarized as follows:

  [http://NMA/]ark:/NAAN/Name[Qualifier]

For the UNT Libraries, we were assigned a Name Assigning Authority Number (NAAN) of 67531, so all of our identifiers start with ark:/67531/.

We mint Names for our ARKs locally with a home-grown system called a "Number Server." This Python Web service receives a request for a new number, assigns that number a prefix based on which instance we pull from, and returns the new Name.
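To give a flavor of what such a minter looks like, here is a minimal sketch of a namespace-aware counter backed by SQLite. This is only an illustration of the idea, not the actual UNT Number Server code.

```python
import sqlite3

class NumberServer(object):
    """A minimal sketch of a namespace-aware name minter -- an
    illustration of the idea only, not the UNT implementation."""

    def __init__(self, path="numbers.db"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS counters "
                "(namespace TEXT PRIMARY KEY, last INTEGER NOT NULL)"
            )

    def mint(self, namespace):
        """Return the next Name in a namespace, e.g. 'metapth1077976'."""
        with self.db:  # one transaction per mint
            self.db.execute(
                "INSERT OR IGNORE INTO counters VALUES (?, 0)", (namespace,)
            )
            self.db.execute(
                "UPDATE counters SET last = last + 1 WHERE namespace = ?",
                (namespace,),
            )
            (number,) = self.db.execute(
                "SELECT last FROM counters WHERE namespace = ?", (namespace,)
            ).fetchone()
        return "%s%d" % (namespace, number)

minter = NumberServer()
name = minter.mint("metapth")
print("ark:/67531/" + name)  # e.g. ark:/67531/metapth1 on a fresh database
```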

Namespaces

We have four different namespaces that we use for minting identifiers: metapth, metadc, metarkv, and coda. Additionally, we have a metatest namespace which we use when we need to test things out, but it isn't used that often. Finally, we have a historic namespace, metacrs, that is no longer used. Here is the breakdown of how we use these namespaces.

We try to assign Names from the metapth namespace to all items that end up on The Portal to Texas History whenever possible. We assign all other public-facing digital objects Names from the metadc namespace. This means that the UNT Digital Library and The Gateway to Oklahoma History both share Names from the metadc namespace. The metarkv namespace is used for "archive only" objects that go directly into our archival repository system; these include large Web archiving datasets. The coda namespace is used within our archival repository, called Coda. As stated earlier, the metatest namespace is only used for testing, and these items are thrown away after processing.

Name assignment

We assign Names in our systems in programmatic ways; this is always done as part of our digital item ingest process. We tend to process items in batches, most often several hundred items at a time and sometimes several thousand. When we process items they are processed in parallel, and therefore there is no logical order to how the Names are assigned to objects: they are in the order that they were processed, but may have no logical order past that.

We also don't assume that our Names are continuous. If you have identifiers metapth123 and metapth125, we don't assume that there is an item metapth124; sure, it may be there, but it also may never have been assigned. When we first started with these systems we would get worked up if we assigned several hundred or a few thousand identifiers and then had to delete those items; now this isn't an issue at all, but it took some time to get over.

Another assumption that can't be made in our systems: if you have an item, Newspaper Vol. 1 Issue 2, with an identifier of metapth333, there is no guarantee that Newspaper Vol. 1 Issue 3 will have metapth334; it might, but it isn't guaranteed. Another thing that happens in our systems is that items can be shared between systems, and membership in the Portal, UNT Digital Library, or Gateway is noted in the descriptive metadata. Therefore you can't say all metapth* identifiers are in the Portal, or that all metadc* identifiers are not; you have to look them up based on the metadata.

Once a number is assigned, it is never assigned again. This sounds like a silly thing to say, but it is important to remember: we don't try to save identifiers or reuse them as if we will run out of them.

Level of assignment

We currently assign an ARK identifier at the level of the intellectual object. So, for example, a newspaper issue gets an ARK, a photograph gets an ARK, and a book, a map, a report, an audio recording, or a video recording each gets an ARK. The sub-parts of an item are not given further unique identifiers, because the way we tend to interface with them is through formatted URLs such as those described here, or through other URL-based patterns such as the URLs we use to retrieve items from Coda.

http://coda.library.unt.edu/bag/ark:/67531/codanaf8/manifest-md5.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/coda_directives.py
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bagit.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bag-info.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/0=untl_aip_1.0
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/queries.xlsx
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/README.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/ba3ce7a1-0e3b-44cb-8b41-5d9d1b0438fe.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/7fe68777-54a2-4c71-95b2-aa33204ae84b.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadc498968.aip.mets.xml

Lessons Learned

Things I would do again.

  • I would most likely use just an incrementing counter for assigning identifiers.  Name minters such as Noid are also an option but I like the numbers with a short prefix.
  • I would not use a prefix such as UNT, to stay away from branding as much as possible. Even metapth is way too branded (see below).

Things I would change in our implementation.

  • I would only have one namespace for non-archival items. Two namespaces for production data just invite someone (usually me) to screw up, and then suddenly the reason for having one namespace over the other is meaningless. Just manage one namespace and move on.
  • I would not have a six- or seven-character prefix. metapth and metadc came as baggage from our first system; we decided that the 30k identifiers we had already minted had set our path. Now, after 1,077,975 identifiers in those namespaces, it seems a little silly that the first 3% of our items would still have such an effect on us today.
  • I would not brand our namespaces so closely to our system names. With metapth, metadc, and the legacy metacrs, people read too much into the naming convention. This is a big reason for opaque Names in the first place, and it is pretty important.

Things I might change in a future implementation.

  • I would probably pad my identifiers out to eight digits. While you can't rely on ARKs being generated in a given order, once they are assigned it is helpful to be able to sort by them and have a consistent order: metapth1, metapth100, metapth100000 don't always sort nicely the way metapth00000001, metapth00000100, metapth00100000 do (see the quick example below). But then again, longer runs of zeros are harder to transcribe, and I had a tough time just writing this example. Maybe I wouldn't do this.
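Here is the quick sorting example referenced above, showing why the padding helps:

```python
names = ["metapth1", "metapth100", "metapth100000", "metapth2"]

# Plain string sorting puts metapth100000 before metapth2
print(sorted(names))
# ['metapth1', 'metapth100', 'metapth100000', 'metapth2']

# Zero-padding the numeric part to eight digits keeps string order and
# numeric order in sync
padded = ["metapth%08d" % int(n[len("metapth"):]) for n in names]
print(sorted(padded))
# ['metapth00000001', 'metapth00000002', 'metapth00000100', 'metapth00100000']
```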

I don't think any of this post applies only to ARK identifiers; most identifier schemes at some level require a decision about how you are going to mint unique names for things. So hopefully this is useful to others.

If you have any specific questions for me, let me know on Twitter.