Monthly Archives: February 2015

DPLA Metadata Analysis: Part 3 – Where to go from here.

This is the last of three posts about working with the Digital Public Library of America’s (DPLA) metadata to demonstrate some of the analysis that can be done using Solr and a little bit of time and patience. Here are links to the first and second post in the series.

What I want to talk about in this post is how we can use this data to help improve access to our digital resources in the DPLA, and also how we can measure that we have in fact improved things when we go out and spend resources, both time and money, on metadata work.

The first thing we need to do is make an assumption to frame this conversation. For now, let's say that the presence of subjects in a metadata record is a positive indicator of quality, and that for the most part a record with three or more subjects (controlled, keywords, whatever) improves access to resources in metadata aggregation systems like the DPLA, which doesn't have the benefit of full text for searching.

So out of the numbers we've looked at so far, which ones should we pay the most attention to?

Zero Subjects

For me, it is the number of records already online that have zero subject headings. Going from 0 to 1 subject headings is much more of an improvement for access than going from 1 to 2, 2 to 3, 3 to 4, 4 to 8, or 8 to 15 subjects per record. So once all records have at least one subject we can move on. We can measure this directly with the metric for how many records have zero subjects that I introduced in the last post.

There are currently 1,827,276 records in the DPLA that have no subjects or keywords. This accounts for 23% of the DPLA dataset analyzed for these blog posts. I think this is a pretty straightforward area of metadata improvement to work on.
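If your subjects are indexed in Solr the way described in the first post of this series, this count is a single query away. Below is a minimal sketch; the core name (dpla) and the field names (subject, hub_name) are stand-ins for whatever your own index uses.

```python
import requests

# A minimal sketch: assumes a local Solr core named "dpla" with a
# multivalued "subject" field and a "hub_name" field -- adjust the
# URL and field names to match your own index.
SOLR_SELECT = "http://localhost:8983/solr/dpla/select"

def zero_subject_count(hub=None):
    """Count records that have no value at all in the subject field."""
    params = {
        "q": "*:* -subject:[* TO *]",  # all docs minus those with any subject
        "rows": 0,                     # we only need the hit count
        "wt": "json",
    }
    if hub:
        params["fq"] = 'hub_name:"%s"' % hub
    response = requests.get(SOLR_SELECT, params=params).json()
    return response["response"]["numFound"]

print(zero_subject_count())                               # whole dataset
print(zero_subject_count("The Portal to Texas History"))  # a single Hub
```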

Dead end subjects

One area we could work to improve is subjects that are used only once in the DPLA as a whole, or only once within a single Hub. Reducing this number would open up more avenues for navigating between records by connecting them via subject. There isn't anything bad about a subject heading that is unique within a community, but if a record doesn't give you a way to get to like records (assuming there are like records within a collection) then it isn't as useful as one that connects you to more, similar items. There are of course many legitimate reasons that a subject occurs only once in a dataset, and I don't think we should strive to remove these subjects completely, but reducing the number overall would be an indicator of improvement in my book.
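Counting these dead-end subjects is straightforward once the per-record subject lists have been extracted (as in the parsing step from the first post). A rough sketch, assuming records are available as (hub, subject list) tuples:

```python
from collections import Counter

def dead_end_subjects(records):
    """records: iterable of (hub_name, subject_list) tuples.

    Returns the set of subjects that occur on exactly one record in the
    whole dataset -- the "dead ends" that never connect one record to
    another.
    """
    counts = Counter()
    for _hub, subjects in records:
        counts.update(set(subjects))  # count each subject once per record
    return {subject for subject, n in counts.items() if n == 1}

# Toy example
sample = [
    ("Hub A", ["Texas", "Maps"]),
    ("Hub B", ["Texas"]),
    ("Hub B", ["Railroads"]),
]
print(dead_end_subjects(sample))  # {'Maps', 'Railroads'}
```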

In the last post I included a table with the number of unique subjects per Hub and the number of subjects that were unique to a single Hub. I was curious what percentage of each Hub's unique subjects occur only in that Hub, so here is that table with the percentage added.

| Hub Name | Records | Unique Subjects | # of Subjects Unique to Hub | % of Subjects Unique to Hub |
|---|---|---|---|---|
| ARTstor | 56,342 | 9,560 | 4,941 | 52% |
| Biodiversity Heritage Library | 138,288 | 22,004 | 9,136 | 42% |
| David Rumsey | 48,132 | 123 | 30 | 24% |
| Digital Commonwealth | 124,804 | 41,704 | 31,094 | 75% |
| Digital Library of Georgia | 259,640 | 132,160 | 114,689 | 87% |
| Harvard Library | 10,568 | 9,257 | 7,204 | 78% |
| HathiTrust | 1,915,159 | 685,733 | 570,292 | 83% |
| Internet Archive | 208,953 | 56,911 | 28,978 | 51% |
| J. Paul Getty Trust | 92,681 | 2,777 | 1,852 | 67% |
| Kentucky Digital Library | 127,755 | 1,972 | 1,337 | 68% |
| Minnesota Digital Library | 40,533 | 24,472 | 17,545 | 72% |
| Missouri Hub | 41,557 | 6,893 | 4,338 | 63% |
| Mountain West Digital Library | 867,538 | 227,755 | 192,501 | 85% |
| National Archives and Records Administration | 700,952 | 7,086 | 3,589 | 51% |
| North Carolina Digital Heritage Center | 260,709 | 99,258 | 84,203 | 85% |
| Smithsonian Institution | 897,196 | 348,302 | 325,878 | 94% |
| South Carolina Digital Library | 76,001 | 23,842 | 18,110 | 76% |
| The New York Public Library | 1,169,576 | 69,210 | 52,002 | 75% |
| The Portal to Texas History | 477,639 | 104,566 | 87,076 | 83% |
| United States Government Printing Office (GPO) | 148,715 | 174,067 | 105,389 | 61% |
| University of Illinois at Urbana-Champaign | 18,103 | 6,183 | 3,076 | 50% |
| University of Southern California. Libraries | 301,325 | 65,958 | 51,822 | 79% |
| University of Virginia Library | 30,188 | 3,736 | 2,425 | 65% |

Here is the breakdown when grouped by type of Hub, either Service-Hub or Content-Hub.

| Hub Type | Records | Unique Subjects | Subjects Unique to Hub Type | % of Subjects Unique to Hub Type |
|---|---|---|---|---|
| Content Hubs | 5,736,178 | 1,311,830 | 1,253,769 | 96% |
| Service Hubs | 2,276,176 | 618,081 | 560,049 | 91% |

Or another way to look at how the subjects are shared between the different types of Hubs is the following graph.

[Figure: Subjects unique to and shared between Hub Types.]

It appears that only a small number (3%) of subjects are shared between Hub types. Would increasing this number improve users' ability to discover resources from multiple Hubs?
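The percentages in the table above boil down to set arithmetic. Here is a small sketch of that calculation over (hub type, subject list) tuples; it illustrates the idea rather than reproducing the exact script used for these posts.

```python
def hub_type_overlap(records):
    """records: iterable of (hub_type, subject_list) tuples, where
    hub_type is either "content" or "service".

    Returns how many subjects appear only in Content-Hubs, only in
    Service-Hubs, and in both.
    """
    content, service = set(), set()
    for hub_type, subjects in records:
        (content if hub_type == "content" else service).update(subjects)
    shared = content & service
    total = len(content | service) or 1
    return {
        "content_only": len(content - service),
        "service_only": len(service - content),
        "shared": len(shared),
        "shared_pct": 100.0 * len(shared) / total,
    }
```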

More, More, More

Once we've worked on the areas mentioned above, I think we should work to raise the number of subjects per record within a given Hub. I don't think there is a magic number for everyone, but at UNT we try to have three subjects for each record whenever possible, so that's what we are shooting for. We can easily see improvement by looking at the mean and checking whether it goes up (even ever so slightly).

Next Steps

I think there is some work we could do to identify which records need specific kinds of subject work based on more involved processing of the input records, but I'm going to leave that for another post and probably another flight somewhere.

Hope you enjoyed these three posts and that they resonate at least a bit with you.

Feel free to send me a note on Twitter if you have questions, comments, or ideas for me about this.

DPLA Metadata Analysis: Part 2 – Beyond basic stats

More stats for subjects

In my previous post I displayed some of the statistics that are readily available from Solr as part of its StatsComponent functionality (if you haven’t used this part of Solr yet you really should). There are a few other things that we could collect to get a more complete picture of a metadata field.

So far we have min, max, number of records, total number of subjects, sum of squares, mean, and standard deviation. The other values I think we should take a look at are the following:

Records without Subjects – Number of records that have no subjects.

Percent of records without Subjects – Percentage of the Hub's records that don't have subjects.

Mode – The subjects-per-record count that is most common for a specific Hub.

Unique Subjects – Number of unique subject strings present for a specific Hub.

Hub Unique Subjects – Number of subjects that occur only in that Hub.

Entropy – A measure of the uncertainty in the metadata field; for our purposes it is a good way to understand how subjects are distributed across a Hub's records.

Below is a table that contains the fields listed above,  plus some relevant fields from the previous post. Each Hub has a row in this table.

| Hub Name | Records | Records Without Subjects | % Without Subjects | Avg. Subjects per Record | Subject Count Mode | Unique Subjects | # of Subjects Unique to Hub | Entropy |
|---|---|---|---|---|---|---|---|---|
| ARTstor | 56,342 | 6,586 | 11.7 | 3.5 | 3 | 9,560 | 4,941 | 0.73 |
| Biodiversity Heritage Library | 138,288 | 10,326 | 7.5 | 3.3 | 2 | 22,004 | 9,136 | 0.65 |
| David Rumsey | 48,132 | 30,167 | 62.7 | 0.5 | 0 | 123 | 30 | 0.76 |
| Digital Commonwealth | 124,804 | 6,040 | 4.8 | 2.4 | 1 | 41,704 | 31,094 | 0.77 |
| Digital Library of Georgia | 259,640 | 3,216 | 1.2 | 4.4 | 2 | 132,160 | 114,689 | 0.67 |
| Harvard Library | 10,568 | 167 | 1.6 | 2.5 | 2 | 9,257 | 7,204 | 0.76 |
| HathiTrust | 1,915,159 | 525,874 | 27.5 | 1.4 | 1 | 685,733 | 570,292 | 0.88 |
| Internet Archive | 208,953 | 44,872 | 21.5 | 1.8 | 1 | 56,911 | 28,978 | 0.80 |
| J. Paul Getty Trust | 92,681 | 73,978 | 79.8 | 0.4 | 0 | 2,777 | 1,852 | 0.60 |
| Kentucky Digital Library | 127,755 | 117,790 | 92.2 | 0.2 | 0 | 1,972 | 1,337 | 0.62 |
| Minnesota Digital Library | 40,533 | 0 | 0 | 5 | 4 | 24,472 | 17,545 | 0.74 |
| Missouri Hub | 41,557 | 11,451 | 27.6 | 2.3 | 0 | 6,893 | 4,338 | 0.69 |
| Mountain West Digital Library | 867,538 | 49,473 | 5.7 | 3 | 1 | 227,755 | 192,501 | 0.68 |
| National Archives and Records Administration | 700,952 | 619,212 | 88.3 | 0.3 | 0 | 7,086 | 3,589 | 0.63 |
| North Carolina Digital Heritage Center | 260,709 | 41,323 | 15.9 | 3.3 | 2 | 99,258 | 84,203 | 0.66 |
| Smithsonian Institution | 897,196 | 29,452 | 3.3 | 6.4 | 7 | 348,302 | 325,878 | 0.62 |
| South Carolina Digital Library | 76,001 | 7,460 | 9.8 | 3 | 2 | 23,842 | 18,110 | 0.72 |
| The New York Public Library | 1,169,576 | 208,472 | 17.8 | 1.7 | 1 | 69,210 | 52,002 | 0.62 |
| The Portal to Texas History | 477,639 | 58 | 0 | 11 | 10 | 104,566 | 87,076 | 0.49 |
| United States Government Printing Office (GPO) | 148,715 | 1,794 | 1.2 | 3.1 | 2 | 174,067 | 105,389 | 0.92 |
| University of Illinois at Urbana-Champaign | 18,103 | 4,221 | 23.3 | 3.8 | 0 | 6,183 | 3,076 | 0.63 |
| University of Southern California. Libraries | 301,325 | 35,106 | 11.7 | 2.9 | 2 | 65,958 | 51,822 | 0.59 |
| University of Virginia Library | 30,188 | 229 | 0.8 | 3.2 | 1 | 3,736 | 2,425 | 0.60 |

Looking at the row for The Portal to Texas History, we can see that of the 477,639 records in the dataset, 58 do not have any subjects, which is a very small percentage (about 0.012%). From there we can go to the average of 11 subjects per record with a mode of 10; nothing earth-shaking here, just more info. There are 104,566 unique subjects in the Portal's dataset, with 87,076 of those being unique to the Portal. Finally, the entropy for the Portal's subject field is 0.49; compared to GPO's, which is 0.92, you can interpret this to mean that the subject values are more "clumpy" for the Portal (a smaller number of subjects are used across a larger number of records) than for GPO (a larger number of subjects are spread across its records).

The following two tables further illustrate the entropy values for the Portal's and GPO's subjects. The first table shows the top ten subjects, and the number of records carrying each, from GPO's dataset.

| Subject | Records |
|---|---|
| National security–United States | 1,138 |
| United States. Congress. House–Rules and practice | 748 |
| Terrorism–United States–Prevention | 718 |
| United States. Department of Defense–Appropriations and expenditures | 631 |
| United States | 536 |
| Social security–United States–Periodicals | 487 |
| Emergency management–United States | 485 |
| Medicare | 441 |
| Consumer protection–United States | 417 |
| Wisconsin–Maps | 406 |

Now take a look at the top ten subjects and their counts for the Portal.

| Subject | Records |
|---|---|
| Places | 310,404 |
| United States | 306,597 |
| Texas | 305,551 |
| Business, Economics and Finance | 248,455 |
| Communications | 223,783 |
| Newspapers | 221,422 |
| Advertising | 218,527 |
| Journalism | 217,737 |
| Landscape and Nature | 76,308 |
| Geography and Maps | 70,742 |

So with the entropy value, you can read a lower number as being more like the Portal's subjects and a higher number as being more like GPO's. At the extremes, a value of 1.0 would mean that every subject is used by exactly one record, and a value of 0 would mean that there is only one subject, used by all of the records.
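For those who want to reproduce this, a normalized Shannon entropy (entropy divided by its maximum possible value, so it falls between 0 and 1) matches the interpretation above. Here is a small sketch of that calculation; the exact implementation behind the numbers in the table may differ slightly.

```python
import math
from collections import Counter

def normalized_subject_entropy(subject_occurrences):
    """subject_occurrences: a flat list of subject strings, one entry for
    every time a subject appears on a record within a Hub.

    Returns the Shannon entropy of the subject distribution divided by
    its maximum possible value (log of the number of unique subjects),
    so 0 means one subject carries everything and 1 means every subject
    is used equally often.
    """
    counts = Counter(subject_occurrences)
    if len(counts) < 2:
        return 0.0  # one (or zero) subjects means no uncertainty at all
    total = float(sum(counts.values()))
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))

print(normalized_subject_entropy(["Texas"] * 9 + ["Maps"]))  # ~0.47, "clumpy"
print(normalized_subject_entropy(["a", "b", "c", "d"]))      # 1.0, perfectly even
```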

Shared Subjects

In creating the table above I had to work out the number of subjects that each Hub has uniquely. While I was at it, I calculated this number for the whole dataset to find out how much subject overlap occurs.

The table below displays the breakdown of how subjects are distributed across Hub collections. For example, if two Hubs have the subject "Laws of Texas" then it is said to be shared by two Hubs. The breakdown for the metadata in the DPLA is as follows.

| # of Hubs with Subject | Count |
|---|---|
| 1 | 1,717,512 |
| 2 | 114,047 |
| 3 | 21,126 |
| 4 | 8,013 |
| 5 | 3,905 |
| 6 | 2,187 |
| 7 | 1,330 |
| 8 | 970 |
| 9 | 689 |
| 10 | 494 |
| 11 | 405 |
| 12 | 302 |
| 13 | 245 |
| 14 | 199 |
| 15 | 152 |
| 16 | 117 |
| 17 | 63 |
| 18 | 62 |
| 19 | 32 |
| 20 | 20 |
| 21 | 7 |
| 22 | 7 |

Most of the subjects, 1,717,512 to be exact, occur in only one Hub's collection.
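This breakdown is easy to recompute from the same (hub, subject list) tuples used earlier; here is a short sketch of the idea:

```python
from collections import Counter, defaultdict

def hubs_per_subject_distribution(records):
    """records: iterable of (hub_name, subject_list) tuples.

    Returns a Counter mapping "number of Hubs a subject appears in" to
    "how many subjects appear in exactly that many Hubs" -- the same
    breakdown shown in the table above.
    """
    hubs_for_subject = defaultdict(set)
    for hub, subjects in records:
        for subject in subjects:
            hubs_for_subject[subject].add(hub)
    return Counter(len(hubs) for hubs in hubs_for_subject.values())
```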

There are seven different subjects that are common across 22 of the 23 Hubs in the DPLA metadata dataset.

There should be one final post in this series where I can hopefully suggest what we should do with this data.

Again, if you want to chat about this post, hit me up on Twitter.

DPLA Metadata Analysis: Part 1 – Basic stats on subjects

On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.

I was interested in finding out the following pieces of information.

  1. What is the average number and standard deviation of subjects-per-record in the DPLA?
  2. How does this number compare across the partners?
  3. Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?

Building the Dataset

The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.

With a little work I was able to parse all of the metadata records and extract some information I was interested in working with, specifically the subjects for records.

So, after parsing through the records to get a list of subjects per record, along with the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could reuse for a few different projects.
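For anyone who wants to try the same thing, the extraction step looks roughly like the sketch below. The layout of the DPLA bulk file has changed over time, so treat the field paths here (_source, provider, sourceResource/subject) as assumptions to check against the copy you download.

```python
import gzip
import json

def iter_hub_subjects(path):
    """Yield (hub_name, subject_list) for each record in a DPLA bulk
    metadata dump.

    Assumes one JSON document per line with the DPLA MAP record under
    "_source" -- adjust the field paths to whatever your copy of the
    bulk file actually contains.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            doc = json.loads(line).get("_source", {})
            hub = doc.get("provider", {}).get("name", "unknown")
            subjects = [
                s.get("name", "").strip()
                for s in doc.get("sourceResource", {}).get("subject", [])
            ]
            yield hub, [s for s in subjects if s]
```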

Loading the records into the Solr index took quite a while (at roughly 1,000 documents per second).

After a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr's built-in StatsComponent functionality. For a description of this feature, see the documentation on Solr's site.

Answering the questions

The average number of subjects per record in the DPLA is 2.99, with a standard deviation of 3.90. There are 1,827,276 records with zero subjects, and records with as many as 1,476 subjects.
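Getting those two numbers out of Solr is a single StatsComponent request. Here is a minimal sketch, assuming a local core named dpla with an integer subject_count field on each record; both names are conventions of this sketch, not anything the DPLA provides.

```python
import requests

# Assumes a local Solr core named "dpla" where each record has an
# integer "subject_count" field holding its number of subjects.
SOLR_SELECT = "http://localhost:8983/solr/dpla/select"

params = {
    "q": "*:*",
    "rows": 0,
    "wt": "json",
    "stats": "true",
    "stats.field": "subject_count",
}
response = requests.get(SOLR_SELECT, params=params).json()
field_stats = response["stats"]["stats_fields"]["subject_count"]

# min, max, count, sum, sumOfSquares, mean, stddev -- the same values
# that show up in the per-Hub table below.
print(field_stats["mean"], field_stats["stddev"])
```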

Answering question number two involved a small script to create a table for us; you will find that table below.

| Hub Name | min | max | count | sum | sumOfSquares | mean | stddev |
|---|---|---|---|---|---|---|---|
| ARTstor | 0 | 71 | 56,342 | 194,948 | 1351826 | 3.460083064 | 3.467168662 |
| Biodiversity Heritage Library | 0 | 118 | 138,288 | 454,624 | 3100134 | 3.287515909 | 3.407385646 |
| David Rumsey | 0 | 4 | 48,132 | 22,976 | 33822 | 0.477353943 | 0.689083212 |
| Digital Commonwealth | 0 | 199 | 124,804 | 295,778 | 1767426 | 2.369940066 | 2.923194479 |
| Digital Library of Georgia | 0 | 161 | 259,640 | 1,151,369 | 8621935 | 4.43448236 | 3.680038874 |
| Harvard Library | 0 | 17 | 10,568 | 26,641 | 88155 | 2.520912188 | 1.409567895 |
| HathiTrust | 0 | 92 | 1,915,159 | 2,614,199 | 6951217 | 1.365003637 | 1.329038361 |
| Internet Archive | 0 | 68 | 208,953 | 385,732 | 1520200 | 1.84602279 | 1.966605872 |
| J. Paul Getty Trust | 0 | 36 | 92,681 | 32,999 | 146491 | 0.356049244 | 1.20575216 |
| Kentucky Digital Library | 0 | 13 | 127,755 | 26,009 | 82269 | 0.203584987 | 0.776219692 |
| Minnesota Digital Library | 1 | 78 | 40,533 | 202,484 | 1298712 | 4.995534503 | 2.661891328 |
| Missouri Hub | 0 | 139 | 41,557 | 97,115 | 606761 | 2.336910749 | 3.023203782 |
| Mountain West Digital Library | 0 | 129 | 867,538 | 2,641,065 | 17734515 | 3.044321978 | 3.34282307 |
| National Archives and Records Administration | 0 | 103 | 700,952 | 231,513 | 1143343 | 0.330283671 | 1.233711342 |
| North Carolina Digital Heritage Center | 0 | 1,476 | 260,709 | 869,203 | 8394791 | 3.333996908 | 4.591774892 |
| Smithsonian Institution | 0 | 548 | 897,196 | 5,763,459 | 56446687 | 6.423857217 | 4.652809633 |
| South Carolina Digital Library | 0 | 40 | 76,001 | 231,270 | 1125030 | 3.042986277 | 2.354387181 |
| The New York Public Library | 0 | 31 | 1,169,576 | 1,996,483 | 6585169 | 1.707014337 | 1.648179106 |
| The Portal to Texas History | 0 | 1,035 | 477,639 | 5,257,702 | 69662410 | 11.00768991 | 4.96771802 |
| United States Government Printing Office (GPO) | 0 | 30 | 148,715 | 457,097 | 1860297 | 3.073644219 | 1.749820977 |
| University of Illinois at Urbana-Champaign | 0 | 22 | 18,103 | 67,955 | 404383 | 3.753797713 | 2.871821391 |
| University of Southern California. Libraries | 0 | 119 | 301,325 | 863,535 | 4626989 | 2.865792749 | 2.672589058 |
| University of Virginia Library | 0 | 15 | 30,188 | 95,328 | 465286 | 3.157811051 | 2.332671249 |

The columns: min is the minimum number of subjects per record for a given Hub; Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. The column max shows the highest number of subjects on a single record; two groups, The Portal to Texas History and the North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. The column count is the number of records that each Hub had when the analysis was performed. The column sum is the total number of subject values for a given Hub; note this is not the number of unique subjects, which is not present in this dataset. The column mean shows the average number of subjects per record for the Hub, and stddev is the standard deviation from that number. The Portal to Texas History is at the top end of the averages with 11.01 subjects per record, and the Kentucky Digital Library is at the low end with 0.20 subjects per record.

The final question was whether there were differences between the Service-Hubs and the Content-Hubs; that breakdown is in the table below.

| Hub Type | min | max | count | sum | sumOfSquares | mean | stddev |
|---|---|---|---|---|---|---|---|
| Content-Hub | 0 | 548 | 5,736,178 | 13,207,489 | 84723999 | 2.302489393 | 3.077118385 |
| Service-Hub | 0 | 1,476 | 2,276,176 | 10,771,995 | 109293849 | 4.73249652 | 5.061612337 |

It appears that there is a higher number of subjects per record for the Service-Hubs than for the Content-Hubs, more than 2x, with 4.73 for Service-Hubs and 2.30 for Content-Hubs.

Another interesting number: 1,590,456 records contributed by Content-Hubs, or 28% of that collection, do not have subjects, compared to 236,811 records, or 10%, contributed by Service-Hubs.

I think individually we can come up with reasons that these numbers differ the way they do. There are explanations for all of this: where did the records come from? Were they generated as digital resource metadata records initially, or using an existing set of practices such as AACR2 in the MARC format, and how does that change the numbers? Are there things that the DPLA is doing to the subjects when they normalize them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, which inflates our numbers a bit in the tables above. I'd be interested to chat with anyone who has some "here's why" explanations for the numbers above.

Hit me up on Twitter if you want to chat about this.

How we assign unique identifiers

The UNT Libraries has made use of the ARK identifier specification for a number of years and has used these identifiers throughout our infrastructure on a number of levels. This post gives a little background about where, when, why, and a little about how we assign our ARK identifiers.

Terminology

The first thing we need to do is get some terminology out of the way so that we can talk about the parts consistently. This diagram is taken from the ARK documentation:

   http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
   \________________/ \__/ \___/ \______/ \____________/
     (replaceable)     |     |      |       Qualifier
          |       ARK Label  |      |    (NMA-supported)
          |                  |      |
Name Mapping Authority       |    Name (NAA-assigned)
         (NMA)               |
                  Name Assigning Authority Number (NAAN)

The ARK syntax can be summarized as follows:

  [http://NMA/]ark:/NAAN/Name[Qualifier]

For the UNT Libraries, we were assigned a Name Assigning Authority Number (NAAN) of 67531, so all of our identifiers start with ark:/67531/.

We mint Names for our ARKs locally with a home-grown system called a "Number Server." This Python Web service receives a request for a new number, assigns that number a prefix based on which instance we pull from, and returns the new Name.
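To give a flavor of what such a minter looks like, here is a minimal sketch of a namespace-aware counter backed by SQLite. This is only an illustration of the idea, not the actual UNT Number Server code.

```python
import sqlite3

class NumberServer(object):
    """A minimal sketch of a namespace-aware name minter -- an
    illustration of the idea only, not the UNT implementation."""

    def __init__(self, path="numbers.db"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS counters "
                "(namespace TEXT PRIMARY KEY, last INTEGER NOT NULL)"
            )

    def mint(self, namespace):
        """Return the next Name in a namespace, e.g. 'metapth1077976'."""
        with self.db:  # one transaction per mint
            self.db.execute(
                "INSERT OR IGNORE INTO counters VALUES (?, 0)", (namespace,)
            )
            self.db.execute(
                "UPDATE counters SET last = last + 1 WHERE namespace = ?",
                (namespace,),
            )
            (number,) = self.db.execute(
                "SELECT last FROM counters WHERE namespace = ?", (namespace,)
            ).fetchone()
        return "%s%d" % (namespace, number)

minter = NumberServer()
name = minter.mint("metapth")
print("ark:/67531/" + name)  # e.g. ark:/67531/metapth1 on a fresh database
```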

Namespaces

We have four different namespaces that we use for minting identifiers: metapth, metadc, metarkv, and coda. Additionally, we have a metatest namespace which we use when we need to test things out, but it isn't used that often. Finally, we have a historic namespace, metacrs, that is no longer used. Here is the breakdown of how we use these namespaces.

We try to assign Names from the metapth namespace to all items that end up on The Portal to Texas History whenever possible. We assign all other public-facing digital objects Names from the metadc namespace. This means that the UNT Digital Library and The Gateway to Oklahoma History both share Names from the metadc namespace. The metarkv namespace is used for "archive only" objects that go directly into our archival repository system; these include large Web archiving datasets. The coda namespace is used within our archival repository, called Coda. As stated earlier, the metatest namespace is only used for testing, and these items are thrown away after processing.

Name assignment

We assign Names in our systems in programmatic ways; this is always done as part of our digital item ingest process. We tend to process items in batches, most often several hundred items at a time and sometimes several thousand. When we process items they are processed in parallel, and therefore there is no logical order to how the Names are assigned to objects: they are in the order that they were processed, but may have no logical order past that.

We also don't assume that our Names are continuous. If you have identifiers metapth123 and metapth125, we don't assume that there is an item metapth124; sure, it may be there, but it also may never have been assigned. When we first started with these systems we would get worked up if we assigned several hundred or a few thousand identifiers and then had to delete those items; now this isn't an issue at all, but it took some time to get over.

Another assumption that can't be made in our systems: if you have an item, Newspaper Vol. 1 Issue 2, with an identifier of metapth333, there is no guarantee that Newspaper Vol. 1 Issue 3 will have metapth334; it might, but it isn't guaranteed. Another thing that happens in our systems is that items can be shared between systems, and membership in the Portal, UNT Digital Library, or Gateway is noted in the descriptive metadata. Therefore you can't say all metapth* identifiers are in the Portal, or that all metadc* identifiers are not; you have to look them up based on the metadata.

Once a number is assigned, it is never assigned again. This sounds like a silly thing to say, but it is important to remember: we don't try to save identifiers or reuse them as if we will run out of them.

Level of assignment

We currently assign an ARK identifier at the level of the intellectual object. So, for example, a newspaper issue gets an ARK, a photograph gets an ARK, and a book, a map, a report, an audio recording, or a video recording each gets an ARK. The sub-parts of an item are not given further unique identifiers, because the way we tend to interface with them is through formatted URLs such as those described here, or through other URL-based patterns such as the URLs we use to retrieve items from Coda.

http://coda.library.unt.edu/bag/ark:/67531/codanaf8/manifest-md5.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/coda_directives.py
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bagit.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bag-info.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/0=untl_aip_1.0
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/queries.xlsx
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/README.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/ba3ce7a1-0e3b-44cb-8b41-5d9d1b0438fe.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/7fe68777-54a2-4c71-95b2-aa33204ae84b.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadc498968.aip.mets.xml

Lessons Learned

Things I would do again.

  • I would most likely use just an incrementing counter for assigning identifiers.  Name minters such as Noid are also an option but I like the numbers with a short prefix.
  • I would not use a prefix such as UNT, to stay away from branding as much as possible. Even metapth is way too branded (see below).

Things I would change in our implementation.

  • I would only have one namespace for non-archival items. Two namespaces for production data just invite someone (usually me) to screw up, and then suddenly the reason for having one namespace over the other is meaningless. Just manage one namespace and move on.
  • I would not have a six- or seven-character prefix. metapth and metadc came as baggage from our first system; we decided that the 30k identifiers we had already minted had set our path. Now, after 1,077,975 identifiers in those namespaces, it seems a little silly that the first 3% of our items would still have such an effect on us today.
  • I would not brand our namespaces so closely to our system names. With metapth, metadc, and the legacy metacrs, people read too much into the naming convention. This is a big reason for opaque Names in the first place, and it is pretty important.

Things I might change in a future implementation.

  • I would probably pad my identifiers out to eight digits. While you can't rely on ARKs being generated in a given order, once they are assigned it is helpful to be able to sort by them and have a consistent order: metapth1, metapth100, metapth100000 don't always sort nicely the way metapth00000001, metapth00000100, metapth00100000 do (see the quick example below). But then again, longer runs of zeros are harder to transcribe, and I had a tough time just writing this example. Maybe I wouldn't do this.
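Here is the quick sorting example referenced above, showing why the padding helps:

```python
names = ["metapth1", "metapth100", "metapth100000", "metapth2"]

# Plain string sorting puts metapth100000 before metapth2
print(sorted(names))
# ['metapth1', 'metapth100', 'metapth100000', 'metapth2']

# Zero-padding the numeric part to eight digits keeps string order and
# numeric order in sync
padded = ["metapth%08d" % int(n[len("metapth"):]) for n in names]
print(sorted(padded))
# ['metapth00000001', 'metapth00000002', 'metapth00000100', 'metapth00100000']
```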

I don't think any of this post applies only to ARK identifiers; most identifier schemes at some level require a decision about how you are going to mint unique names for things. So hopefully this is useful to others.

If you have any specific questions for me, let me know on Twitter.