DPLA Description Fields: More statistics (so many graphs)

In the past few posts we looked at the length of the description fields in the DPLA dataset as a whole and at the provider/hub level.

The length of the description field isn’t the only value that was indexed for this work. In fact, I indexed a variety of different values for each of the descriptions in the dataset.

Below are the fields I am currently working with.

Field Indexed Value Example
dpla_id 11fb82a0f458b69cf2e7658d8269f179
id 11fb82a0f458b69cf2e7658d8269f179_01
provider_s usc
desc_order_i 1
description_t A corner view of the Santa Monica City Hall.; Streetscape. Horizontal photography.
desc_length_i 82
tokens_ss “A”, “corner”, “view”, “of”, “the”, “Santa”, “Monica”, “City”, “Hall”, “Streetscape”, “Horizontal”, “photography”
token_count_i 12
average_token_length_f 5.5833335
percent_int_f 0
percent_punct_f 0.048780486
percent_letters_f 0.81707317
percent_printable_f 1
percent_special_char_f 0
token_capitalized_f 0.5833333
token_lowercased_f 0.41666666
percent_1000_f 0.5
non_1000_words_ss “santa”, “monica”, “hall”, “streetscape”, “horizontal”, “photography”
percent_5000_f 0.6666667
non_5000_words_ss “santa”, “monica”, “streetscape”, “horizontal”
percent_en_dict_f 0.8333333
non_english_words_ss “monica”, “streetscape”
percent_stopwords_f 0.25
has_url_b FALSE

This post will try to pull together some of the data from the different fields listed above and present it in a way that we can hopefully use to derive some meaning.

More Description Length Discussion

In the previous posts I’ve primarily focused on the length of the description fields.  There are two other indexed fields related to length: the number of tokens in a description and the average length of those tokens.

I’ve included those values below, with two mean values for each: one calculated over all of the descriptions in the dataset (17,884,946 descriptions) and one calculated over only the descriptions that are one character in length or more (13,771,105 descriptions).

Field Mean – Total Mean – 1+ length
desc_length_i 83.321 108.211
token_count_i 13.346 17.333
average_token_length_f 3.866 5.020
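
As a rough illustration of how these three values relate for a single description, here is a minimal sketch in Python. It assumes a simple \w+ tokenizer, which may not match the exact tokenizer I used when indexing.

import re

def length_stats(description):
    # Number of characters in the raw description string
    desc_length = len(description)
    # Pull out runs of word characters as tokens, dropping punctuation
    tokens = re.findall(r"\w+", description)
    token_count = len(tokens)
    # Average number of characters per token (0.0 when there are no tokens)
    average_token_length = (
        sum(len(t) for t in tokens) / token_count if token_count else 0.0
    )
    return desc_length, token_count, average_token_length

# The example description from the field table near the top of this post
example = ("A corner view of the Santa Monica City Hall.; "
           "Streetscape. Horizontal photography.")
print(length_stats(example))  # should come out close to (82, 12, 5.58...)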

The graphs below are based only on the descriptions that are one character in length or more.

This first graph is being reused from a previous post that shows the average length of description by Provider/Hub.  David Rumsey and the Getty are the two that average over 250 characters per description.

Average Description Length by Hub

It shouldn’t surprise you that David Rumsey and Getty are two of the Providers/Hubs with the highest average token counts, since longer descriptions generally produce more tokens. There are a few providers/hubs that don’t follow this pattern, though. USC, which has an average description length of just over 50 characters, comes in as the third highest in average token count at over 40 tokens per description. A few other providers/hubs also look a bit different than their average description lengths would suggest.

Average Token Count by Provider

Below is a graph of the average token length by provider.  The lower the number, the shorter the average token.  The mean for the entire DPLA dataset for descriptions of length 1+ is just over 5 characters.

Average Token Length by Provider

That’s all I have to say about the various length-related statistics for this post, I swear!  Next we move on to some of the other metrics that I calculated when indexing things.

Other Metrics for the Description Field

Throughout this analysis I kept running into the question of how to account for the millions of records in the dataset that have no description present.  I couldn’t just ignore that fact, but I didn’t know exactly what to do with those records either.  So below I present the mean of many of the fields I indexed calculated two ways: over all of the descriptions, and over only the descriptions that are one or more characters in length.  The graphs that follow the table are all based on the subset of descriptions that are at least one character long.

Field Mean – Total Mean – 1+ length
percent_int_f 12.368% 16.063%
percent_punct_f 4.420% 5.741%
percent_letters_f 50.730% 65.885%
percent_printable_f 76.869% 99.832%
percent_special_char_f 0.129% 0.168%
token_capitalized_f 26.603% 34.550%
token_lowercased_f 32.112% 41.705%
percent_1000_f 19.516% 25.345%
percent_5000_f 31.591% 41.028%
percent_en_dict_f 49.539% 64.338%
percent_stopwords_f 12.749% 16.557%

Stopwords

Stopwords are words that occur very commonly in natural language.  I used a list of 127 stopwords for this work to help understand what percentage of a description (based on tokens) is made up of stopwords.  While stopwords generally carry little meaning on their own, they are a good indicator of natural language, so providers/hubs with a higher percentage of stopwords probably have more descriptions that resemble natural language.
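
To make that calculation concrete, here is a minimal sketch of how a percent_stopwords style value could be computed. The stopword set below is a tiny stand-in for illustration, not the 127-word list I actually used.

import re

# A tiny stand-in stopword set; the real list used for this work had 127 entries
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "was", "for"}

def percent_stopwords(description):
    tokens = [t.lower() for t in re.findall(r"\w+", description)]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in STOPWORDS) / len(tokens)

example = ("A corner view of the Santa Monica City Hall.; "
           "Streetscape. Horizontal photography.")
print(percent_stopwords(example))  # 0.25 with this stand-in list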

Percent Stopwords by Provider

Punctuation

I was curious about how much punctuation was present in a description on average.  I used the following characters as my set of “punctuation characters”:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I counted the number of characters in a description that came from this set and then divided that count by the total description length to get the percentage of the description that is punctuation.
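
As it happens, the character set above is exactly Python’s string.punctuation, so a minimal sketch of this calculation might look like the following.

import string

def percent_punctuation(description):
    if not description:
        return 0.0
    punct = sum(1 for ch in description if ch in string.punctuation)
    return punct / len(description)

example = ("A corner view of the Santa Monica City Hall.; "
           "Streetscape. Horizontal photography.")
print(percent_punctuation(example))  # roughly 0.0488, matching the field table above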

Percent Punctuation by Provider

Punctuation is common in natural language but it occurs relatively infrequently. For example, that last sentence was eighty characters long and only one of them was punctuation (the period at the end), which comes to a percent_punctuation of only 1.25%.  In the graph above you will see that the bhl provider/hub has over 50% of its descriptions with 25-49% punctuation.  That’s very high compared to the other hubs and to the overall DPLA average of about 5%. Digital Commonwealth has a percentage of descriptions that are 50-74% punctuation, which is pretty interesting as well.

Integers

Next up in our list of things to look at is the percentage of the description field that consists of integers.  For review,  integers are digits,  like the following.

0123456789

I used the same process for the percent integer as I did for the percent punctuation mentioned above.

Percent Integer by Provider

You can see that a couple of providers/hubs, bhl and the smithsonian, have quite a high percentage of integers in their descriptions.  The smithsonian has over 70% of its descriptions with a percent integer of over 70%.

Letters

Once we’ve looked at punctuation and integers, that really just leaves letters of the alphabet to make up the rest of a description field.

That’s exactly what we will look at next. For this I used the following characters to define letters.

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

I didn’t perform any case or character folding, so letters with diacritics wouldn’t be counted as letters in this analysis, but we will look at those a little bit later.

Percent Letter by Provider

For percent letters you would expect most descriptions to contain a very high percentage of letters.  Generally this appears to be true, but there are some odd providers/hubs again, mainly bhl and the smithsonian, though nypl, kdl and gpo also seem to have a different distribution of letters than others in the dataset.

Special Characters

The next thing to look at was the percentage of “special characters” used in a description.  For this I used the following definition: if a character is not present in the following list of characters (which also includes whitespace characters), then it is considered a “special character”.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~  
Percent Special Character by Provider

A note on reading the graph above: keep in mind that the y-axis only covers 95-100%, so while USC looks different here, only 3% of its descriptions have 50-100% special characters.  Most likely this is a set of descriptions whose metadata was created in a non-English language.

URLs

The final graph I want to look at in this post is the percentage of descriptions for each provider/hub that have a URL present.  I used the presence of either http:// or https:// in the description to decide whether or not it contains a URL.

Percent URL by Provider

The majority of providers/hubs don’t have URLs in their descriptions, with a few obvious exceptions.  The providers/hubs washington, mwdl, harvard, gpo and david_rumsey do have a reasonable number of descriptions with URLs, with washington leading at almost 20% of its descriptions having a URL present.

Again, this analysis is just looking at what high-level information about the descriptions can tell us.  The only metric we’ve looked at that actually goes into the content of the description field to pull out a little bit of meaning is the percent stopwords.  I have one more post in this series before we wrap things up, and then we will leave descriptions in the DPLA alone for a bit.

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Descriptive Metadata Lengths: By Provider/Hub

In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

I’ll jump right in with a table that shows all 29 of the providers/hubs represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum description length, the maximum length, the number of descriptions (remember values can be multi-valued, so there are more descriptions than records for a provider/hub), the sum (all of the lengths added together), the mean length, and finally the standard deviation.

provider min max count sum mean stddev
artstor 0 6,868 128,922 9,413,898 73.02 178.31
bhl 0 100 123,472 775,600 6.28 8.48
cdl 0 6,714 563,964 65,221,428 115.65 211.47
david_rumsey 0 5,269 166,313 74,401,401 447.36 861.92
digital-commonwealth 0 23,455 455,387 40,724,507 89.43 214.09
digitalnc 1 9,785 241,275 45,759,118 189.66 262.89
esdn 0 9,136 197,396 23,620,299 119.66 170.67
georgia 0 12,546 875,158 135,691,768 155.05 210.85
getty 0 2,699 264,268 80,243,547 303.64 273.36
gpo 0 1,969 690,353 33,007,265 47.81 58.20
harvard 0 2,277 23,646 2,424,583 102.54 194.02
hathitrust 0 7,276 4,080,049 174,039,559 42.66 88.03
indiana 0 4,477 73,385 6,893,350 93.93 189.30
internet_archive 0 7,685 523,530 41,713,913 79.68 174.94
kdl 0 974 144,202 390,829 2.71 24.95
mdl 0 40,598 483,086 105,858,580 219.13 345.47
missouri-hub 0 130,592 169,378 35,593,253 210.14 2325.08
mwdl 0 126,427 1,195,928 174,126,243 145.60 905.51
nara 0 2,000 700,948 1,425,165 2.03 28.13
nypl 0 2,633 1,170,357 48,750,103 41.65 161.88
scdl 0 3,362 159,681 18,422,935 115.37 164.74
smithsonian 0 6,076 2,808,334 139,062,761 49.52 137.37
the_portal_to_texas_history 0 5,066 1,271,503 132,235,329 104.00 95.95
tn 0 46,312 151,334 30,513,013 201.63 248.79
uiuc 0 4,942 63,412 3,782,743 59.65 172.44
undefined_provider 0 469 11,436 2,373 0.21 6.09
usc 0 29,861 1,076,031 60,538,490 56.26 193.20
virginia 0 268 30,174 301,042 9.98 17.91
washington 0 1,000 42,024 5,258,527 125.13 177.40

This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.

Average Description Length

The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

Average Description Length by Hub

For me I see that there are several bars that are very small on this graph, specifically for the providers bhl, kdl, nara, undefined_provider, and virginia.  I also noticed that david_rumsey has the highest average description length at about 450 characters.  Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn, which are at about 200 characters for the average length.

One thing to keep in mind from the previous post is that the average length for the whole DPLA was 83.32 characters in length, so many of the hubs were over that and some significantly over that number.

Mean and Standard Deviation by Partner/Hub

I think it is also helpful to take a look at the standard deviation in addition to just the average,  that way you are able to get a sense of how much variability there is in the data.

Description Length Mean and Stddev by Hub

There are a few providers/hubs that I think stand out from the others in the chart. First, david_rumsey has a stddev just short of double its average length.  The mwdl and missouri-hub have a very high stddev compared to their averages. For this dataset, it appears that these partners have a huge range in their description lengths compared to others.

There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.

Longest Description by Partner/Hub

In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

Longest Description by Hub

Remember the chart before this one that showed the average and the stddev next to each other for each Provider/Hub, where we saw a pretty large stddev for missouri-hub and mwdl? You may see why that is with the chart above.  Both of these hubs have descriptions of over 120,000 characters.

There are six Providers/Hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc.  I could be wrong, but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full text of the resource making its way into the metadata record.  We should remember, “metadata is data about data”… not the actual data.

Total Description Length of Descriptions by Provider/Hub

Total Description Length of All Descriptions by Hub

Just for fun I was curious about how the total lengths of the description fields per provider/hub would look on a graph, those really large numbers are hard to hold in your head.

It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content. In fact the most is contributed by mwdl.  If you look into the sourcing of these records you will understand why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have. The provider/hub mwdl is an aggregator of digital library content and has quite a bit more description content per record.

Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history which all have over 100,000,000 characters in their descriptions.

Closing for this post

Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try to determine, on a provider/hub basis, what might count as “too long” for a given provider based on some methods of outlier detection.  I’ve done the work for this but don’t know enough about the mathy parts to know if it is relevant to this dataset or not.

I have about a dozen more metrics that I want to look at for these records so I’m going to have to figure out a way to move through them a bit quicker otherwise this blog might get a little tedious (more than it already is?).

If you have questions or comments about this post,  please let me know via Twitter.

DPLA Description Field Analysis: Yes there really are 44 “page” long description fields.

In my previous post I mentioned that I was starting to take a look at the descriptive metadata fields in the metadata collected and hosted by the Digital Public Library of America.  That last post focused on records, how many records had description fields present, and how many were missing.  I also broke those numbers into the Provider/Hub groupings present in the DPLA dataset to see if there were any patterns.

Moving on, the next thing I wanted to start looking at was data related to each instance of the description field.  I parsed each of the description fields, calculated a variety of statistics for it, and then loaded that into my current data analysis tool, Solr, which acts as both my data store and my full-text index.

After about seven hours of processing I ended up with 17,884,946 description fields from the 11,654,800 records in the dataset.  You will notice that we have more descriptions than records; this is because a record can have more than one instance of a description field.

Lets take a look at a few of the high-level metrics.

Cardinality

I first wanted to find out the cardinality of the lengths of the description fields.  When I indexed each of the descriptions, I counted the number of characters in the description and saved that as an integer in a field called desc_length_i in the Solr index.  Once it was indexed, it was easy to retrieve the number of unique length values present.  There are 5,287 unique description lengths in the 17,884,946 descriptions that we are analyzing.  This isn’t too surprising or meaningful by itself, just a bit of description of the dataset.

I tried to make a few graphs to show the lengths and how many descriptions had what length.  Here is what I came up with.

Length of Descriptions in dataset

You can just barely see the blue line; the problem is that there are over 4 million zero-length descriptions while the longest lengths occur only once each.

Here is a second try using a log scale for the x axis.

Length of Descriptions in dataset (x axis log)

This reads a little better I think; you can see that there is a dive down from the zero lengths and then, at about 10 characters long, a spike back up.

One more graph to see what we can see,  this time a log-log plot of the data.

Length of Descriptions in dataset (log-log)

Average Description Lengths

Now that we are finished with the cardinality of the lengths,  next up is to figure out what the average description length is for the entire dataset.  This time the Solr StatsComponent is used and makes getting these statistics a breeze.  Here is a small table showing the output from Solr.

min max count missing sum sumOfSquares mean stddev
0 130,592 17,884,946 0 1,490,191,622 2,621,904,732,670 83.32 373.71
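
For anyone curious how numbers like these come out of Solr, a request along the following lines is roughly all it takes. This is a sketch only; the core name and URL are made up for the example.

import requests

# Hypothetical local Solr core holding the indexed description documents
SOLR_URL = "http://localhost:8983/solr/dpla_descriptions/select"

params = {
    "q": "*:*",
    "rows": 0,                       # no documents needed, just the stats
    "stats": "true",
    "stats.field": "desc_length_i",  # the integer length field described above
    "wt": "json",
}

response = requests.get(SOLR_URL, params=params).json()
stats = response["stats"]["stats_fields"]["desc_length_i"]
print(stats["min"], stats["max"], stats["count"], stats["mean"], stats["stddev"])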

Here we see that the minimum length for a description is zero characters (a record without a description present has a length of zero for that field in this model).  The longest description in the dataset is 130,592 characters long.  The total number of characters present in the dataset is nearly one and a half billion.  Finally, the number that we were after, the average length of a description, turns out to be 83.32 characters.

For those that might be curious what 84 characters (I rounded up instead of down) of description looks like,  here is an example.

Aerial photograph of area near Los Angeles Memorial Coliseum, Los Angeles, CA, 1963.

So not a horrible looking length for a description.  It feels like it is just about one sentence long with 13 “words” in this sentence.

Long descriptions

Jumping back a bit to look at the longest description field: that description is 130,592 characters long.  If you assume that the average single-spaced page is 3,000 characters long, this description field is 43.5 pages long.  The reader of this post that has spent time with aggregated metadata will probably say “looks like someone put the full text of the item into the record”.  If you’ve spent some serious (or maybe not that serious) time in the metadata mines (trenches?) you would probably mumble something like “ContentDM grumble grumble”, and you would be right on both accounts.  Here is the record on the DPLA site with the 130,592 character long description – http://dp.la/item/40a4f5069e6bf02c3faa5a445656ea61

The next thing I was curious about was the number of descriptions that are “long”.  To answer this I’m going to need a little back-of-the-envelope freedom in deciding what “long” means for a description field in a metadata record.  (In future blog posts I might be able to answer this with different analysis of the data, but this will hopefully do for today.)  For now I’m going to arbitrarily decide that anything over 325 characters in length is considered “too long”.

Descriptions: Too Long and Not Too Long

Looking at that pie chart, 5.8% of the descriptions are “too long” based on my ad-hoc metric from above.  This 5.8% of the descriptions makes up 708,050,671 characters, or 48% of the 1,490,191,622 characters in the entire dataset.  I bet if you looked a little harder you would find that the description field gets very close to the 80/20 rule, with 20% of the descriptions accounting for 80% of the overall description length.

Short descriptions

Now that we’ve worked with long descriptions, the next thing we should look at are the number of descriptions that are “short” in length.

There are 4,113,841 records that don’t have a description in the DPLA dataset.  This means that for this analysis 4,113,841 (23%) of the descriptions have a length of 0.  There are 2,041,527 (11%) descriptions that have a length between 1 and 10 characters. Below is the breakdown of those ten counts; you can see that there is a surprising number (777,887) of descriptions that have a single character as their descriptive contribution to the dataset.

Descriptions 10 characters or less

There is also an interesting spike at ten characters in length where suddenly we jump to over 500,000 descriptions in the DPLA.

So what?

Now that we have the average length of a description in the DPLA dataset, the number of descriptions that we consider “long”, and the number that we consider “short”, I think the very next question that gets asked is “so what?”

I think there are four big reasons that I’m working on this kind of project with the DPLA data.

One is that the DPLA is the largest aggregation in the US of descriptive metadata for digital resources in cultural heritage institutions. This is important because you get to take a look at a wide variety of data input rules, practices, and conversions from local systems to an aggregated metadata system.

Secondly, this data is licensed CC0 and available in a bulk data format, so it is easy to grab the data and start working with it.

Thirdly, there haven’t been that many studies on descriptive metadata like this that I’m aware of. OCLC will publish analysis of their MARC catalog data from time to time, and the research that was happening at UIUC’s GSLIS with IMLS-funded metadata isn’t going on anymore (great work to look at, by the way), so there really aren’t that many discussions about using large-scale aggregations of metadata to understand the practices in place in cultural heritage institutions across the US.  I am pretty sure that there is work being carried out across the Atlantic with the Europeana datasets that are available.

Finally I think that this work can lead to metadata quality assurance practices and indicators for metadata creators and aggregators about what may be wrong with their metadata (a message saying “your description is over a page long, what’s up with that?”).

I don’t think there are many answers so far in this work but I feel that they are moving us in the direction of a better understanding of our descriptive metadata world in the context of these large aggregations of metadata.

If you have questions or comments about this post,  please let me know via Twitter.

Beginning to look at the description field in the DPLA

Last year I took a look at the subject field and the date fields in the Digital Public Library of America (DPLA).  This time around I wanted to begin looking at the description field and see what I could see.

Before diving into the analysis, I think it is important to take a look at a few things.  First off, when you reference the DPLA Metadata Application Profile v4, you may notice that the description field is not a required field; in fact the field doesn’t show up in APPENDIX B: REQUIRED, REQUIRED IF AVAILABLE, AND RECOMMENDED PROPERTIES.  From that you can assume that this field is very optional.  Also, the description field, when present, is often used to communicate a variety of information to the user.  The DPLA data has examples that are clearly rights statements, notes, physical descriptions of the item, content descriptions of the item, and in some instances a place to store identifiers or names. Of all of the fields that one will come into contact with in the DPLA dataset, I would imagine that the description field is probably one of the ones with the highest variability of content.  So with that giant caveat, let’s get started.

So on to the data.

The DPLA makes available a data dump of the metadata in their system.  Last year I was analyzing just over 8 million records; this year the collection has grown to more than 11 million records (11,654,800 in the dataset I’m using).

The first thing that I had to accomplish was to pull out just the descriptions from the full json dataset that I downloaded.  I was interested in three values for each record, specifically the Provider or “Hub”, the DPLA identifier for the item and finally the description fields.  I finally took the time to look at jq, which made this pretty easy.

For those that are interested here is what I came up with to extract the data I wanted.

zcat all.json.gz | jq -nc --stream '. | fromstream(1|truncate_stream(inputs)) | {provider: (._source.provider["@id"]), id: (._source.id), descriptions: ._source.sourceResource.description?}'

This results in output that looks like this.

{"provider":"http://dp.la/api/contributor/cdl","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}
{"provider":"http://dp.la/api/contributor/cdl","id":"bca3f20535ed74edb20df6c738184a84","descriptions":["Lang: Maire, graveur."]}
{"provider":"http://dp.la/api/contributor/cdl","id":"76ceb3f9105098f69809b47aacd4e4e0","descriptions":null}
{"provider":"http://dp.la/api/contributor/cdl","id":"88c69f6d29b5dd37e912f7f0660c67c6","descriptions":null}

From there my plan was to write some short python scripts that can read a line, convert it from json into a python object and then do programmy stuff with it.
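
For what it’s worth, here is a minimal sketch of the kind of script I mean. It assumes the jq output above has been saved to a file named descriptions.jsonl (a hypothetical name), one JSON object per line.

import json
from collections import Counter

with_desc = Counter()
without_desc = Counter()

# descriptions.jsonl is the line-delimited output of the jq command above
with open("descriptions.jsonl") as handle:
    for line in handle:
        record = json.loads(line)
        # Reduce the provider URL down to its short name, e.g. "cdl"
        provider = (record["provider"] or "undefined_provider").rsplit("/", 1)[-1]
        descriptions = record["descriptions"] or []
        if isinstance(descriptions, str):
            descriptions = [descriptions]
        if any(d and d.strip() for d in descriptions):
            with_desc[provider] += 1
        else:
            without_desc[provider] += 1

for provider in sorted(set(with_desc) | set(without_desc)):
    print(provider, with_desc[provider], without_desc[provider])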

Who has what?

After parsing the data a bit I wanted to remind myself of the spread of the data in the DPLA collection.  There is a page on the DPLA’s site http://dp.la/partners/ that shows you how many records have been contributed by which Hub in the network.  This is helpful but I wanted to draw a bar graph to give a visual representation of this data.

DPLA Partner Records

As has been the case since it was added, Hathitrust is the biggest provider of records to the DPLA with over 2.4 million records.  Pretty amazing!

There are three other Hubs/Providers that contribute over 1 million records each: The Smithsonian, New York Public Library, and the University of Southern California Libraries. Down from there, there are three more that contribute over half a million records: Mountain West Digital Library, National Archives and Records Administration (NARA), and The Portal to Texas History.

There were 11,410 records (coded as undefined_provider) that are not currently associated with a Hub/Provider,  probably a data conversion error somewhere during the record ingest pipeline.

Which have descriptions

After the reminder about the size and shape of the Hubs/Providers in the DPLA dataset, we can dive right in and quickly see how well represented the description field is.

We can start off with another graph.

Percent of Hubs/Providers with and without descriptions

You can see that some of the Hubs/Providers have very few records (< 2%) with descriptions (Kentucky Digital Library, NARA) while others had a very high percentage (> 95%) of records with description fields present (David Rumsey, Digital Commonwealth, Digital Library of Georgia, J. Paul Getty Trust, Government Publishing Office, The Portal to Texas History, Tennessee Digital Library, and the University of Illinois at Urbana-Champaign).

Below is a full breakdown for each Hub/Provider showing how many and what percentage of the records have zero descriptions, or one or more descriptions.

Provider Records 0 Descriptions 1+ Descriptions 0 Descriptions % 1+ Descriptions %
artstor 107,665 40,851 66,814 37.94% 62.06%
bhl 123,472 64,928 58,544 52.59% 47.41%
cdl 312,573 80,450 232,123 25.74% 74.26%
david_rumsey 65,244 168 65,076 0.26% 99.74%
digital-commonwealth 222,102 8,932 213,170 4.02% 95.98%
digitalnc 281,087 70,583 210,504 25.11% 74.89%
esdn 197,396 48,660 148,736 24.65% 75.35%
georgia 373,083 9,344 363,739 2.50% 97.50%
getty 95,908 229 95,679 0.24% 99.76%
gpo 158,228 207 158,021 0.13% 99.87%
harvard 14,112 3,106 11,006 22.01% 77.99%
hathitrust 2,474,530 1,068,159 1,406,371 43.17% 56.83%
indiana 62,695 18,819 43,876 30.02% 69.98%
internet_archive 212,902 40,877 172,025 19.20% 80.80%
kdl 144,202 142,268 1,934 98.66% 1.34%
mdl 483,086 44,989 438,097 9.31% 90.69%
missouri-hub 144,424 17,808 126,616 12.33% 87.67%
mwdl 932,808 57,899 874,909 6.21% 93.79%
nara 700,948 692,759 8,189 98.83% 1.17%
nypl 1,170,436 775,361 395,075 66.25% 33.75%
scdl 159,092 33,036 126,056 20.77% 79.23%
smithsonian 1,250,705 68,871 1,181,834 5.51% 94.49%
the_portal_to_texas_history 649,276 125 649,151 0.02% 99.98%
tn 151,334 2,463 148,871 1.63% 98.37%
uiuc 18,231 127 18,104 0.70% 99.30%
undefined_provider 11,422 11,410 12 99.89% 0.11%
usc 1,065,641 852,076 213,565 79.96% 20.04%
virginia 30,174 21,081 9,093 69.86% 30.14%
washington 42,024 8,838 33,186 21.03% 78.97%

With so many of the Hub/Providers having a high percentage of records with descriptions, I was curious about the overall records in the DPLA.  Below is a pie chart that shows you what I found.

DPLA records with and without descriptions

Almost two thirds of the records in the DPLA have at least one description field. This is more than I would have expected for a field that is neither required nor recommended, but I think it is probably a good thing.

Descriptions per record

The final thing I wanted to look at in this post was the average number of description fields for each of the Hubs/Providers.  This time we will start off with the data table below.

Provider Records min median max mean stddev
artstor 107,665 0 1 5 0.82 0.84
bhl 123,472 0 0 1 0.47 0.50
cdl 312,573 0 1 10 1.55 1.46
david_rumsey 65,244 0 3 4 2.55 0.80
digital-commonwealth 222,102 0 2 17 2.01 1.15
digitalnc 281,087 0 1 19 0.86 0.67
esdn 197,396 0 1 1 0.75 0.43
georgia 373,083 0 2 98 2.32 1.56
getty 95,908 0 2 25 2.75 2.59
gpo 158,228 0 4 65 4.37 2.53
harvard 14,112 0 1 11 1.46 1.24
hathitrust 2,474,530 0 1 77 1.22 1.57
indiana 62,695 0 1 98 0.91 1.21
internet_archive 212,902 0 2 35 2.27 2.29
kdl 144,202 0 0 1 0.01 0.12
mdl 483,086 0 1 1 0.91 0.29
missouri-hub 144,424 0 1 16 1.05 0.70
mwdl 932,808 0 1 15 1.22 0.86
nara 700,948 0 0 1 0.01 0.11
nypl 1,170,436 0 0 2 0.34 0.47
scdl 159,092 0 1 16 0.80 0.41
smithsonian 1,250,705 0 2 179 2.19 1.94
the_portal_to_texas_history 649,276 0 2 3 1.96 0.20
tn 151,334 0 1 1 0.98 0.13
uiuc 18,231 0 3 25 3.47 2.13
undefined_provider 11,422 0 0 4 0.00 0.08
usc 1,065,641 0 0 6 0.21 0.43
virginia 30,174 0 0 1 0.30 0.46
washington 42,024 0 1 1 0.79 0.41

This time with an image

Average number of descriptions per record

You can see that there are several Hubs/Providers that have multiple descriptions per record, with the Government Publishing Office coming in at 4.37 descriptions per record.

I found it interesting that when you exclude the two Hubs/Providers that don’t really do descriptions (KDL and NARA), you see two that have a very low standard deviation from their mean: Tennessee Digital Library at 0.13 and The Portal to Texas History at 0.20. They don’t drift much from their almost one description per record for Tennessee and almost two descriptions per record for Texas. It makes me think that the records that deviate are probably ones that each of those Hubs/Providers would like to have identified so they could go in and add a few descriptions.

Closing

Well that wraps up this post that I hope is the first in a series of posts about the description field in the DPLA dataset.  In subsequent posts we will move away from record level analysis of description fields and get down to the field level to do some analysis of the descriptions themselves.  I have a number of predictions but I will hold onto those for now.

If you have questions or comments about this post,  please let me know via Twitter.

How many of the EOT2008 PDF files were harvested in EOT2012

In my last post I started looking at some of the data from the End of Term 2012 Web Archive snapshot that we have at the UNT Libraries.  For more information about EOT2012 take a look at that previous post.

EOT2008 PDFs

From the EOT2008 Web archive I had extracted the 4,489,675 unique (by hash) PDF files that were present and carried out a bit of analysis on them as a whole to see if there was anything interesting I could tease out.  The results of that investigation I presented at an IS&T Archiving conference a few years back.  The text from the proceedings for that submission is here and the slides presented are here.

Moving forward several years,  I was curious to see how many of those nearly 4.5 million PDFs were still around in 2012 when we crawled the federal Web again as part of the EOT2012 project.

I used the same hash dataset from the previous post to do this work, which made things very easy.  I first pulled the hash values for the 4,489,675 PDF files from EOT2008.  Next I loaded all of the hash values from the EOT2012 crawls. The final step was to iterate through each of the PDF file hashes and do a lookup to see if that content hash is present in the EOT2012 hash dataset.  Pretty straightforward.
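
The whole comparison boils down to set membership tests. Here is a minimal sketch, assuming one hash value per line in two hypothetical files, eot2008_pdf_hashes.txt and eot2012_hashes.txt.

# Load the EOT2012 hashes into a set for fast membership lookups
with open("eot2012_hashes.txt") as handle:
    eot2012_hashes = {line.strip() for line in handle if line.strip()}

found = missing = 0
with open("eot2008_pdf_hashes.txt") as handle:
    for line in handle:
        digest = line.strip()
        if not digest:
            continue
        if digest in eot2012_hashes:
            found += 1
        else:
            missing += 1

print("Found:", found, "Missing:", missing)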

Findings

After the numbers finished running,  it looks like we have the following.

  PDFs Percentage
Found 774,375 17%
Missing 3,715,300 83%
Total 4,489,675 100%

Put into a pie chart where red equals bad.

EOT2008 PDFs in EOT2012 Archive

So 83% of the PDF files that were present in 2008 are not present in the EOT2012 Archive.

With a little work it wouldn’t be hard to see how many of these PDFs are still present on the web today at the same URL as in 2008.  I would imagine it is a much smaller number than the 17%.

A thing to note about this is that because I am using content hashes and not URLs, it is possible that an EOT2008 PDF is available at a different URL entirely in 2012 when it was harvested again. So the URL might not be available but the content could be available at another location.

If you have questions or comments about this post,  please let me know via Twitter.

Poking at the End of Term 2012 Presidential Web Archive

In preparation for some upcoming work with the End of Term 2016 crawl and a few conference talks I should be prepared for, I thought it might be a good thing to start doing a bit of long-overdue analysis of the End of Term 2012 (EOT2012) dataset.

A little bit of background for those that aren’t familiar with the End of Term program.  Back in 2008 a group of institutions got together to collaboratively collect a snapshot of the federal government with a hope to preserve the transition from the Bush administration into what became the Obama administration.  In 2012 this group added a few additional partners and set out to take another snapshot of the federal Web presence.

The EOT2008 dataset was studied as part of a research project funded by IMLS but the EOT2012 really hasn’t been looked at too much since it was collected.

As part of the EOT process, several institutions crawl data that is directly relevant to their collection missions, and then we all share what we collect with the group as a whole, so any of the institutions interested in acquiring a copy of the entire collected EOT archive can do so.  In 2012 the Internet Archive, the Library of Congress, and the UNT Libraries were the institutions that committed resources to crawling. UNT was also interested in acquiring this archive for its collection, which is why I have a copy locally.

For the analysis that I am interested in doing for this blog post, I took a copy of the combined CDX files for each of the crawling institutions as the basis of my dataset.  There was one combined CDX for each of IA, LOC, and UNT.

If you look at the three CDX files to see how many total lines are present, you can pretty easily get the number of URL captures (CDX entries) in the collection.  This ends up being the following.

Collecting Org Total CDX Entries % of EOT2012 Archive
IA 80,083,182 41.0%
LOC 79,108,852 40.5%
UNT 36,085,870 18.5%
Total 195,277,904 100%

Here is how that looks as a pie chart.

EOT2012 Collection Distribution

If you pull out all of the content hash values you get the number of “unique files by content hash” in each CDX file. By doing this you ignore repeat captures of the same content on different dates, as well as the same content occurring at different URL locations on the same or on different hosts.
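
Here is a rough sketch of how those hash values can be pulled out of a CDX file. It assumes the common 11-column CDX layout (“N b a m s k r M S V g”), where the content digest is the sixth whitespace-delimited field; the file name is just an example.

# Collect the unique content digests from a CDX file.
# Assumes the digest is the sixth whitespace-delimited column,
# as in the common " CDX N b a m s k r M S V g" layout.
def unique_digests(cdx_path):
    digests = set()
    with open(cdx_path) as handle:
        for line in handle:
            if line.startswith(" CDX"):
                continue  # skip the format header line
            fields = line.split()
            if len(fields) >= 6:
                digests.add(fields[5])
    return digests

print(len(unique_digests("UNT_eot2012.cdx")))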

Collecting Org Unique CDX Hashes % of EOT2012 Archive
IA 45,487,147 38.70%
LOC 50,835,632 43.20%
UNT 25,179,867 21.40%
Total 117,616,637 100.00%

Again as a pie chart

Unique hash values

It looks like there was a little bit of change in the percentages of unique content, with UNT and LOC going up a few percentage points and IA going down.  I would guess that this has to do with the fact that for the EOT projects, the IA conducted many broad crawls at multiple times during the project, which resulted in more overlap.

Here is a table that can give you a sense of how much duplication (based on just the hash values) there is in each of the collections and then overall.

Collecting Org Total CDX Entries Unique CDX Hashes Duplication
IA 80,083,182 45,487,147 43.20%
LOC 79,108,852 50,835,632 35.70%
UNT 36,085,870 25,179,867 30.20%
Total 195,277,904 117,616,637 39.80%

You will see that UNT has the least duplication (possibly more focused crawls with less re-crawling), while IA has the most (broader crawling with more captures of the same data?).

Questions to answer.

There were three questions that I wanted to answer for this look at the EOT data.

  1. How many hashes are common across all CDX files
  2. How many hashes are unique to only one CDX file
  3. How many hashes are shared by two CDX files but not by the third.

Common Across all CDX files

The first was pretty easy to answer and just required taking all three lists of hashes, and identifying which hash appears in each list (intersection).

There are only 237,171 (0.2%) hashes shared by IA, LOC and UNT.

Content crawled by all three

You can see that there is a very small amount of content that is present in all three of the CDX files.

Unique Hashes to one CDX file

Next up was the number of hashes that were unique to a single collecting organization’s CDX file. This took two steps: first I took the difference of two hash sets, then took that resulting set and took the difference from the third set.
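
All three questions reduce to basic set operations once the hashes are loaded into Python sets. Here is a sketch, reusing the unique_digests() helper sketched earlier in this post and hypothetical file names.

ia = unique_digests("IA_eot2012.cdx")
loc = unique_digests("LOC_eot2012.cdx")
unt = unique_digests("UNT_eot2012.cdx")

# 1. Hashes common to all three CDX files (intersection)
in_all_three = ia & loc & unt

# 2. Hashes unique to a single collecting organization
only_ia = ia - loc - unt
only_loc = loc - ia - unt
only_unt = unt - ia - loc

# 3. Hashes shared by exactly two CDX files but not the third
ia_loc_only = (ia & loc) - unt
ia_unt_only = (ia & unt) - loc
loc_unt_only = (loc & unt) - ia

print(len(in_all_three), len(only_ia), len(only_loc), len(only_unt))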

Collecting Org Unique Hashes Unique to Collecting Org Percentage Unique
IA 45,487,147 42,187,799 92.70%
LOC 50,835,632 48,510,991 95.40%
UNT 25,179,867 23,269,009 92.40%
Unique to a collecting org

It appears that there is quite a bit of unique content in each of the CDX files, with over 92% of the content being unique to the collecting organization.

Common between two but not three CDX files

The final question to answer was how much of the content is shared between two collecting organizations but not present in the third’s contribution.

Shared by: Unique Hashes
Shared by IA and LOC but not UNT 1,737,980
Shared by IA and UNT but not LOC 1,324,197
Shared by UNT and LOC but not IA 349,490

Closing

Unique and shared hashes

Based on this brief look at how content hashes are distributed across the three CDX files that make up the EOT2012 archive, I think a takeaway is that there is very little overlap between the crawling that these three organizations carried out during the EOT harvests.  Essentially 97% of content hashes are present in just one repository.

I don’t think this tells the whole story, though.  There are quite a few caveats that need to be taken into account.  First of all, this only takes into account the content hashes that are included in the CDX files.  If you crawl a dynamic webpage that prints out the time each time you visit it, you will get a different content hash on each visit.  So “unique” is only in the eyes of the hash function that is used.

There are quite a few other bits of analysis that can be done with this data, hopefully I’ll get around to doing a little more in the next few weeks.

If you have questions or comments about this post,  please let me know via Twitter.

Identify outliers: Building a user interface feature.

Background:

At work we are deep in the process of redesigning the user interface of The Portal to Texas History.  We have a great team in our User Interfaces Unit that I get to work with on this project; they do the majority of the work and I have been a data gatherer, identifying problems that come up in our data.

As we are getting closer to our beta release we had a new feature we wanted to add to the collection and partner detail pages.  Below is the current mockup of this detail page.

Collection Detail Mockup

Quite long, isn’t it?  We are trying something out (more on that later).

The feature that we are wanting more data for is the “At a Glance” feature. This feature displays the number of unique values (cardinality) of a specific field for the collection or partner.

At A Glance Detail

So in the example above we show that there are 132 items, 1 type, 3 titles, 1 contributing partner, 3 decades and so on.

All this is pretty straightforward so far.

The next thing we want to do is to highlight a box in a different color if its value is different from the norm.  For example, if the average collection has three different languages present, then we might want to highlight the language box for a collection that has ten languages represented.

There are several ways that we can do this.  First off, we just made some guesses and coded in values that we felt would be good thresholds.  I wanted to see if we could figure out a way to identify these thresholds based on the data in the collection itself.  That’s what this blog post is going to try to do.

Getting the data:

First of all I need to pull out my “I couldn’t even play an extra who stands around befuddled on a show about statistics, let alone play a stats person on TV” card (wow I really tried with that one) so if you notice horribly incorrect assumptions or processes here, 1. you are probably right, and 2. please contact me so I can figure out what I’m doing wrong.

That being said here we go.

We currently have 453 unique collections in The Portal to Texas History.  For each of these collections we are interested in calculating the cardinality of the following fields

  • Number of items
  • Number of languages
  • Number of series titles
  • Number of resource types
  • Number of countries
  • Number of counties
  • Number of states
  • Number of decades
  • Number of partner institutions
  • Number of items uses

To calculate these numbers I pulled data from our trusty Solr index making use of the stats component and the stats.calcdistinct=true option.  Using this I am able to get the number of unique values for each of the fields listed above.

Now that I have the numbers from Solr I can format them into lists of the unique values and start figuring out how I want to define a threshold.

Defining a threshold:

For this first attempt I decided to try and define the threshold using the Tukey Method that uses the Interquartile Range (IQR).  If you never took any statistics courses (I was a music major so not much math for me) I found this post Highlighting Outliers in your Data with the Tukey Method extremely helpful.

First off I used the handy st program to get an overview of the data that I was going to be working with.

Field N min q1 median q3 max sum mean stddev stderr
items 453 1 98 303 1,873 315,227 1,229,840 2,714.87 16,270.90 764.47
language 453 1 1 1 2 17 802 1.77 1.77 0.08
titles 453 0 1 1 3 955 5,082 11.22 65.12 3.06
type 453 1 1 1 2 22 1,152 2.54 3.77 0.18
country 453 0 1 1 1 73 1,047 2.31 5.59 0.26
county 453 0 1 1 7 445 8,901 19.65 53.98 2.54
states 453 0 1 1 2 50 1,902 4.20 8.43 0.40
decade 453 0 2 5 9 49 2,759 6.09 5.20 0.24
partner 453 1 1 1 1 103 1,007 2.22 7.22 0.34
uses 453 5 3,960 17,539 61,575 10,899,567 50,751,800 112,035 556,190 26,132.1

With the q1 and q3 values we can calculate the IQR for the field; then, using the standard 1.5 multiplier or the extreme multiplier of 3, we multiply the IQR, add the result back to the q3 value, and get our upper threshold.

So for the county field

7 - 1 = 6
6 * 1.5 = 9
7 + 9 = 16
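
Wrapped up as code, the calculation looks something like the sketch below; q1 and q3 come from whatever produced the quartiles (the st program in my case).

def tukey_upper_threshold(q1, q3, multiplier=1.5):
    # Upper Tukey fence: q3 plus a multiple of the interquartile range
    iqr = q3 - q1
    return q3 + multiplier * iqr

# The county field from the table above: q1 = 1, q3 = 7
print(tukey_upper_threshold(1, 7))     # 16.0 with the standard 1.5 multiplier
print(tukey_upper_threshold(1, 7, 3))  # 25.0 with the extreme multiplier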

This gives us the threshold values in the table below.

Field Threshold – 1.5 Threshold – 3
items 4,536 7,198
language 4 5
titles 6 9
type 4 5
country 1 1
county 16 25
states 4 5
decade 20 30
partner 1 1
uses 147,997 234,420

Moving forward we can use these thresholds as a way of saying “this field stands out in this collection from other collections”  and make the box in the “At a Glance” feature a different color.

If you have questions or comments about this post,  please let me know via Twitter.

Portal to Texas History Newspaper OCR Text Datasets

Overview:

A week or so ago I had a faculty member at UNT ask if I could work with one of his students to get a copy of the OCR text of several titles of historic Texas newspapers that we have on The Portal to Texas History.

While we provide public access to the full text for searching and discovering newspaper pages of interest to users, we don’t have a very straightforward way to publicly obtain the full text for a given issue, let alone for full titles that may be many tens of thousands of pages in size.

At the end of the week I had pulled roughly 79,000 issues of newspapers comprising over 785,000 pages of OCR text. We are making these publicly available in the UNT Data Repository under a CC0 License so that others might be able to make use of them.  Feel free to jump over to the UNT Digital Library to grab a copy.

Background:

The UNT Libraries and The Portal to Texas History have operated the Texas Digital Newspaper Program for nine years with a goal of preserving and making available as many newspapers published in Texas as we are able to collect and secure rights to.  At this time we have nearly 3.5 million pages of Texas newspapers ranging from the 1830’s all the way to 2015. Jump over to the TDNP collection in the Portal to take a look at all of the content there including a list of all of the titles we have digitized.

The titles in the datasets were chosen by the student and professor and seem to be a fairly decent sampling of communities that we have in the Portal that are both large in size and have a significant number of pages of newspapers digitized.

Here is a full list of the communities, page count, issue count, and links to the dataset itself in the UNT Digital Library.

Dataset Name Community County Years Covered Issues Pages
Portal to Texas History Newspaper OCR Text Dataset: Abilene Abilene Taylor County 1888-1923 7,208 62,871
Portal to Texas History Newspaper OCR Text Dataset: Brenham Brenham Washington County 1876-1923 10,720 50,368
Portal to Texas History Newspaper OCR Text Dataset: Bryan Bryan Brazos County 1883-1922 5,843 27,360
Portal to Texas History Newspaper OCR Text Dataset: Denton Denton Denton County 1892-1911 690 4,686
Portal to Texas History Newspaper OCR Text Dataset: El Paso El Paso El Paso County 1881-1921 17,104 177,640
Portal to Texas History Newspaper OCR Text Dataset: Fort Worth Fort Worth Tarrant County 1883-1896 4,146 36,199
Portal to Texas History Newspaper OCR Text Dataset: Gainesville Gainesville Cooke County 1888-1897 2,286 9,359
Portal to Texas History Newspaper OCR Text Dataset: Galveston Galveston Galveston County 1849-1897 8,136 56,953
Portal to Texas History Newspaper OCR Text Dataset: Houston Houston Harris County 1893-1924 9,855 184,900
Portal to Texas History Newspaper OCR Text Dataset: McKinney McKinney Collin County 1880-1936 1,568 12,975
Portal to Texas History Newspaper OCR Text Dataset: San Antonio San Antonio Bexar County 1874-1920 6,866 130,726
Portal to Texas History Newspaper OCR Text Dataset: Temple Temple Bell County 1907-1922 4,627 44,633

Dataset Layout

Each of the datasets is a gzipped tar file that contains a multi-level directory structure.  In addition there is a README.txt created for each of the datasets. Here is an example of the Denton README.txt

Each of the datasets is organized by title. Here is the structure for the Denton dataset.

Denton
└── data
    ├── Denton_County_News
    ├── Denton_County_Record_and_Chronicle
    ├── Denton_Evening_News
    ├── Legal_Tender
    ├── Record_and_Chronicle
    ├── The_Denton_County_Record
    └── The_Denton_Monitor

Within each of the title folders are subfolders for each year that we have a newspaper issue for.

Denton/data/Denton_County_Record_and_Chronicle/
├── 1898
├── 1899
├── 1900
└── 1901

Finally, each of the year folders contains folders for each issue present in The Portal to Texas History on the day the dataset was extracted.

Denton
└── data
    ├── Denton_County_News
    │   ├── 1892
    │   │   ├── 18920601_metapth502981
    │   │   ├── 18920608_metapth502577
    │   │   ├── 18920615_metapth504880
    │   │   ├── 18920622_metapth504949
    │   │   ├── 18920629_metapth505077
    │   │   ├── 18920706_metapth501799
    │   │   ├── 18920713_metapth502501
    │   │   ├── 18920720_metapth502854

Each of these issue folders has the date of publication in the yyyymmdd format and the ARK identifier from the Portal for the folder name.

Each of these folders is a valid BagIt bag that can be verified with tools like bagit.py. Here is the structure for an issue.

18921229_metapth505423
├── bag-info.txt
├── bagit.txt
├── data
│   ├── metadata
│   │   ├── ark
│   │   ├── metapth505423.untl.xml
│   │   └── portal_ark
│   └── text
│       ├── 0001.txt
│       ├── 0002.txt
│       ├── 0003.txt
│       └── 0004.txt
├── manifest-md5.txt
└── tagmanifest-md5.txt

The OCR text is located in the text folder, and three metadata files are present in the metadata folder: a file called ark that contains the ARK identifier for the item, a file called portal_ark that contains the URL to this issue in The Portal to Texas History, and finally a metadata file in the UNTL metadata format.
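
Since each issue folder is a standard bag, a quick programmatic check is possible with the bagit Python library. A minimal sketch follows; the path is just an example built from the Denton layout shown above.

import bagit

# Path to a single issue bag from the Denton dataset layout shown above
bag_path = "Denton/data/Denton_County_News/1892/18921229_metapth505423"

bag = bagit.Bag(bag_path)
if bag.is_valid():
    print("bag is valid:", bag_path)
else:
    print("bag failed validation:", bag_path)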

I hope that these datasets are useful to folks interested in trying their hand at working with a large collection of OCR text from newspapers. I should remind everyone that this is uncorrected OCR text and will most likely need a fair bit of pre-processing because it is far from perfect.

If you have questions or comments about this post,  please let me know via Twitter.

Finding figures and images in Electronic Theses and Dissertations (ETD)

One of the things that we are working on at UNT is a redesign of The Portal to Texas History’s interface.  In doing so I’ve been looking around quite a bit at other digital libraries to get ideas of features that we could incorporate into our new user experience.

One feature that I found that looked pretty nifty was the “peek” interface for the Carolina Digital Repository. They make the code for this interface available for others to use via the UNC Libraries GitHub in the peek repository.  I think this is an interesting interface, but I still had the question of “how did you decide which images to choose?”.  I came across the peek-data repository, which suggested that choosing the images was a manual process, and I also found a PowerPoint presentation titled “A Peek Inside the Carolina Digital Repository” by Michael Daines that confirmed this is the case.  These slides are a few years old so I don’t know if the process is still manual.

I really like this idea and would love to try and implement something similar for some of our collections but the thought of manually choosing images doesn’t sound like fun at all.  I looked around a bit to see if I could borrow from some prior work that others have done.  I know that the Internet Archive and the British Library have released some large image datasets that appear to be the “interesting” images from books in their collections.

Less and More interesting images

I ran across a blog post by Chris Adams who works on the World Digital Library at the Library of Congress called “Extracting images from scanned book pages” that seemed to be close to what I wanted to do,  but wasn’t exactly it either.

I remembered back to a Code4Lib Lightning Talk a few years back from Eric Larson called “Finding image in book page images” and the companion GitHub repository picturepages that contains the code that he used.   In reviewing the slides and looking at the code I think I found what I was looking for,  at least a starting point.

Process

What Eric proposed for finding interesting images was that you would take an image, convert it to grayscale, increase the contrast dramatically, convert this new image into a single-pixel-wide image that is 1500 pixels tall, and sharpen the image.  That resulting image would be inverted, have a threshold applied to it to convert everything to black or white pixels, and then be inverted again.  Finally, the resulting values of either black or white pixels are analyzed to see if there are areas of the image that are 200 or more pixels long that are solid black.

convert #{file} -colorspace Gray -contrast -contrast -contrast -contrast -contrast -contrast -contrast -contrast -resize 1X1500! -sharpen 0x5 miff:- | convert - -negate -threshold 0 -negate TXT:#{filename}.txt

The script above uses ImageMagick to convert an input image to grayscale, calls contrast eight times, resizes the image, and then sharpens the result. It pipes this file into convert again, inverts the colors, applies a threshold, and inverts the colors back. The output is saved as a text file instead of an image, with one line per pixel. The output looks like this.

# ImageMagick pixel enumeration: 1,1500,255,srgb
...
0,228: (255,255,255)  #FFFFFF  white
0,229: (255,255,255)  #FFFFFF  white
0,230: (255,255,255)  #FFFFFF  white
0,231: (255,255,255)  #FFFFFF  white
0,232: (0,0,0)  #000000  black
0,233: (0,0,0)  #000000  black
0,234: (0,0,0)  #000000  black
0,235: (255,255,255)  #FFFFFF  white
0,236: (255,255,255)  #FFFFFF  white
0,237: (0,0,0)  #000000  black
0,238: (0,0,0)  #000000  black
0,239: (0,0,0)  #000000  black
0,240: (0,0,0)  #000000  black
0,241: (0,0,0)  #000000  black
...

The next step was to loop through each of the lines in the file to see if there was a sequence of 200 black pixels.
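
A minimal sketch of that check against the TXT output shown above might look like the following; the function name and run_length parameter are my own placeholders, not part of Eric’s code.

def txt_has_black_run(txt_path, run_length=200):
    # Scan an ImageMagick TXT pixel enumeration (one pixel per line) and
    # report whether it contains run_length or more consecutive black pixels.
    run = 0
    with open(txt_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue                       # skip the header line
            if line.rstrip().endswith("black"):
                run += 1
                if run >= run_length:
                    return True
            else:
                run = 0
    return False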

I pulled a set of images from an ETD that we have in the UNT Digital Library and tried a Python port of Eric’s code that I hacked together.  For me things worked pretty well; it was able to identify the same pages that I would have manually pulled as “interesting” on my own.

But I ran into a problem: the process was pretty slow.

I pulled a few more sets of page images from ETDs and found that for those images it would take the ImageMagick convert process up to 23 seconds per image to create the text files that I needed to work with.  This made me ask if I could implement this same sort of processing workflow with just Python.

I need a Pillow

I have worked with the Python Imaging Library (PIL) a few times over the years and had a feeling it could do what I was interested in doing.  I ended up using Pillow, which is a “friendly fork” of the original PIL library.  My thought was to apply the same processing workflow as was carried out in Eric’s script and see if doing it all in Python would be reasonable.

I ended up with an image processing workflow that looks like this:

# Pillow imports used by the workflow below
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

# Open image file (filename is the path to a page image)
im = Image.open(filename)

# Convert image to grayscale image
g_im = ImageOps.grayscale(im)

# Create enhanced version of image using aggressive Contrast
e_im = ImageEnhance.Contrast(g_im).enhance(100)

# resize image into a tiny 1x1500 pixel image
# ANTIALIAS, BILINEAR, and BICUBIC work, NEAREST doesn't
t_im = e_im.resize((1, 1500), resample=Image.BICUBIC)

# Sharpen skinny image file
st_im = t_im.filter(ImageFilter.SHARPEN)

# Invert the colors
it_im = ImageOps.invert(st_im)

# If a pixel isn't black (0), make it white (255)
fixed_it_im = it_im.point(lambda x: 0 if x < 1 else 255, 'L')

# Invert the colors again
final = ImageOps.invert(fixed_it_im)

final.show()

I was then able to iterate through the pixels in the final image with the getdata() method and apply the same logic of identifying images that have sequences of black pixels that were over 200 pixels long.
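
As a rough sketch, that check against the final Pillow image looks something like this; the 200-pixel threshold is the same one Eric used, and the variable names are mine.

# final is the 1x1500 grayscale image produced by the workflow above
run = 0
longest_run = 0
for value in final.getdata():
    if value == 0:            # black pixel
        run += 1
        longest_run = max(longest_run, run)
    else:
        run = 0

# pages with a solid black run of 200 or more pixels get flagged as "interesting"
interesting = longest_run >= 200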

Here are some example thumbnails from three ETDs: first all of the page images, and then just the images identified by the above algorithm as “interesting”.

Example One

Thumbnails for ark:/67531/metadc699990/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699990/ with just visually interesting pages shown.

Example Two

Thumbnails for ark:/67531/metadc699999/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699999/ with just visually interesting pages shown.

Example Three

Thumbnails for ark:/67531/metadc699991/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699991/ with just visually interesting pages.

So in the end I was able to implement the code in Python with Pillow and a fancy little lambda function.  The speed was much improved as well.  Those same images that were taking up to 23 seconds to process with the ImageMagick version of the workflow took a little over a second with this Python version.

The full script I was using for these tests is below. You will need to download and install Pillow in order to get it to work.
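
A minimal sketch along those lines, wrapping the Pillow workflow and the 200-pixel run check in a couple of functions with placeholder names and a simple command line interface of my own choosing, looks something like this:

import sys

from PIL import Image, ImageEnhance, ImageFilter, ImageOps

RUN_LENGTH = 200   # consecutive black pixels that mark a page as "interesting"


def flatten_page(filename):
    # Reduce a page image to a 1x1500 black-and-white strip using the
    # grayscale / contrast / resize / sharpen / threshold steps above.
    im = Image.open(filename)
    g_im = ImageOps.grayscale(im)
    e_im = ImageEnhance.Contrast(g_im).enhance(100)
    t_im = e_im.resize((1, 1500), resample=Image.BICUBIC)
    st_im = t_im.filter(ImageFilter.SHARPEN)
    it_im = ImageOps.invert(st_im)
    fixed_it_im = it_im.point(lambda x: 0 if x < 1 else 255, 'L')
    return ImageOps.invert(fixed_it_im)


def is_interesting(filename, run_length=RUN_LENGTH):
    # A page is "interesting" if its strip has a long solid run of black pixels.
    run = 0
    for value in flatten_page(filename).getdata():
        if value == 0:
            run += 1
            if run >= run_length:
                return True
        else:
            run = 0
    return False


if __name__ == "__main__":
    # Usage: python interesting_pages.py page_01.jpg page_02.jpg ...
    for page in sys.argv[1:]:
        print(page, is_interesting(page))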

I would love to hear other ideas or methods for doing this kind of work. If you have thoughts or suggestions, or if I missed something, please let me know via Twitter.

Finding collections that have the “first of the month” problem.

A few weeks ago I wrote a blog post exploring the dates in the UNT Libraries’ Digital Collections. I got a few responses to that post on Twitter, with one person stating that the number of Jan 1 publication dates seemed high and that maybe there was something a little fishy there.  After looking at the numbers a little more, I think they were absolutely right: there was something fishy going on.

First day of the month problem.

I worked up the graphic above to try to highlight what the problem is.  You can see that there are very large spikes on the first day of each month (you may have to look closely at the image to see the dates) and a large spike on the last day of the year, December 31.

I created this graphic by taking the 967,257 dates that I classified as “Day” in the previous post and stripping the year.  I then counted the number of times each month-day combination occurred, like 01-01 for Jan 1 or 03-04 for March 4, and plotted those counts on the graph.
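
The counting itself is straightforward; here is a small sketch, assuming the “Day” values are already sitting in a list of yyyy-mm-dd strings (the variable names are mine):

from collections import Counter

# day_dates is assumed to be the list of yyyy-mm-dd strings classified as "Day"
month_day_counts = Counter(date[5:] for date in day_dates)

# month_day_counts["01-01"] is the number of items dated January 1 of any year
for month_day, count in sorted(month_day_counts.items()):
    print(month_day, count)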

Problem Identified, Now What?

So after I looked at that graph, I got sad… so many dates that might be wrong and would need to be changed.  I guess part of the process of fixing metadata is knowing whether there is something to fix.  The next thing I wanted to do was figure out which collections had a case of the “first day of the month” problem and which collections didn’t.

I decided to apply my horribly limited knowledge of statistics and my highly developed skills with Google to come up with some way of identifying these collections programmatically. We currently have 770 different collections in the UNT Libraries’ Digital Collections and I didn’t want to go about this by hand.

So my thought was that if I were to calculate a linear regression for each month of data, I would be able to use the slope of the regression to identify collections that might have issues.  Once again I grouped the data by month across all years, so if we had a 100-year run of newspapers, all of the issues published in January would be grouped together, as would those for April and December.  This left me with twelve slope values per collection.  Some of the slopes were negative numbers and some were positive, so I decided to take the average of the absolute values of these slopes as my first metric.
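
Here is a sketch of that calculation with numpy, assuming the per-day counts have already been grouped into a dictionary keyed by month (the data structure and function name are mine):

import numpy as np

def average_absolute_slope(day_counts_by_month):
    # day_counts_by_month maps month number (1-12) to a list of per-day counts,
    # e.g. {1: [2249, 9, 3, ...], 2: [...], ...}
    slopes = []
    for month, counts in day_counts_by_month.items():
        days = np.arange(1, len(counts) + 1)
        slope, intercept = np.polyfit(days, counts, 1)   # fit count = slope * day + intercept
        slopes.append(abs(slope))
    return sum(slopes) / len(slopes)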

Here are the top ten collections and their absolute slope average.

Collection Name Collection Code Avg. Abs Slope
Office of Scientific & Technical Information Technical Reports OSTI 19.03
Technical Report Archive and Image Library TRAIL 4.47
National Advisory Committee for Aeronautics Collection NACA 4.45
Oklahoma Publishing Company Photography Collection OKPCP 4.03
Texas Digital Newspaper Program TDNP 2.76
Defense Base Closure and Realignment Commission BRAC 1.25
United States Census Map Collection USCMC 1.06
Abilene Library Consortium ABCM 0.99
Government Accountability Office Reports GAORT 0.95
John F. Kennedy Memorial Collection JFKAM 0.78

Obviously the OSTI collection has the highest Average Absolute Slope metric at 19.03. Next comes TRAIL at 4.47 and NACA at 4.45.  It should be noted that the NACA collection is a subset of the TRAIL collection so there is some influence in the numbers from NACA onto the TRAIL collection.  Then we have the OKPCP collection at 4.03.

In looking at the top six collections listed in the table above, I can easily see how they could run into this “first of the month” problem.  OSTI, NACA, and BRAC were all created from documents harvested from federal websites.  I can imagine that when metadata was being entered, the tools being used may have required a full date in the format of mm/dd/yy or yyyy-mm-dd; if the month is the only thing designated on the report, you would mark it as the first of that month so that the date would validate.

The OKPCP and TDNP collections have similar reasons as to why they would have an issue.

I used matplotlib to plot the monthly slopes to a graphic so that I could see what was going on. Here is a graphic for the OSTI collection.

OSTI - Monthly Slopes

In contrast to the OSTI Monthly Slopes graphic above, here is a graphic for the WLTC collection, which has an Average Absolute Slope of 0.000352 (much, much smaller than OSTI).

WLTC - Monthly Slopes

When looking at these you really have to pay attention to the scale of each subplot in order to see how steeply the OSTI monthly slopes are actually rising or falling.
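
For reference, here is a rough sketch of how those per-month plots could be generated with matplotlib, using the same day_counts_by_month structure assumed above (the function name is mine):

import matplotlib.pyplot as plt
import numpy as np

def plot_monthly_slopes(day_counts_by_month, title, outfile):
    # One subplot per month: the per-day counts plus the fitted regression line.
    fig, axes = plt.subplots(3, 4, figsize=(16, 9))
    fig.suptitle(title)
    for ax, month in zip(axes.flat, sorted(day_counts_by_month)):
        counts = day_counts_by_month[month]
        days = np.arange(1, len(counts) + 1)
        slope, intercept = np.polyfit(days, counts, 1)
        ax.plot(days, counts, "k.")
        ax.plot(days, slope * days + intercept)
        ax.set_title("Month %02d (slope %.2f)" % (month, slope))
    fig.savefig(outfile)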

Trying something a little different.

The previous work was helpful in identifying which of the collections had the biggest “first day of the month” problems.  I wasn’t too surprised by the results in the top ten.  I wanted to normalize the numbers a bit to see if I could tease out some collections with smaller numbers of items that might also have this problem but were getting overshadowed by the large OSTI collection (74,000+ items) or the TRAIL collection (18,000+ items).

I went about things in a similar fashion but this time I decided to work with the percentages for each day of a month instead of a raw count.

For the month of January in the NACA collection, here is what the difference between the two calculations looks like.

Day of the Month Item Count Percentage of Total
1 2,249 82%
2 9 0%
3 3 0%
4 12 0%
5 9 0%
6 9 0%
7 19 1%
8 9 0%
9 18 1%
10 25 1%
11 14 1%
12 20 1%
13 18 1%
14 21 1%
15 18 1%
16 28 1%
17 21 1%
18 18 1%
19 19 1%
20 24 1%
21 24 1%
22 22 1%
23 8 0%
24 20 1%
25 18 1%
26 8 0%
27 10 0%
28 24 1%
29 15 1%
30 16 1%
31 9 0%

Instead of the “Item Count” I would use the “Percentage of Total” for calculating the slope and generating the graphics. I was hopeful that I would be able to uncover some different collections this way.
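
The conversion itself is simple; a small sketch (the function name is mine):

def counts_to_percentages(counts):
    # Convert a month's per-day item counts into percentages of that month's
    # total so that small and large collections can be compared directly.
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [100.0 * count / total for count in counts]

# For the NACA January counts above, the first value comes out at roughly 82,
# matching the "Percentage of Total" column in the table.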

Below is the table of the top ten collections and their Average Absolute Slope based on the percent of items for a given month.

Collection Name Collection Code Avg. Abs Slope of %
Age Index AGE 0.63
Fraternity FRAT 0.63
The Indian Advocate (Sacred Heart, OK) INDIAN 0.63
Southwest Chinese Journal SWCJ 0.58
Benson Latin American Collection BLA 0.39
National Advisory Committee for Aeronautics Collection NACA 0.38
Technical Report Archive and Image Library TRAIL 0.36
Norman Dietel Photograph Collection NDLPC 0.34
Boone Collection Bank Notes BCBN 0.33
Office of Scientific & Technical Information Technical Reports OSTI 0.32

If you compare this to the first table in the post you will see that there are some new collections present. The first four of these are actually newspaper collections, and several of them consist of issues that were published on a monthly basis but were notated during digitization as being published on the first of the month because of some of the data structures that were in place for the digitization process. So we’ve identified more collections that have the “first day of the month” problem.

FRAT - Percentage Range

You can see that there is a consistent slope from the upper left to the lower right in each of the months of the FRAT collection.  For me this signifies a collection that may be suffering from the “first day of the month” problem.  A nice thing about using the percentages instead of the counts directly is that we are able to find collections that are much smaller in terms of numbers; for example, the FRAT collection has only 22 records.  If we just used the counts directly, these might get lost because they would have a smaller slope than that of OSTI, which has many, many more records.

For good measure, here is the plot of the OSTI records so you can see how it differs from the count-based plots.

OSTI - Percentage Range

You can see that it retains the overall shape of the slopes but doesn’t clobber the smaller collections when you try to find collections that have issues.

Closing

I fully expect that I misused some of the math in this work or missed other obvious ways to accomplish a similar result.  If I did, do get in touch and let me know.

I think this is a good start on a set of methods for identifying collections in the UNT Libraries’ Digital Collections that suffer from the “first day of the month” problem; once they are identified, it is just a matter of time and effort to get these dates corrected in the metadata.

I hope you enjoyed my musings here. If you have thoughts or suggestions, or if I missed something, please let me know via Twitter.