Monthly Archives: January 2017

UNT Libraries’ Digital Collections 2016 in Review: Items

This post is just an overview of the 2016 year for the UNT Libraries’ Digital Collections.  I have wanted to do one of these for a number of years now but never really got around to it.  So here we go.

I plan to look at three areas of activity for the digital collections: content added, usage, and metadata curation activities.  This first post will focus on items added.

Items added

From January 1, 2016 until December 31, 2016 we added a total of 295,077 new items to the UNT Libraries’ Digital Collections.  The UNT Libraries’ Digital Collections encompasses The Portal to Texas History, the UNT Digital Library, and the Gateway to Oklahoma History.  The graphic below shows the number of records added to each of the systems throughout the year.

Items Added by System

The Portal to Texas History (PTH in the chart) had the most items added at 145,268 new items.  This was followed by the UNT Digital Library (DC in the chart) with 124,402 items and finally the Gateway to Oklahoma History (OK in the chart) with 25,809 new items.

If you look at files (often ‘pages’) instead of items the graph will change a bit.

New Pages by System

While we added the most items to The Portal to Texas History, we added the most pages of content to the UNT Digital Library.  In total we added 5,704,046 files to the Digital Collections in 2016.

Added by Date

The number of items added per month is a good way of getting an overview of activity across the year.  The graphic below presents that data.

New Items By Month

The average number of items added per month is 24,590, which is a very respectable number. When you look at the number of items added on a given day during the year, the graph is a bit harder to read, but you can see some days that had quite a bit of data loading going on.

New Items Added Per Day

As you can see, it is a bit harder to tell what is going on.  Some days of note include May 19th, which had 19,858 items processed and uploaded, March 19th with 16,649, and January 13th with 13,338 new items added.  There are at least six other days with over 10,000 items processed and added to the digital collections.

If you take the number of items and spread them across the entire year you will get an average of 808 items loaded into the system per day.  Not bad at all. There were actually 165 days during 2016 on which no items were added to the Digital Collections, which leaves an impressive 200 days on which new content was being processed and loaded. When you remove weekends you are left with content being added almost four days a week.

Another fun number to think about follows from that average of 808 items per day during 2016.  That’s roughly 33.7 items added per hour around the clock, or just about one item created and added every 107 seconds.
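The rates quoted in this section come from simple division, which the short sketch below reproduces. The item total and the 200 active days are taken from the post; dividing by 365 days is an assumption (it is the divisor that yields the 808 per day figure), and the variable names are only illustrative.

```python
# Figures from the post; DAYS_IN_YEAR = 365 is an assumption that
# reproduces the quoted average of 808 items per day.
ITEMS_ADDED = 295_077
DAYS_IN_YEAR = 365
ACTIVE_DAYS = 200  # days on which at least one item was added

per_month = ITEMS_ADDED / 12
per_day = ITEMS_ADDED / DAYS_IN_YEAR
per_hour = per_day / 24
seconds_per_item = 3600 / per_hour
active_days_per_week = ACTIVE_DAYS / 52

print(f"items per month:      {per_month:,.0f}")
print(f"items per day:        {per_day:,.0f}")
print(f"items per hour:       {per_hour:.1f}")
print(f"seconds per item:     {seconds_per_item:.0f}")
print(f"active days per week: {active_days_per_week:.2f}")
```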

Items by Type

Next up is to take a look at what kind of items were added throughout the year.  I’m going to base these numbers off of the resource type field for each of the records.  If for some reason the item doesn’t have a resource type set then it will have a value of None.

| Resource Type | Item Count | % of Total |
|---|---|---|
| text_newspaper | 124,662 | 42.25% |
| text_report | 56,279 | 19.07% |
| image_photo | 42,203 | 14.30% |
| text_article | 31,129 | 10.55% |
| video | 12,238 | 4.15% |
| text_script | 7,230 | 2.45% |
| sound | 4,956 | 1.68% |
| image_drawing | 4,097 | 1.39% |
| text_etd | 2,763 | 0.94% |
| text | 2,365 | 0.80% |
| text_leg | 1,433 | 0.49% |
| image_postcard | 1,193 | 0.40% |
| text_journal | 886 | 0.30% |
| text_book | 858 | 0.29% |
| text_pamphlet | 778 | 0.26% |
| text_letter | 541 | 0.18% |
| None | 523 | 0.18% |
| text_clipping | 174 | 0.06% |
| physical-object | 144 | 0.05% |
| image_presentation | 125 | 0.04% |
| text_legal | 111 | 0.04% |
| text_review | 107 | 0.04% |
| image_poster | 89 | 0.03% |
| text_yearbook | 47 | 0.02% |
| text_paper | 37 | 0.01% |
| dataset | 29 | 0.01% |
| image_map | 22 | 0.01% |
| website | 11 | 0.00% |
| image | 11 | 0.00% |
| image_score | 11 | 0.00% |
| image_artwork | 8 | 0.00% |
| text_chapter | 7 | 0.00% |
| collection | 5 | 0.00% |
| text_poem | 3 | 0.00% |
| interactive-resource | 2 | 0.00% |

I’ve taken the ten most commonly added item types, which account for over 97% of items added to the system and made a little pie chart out of them below.
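As a quick check on that figure, the sketch below recomputes each type's share and the combined share of the ten most common types. The counts and the 295,077 total are copied from the table above; the variable names are only illustrative.

```python
# Counts for the ten most common resource types, copied from the table.
TOTAL_ITEMS = 295_077
top_ten = {
    "text_newspaper": 124_662,
    "text_report": 56_279,
    "image_photo": 42_203,
    "text_article": 31_129,
    "video": 12_238,
    "text_script": 7_230,
    "sound": 4_956,
    "image_drawing": 4_097,
    "text_etd": 2_763,
    "text": 2_365,
}

# Share of the yearly total for each of the ten types.
for resource_type, count in top_ten.items():
    print(f"{resource_type:<16}{count:>9,}{count / TOTAL_ITEMS:>9.2%}")

# Combined share of the ten types.
top_ten_total = sum(top_ten.values())
top_ten_share = top_ten_total / TOTAL_ITEMS
print(f"top ten combined: {top_ten_total:,} items ({top_ten_share:.2%})")
```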

Item by Type

As you can see, the Digital Collections added a large number of newspapers over the past year.  Newspapers accounted for 124,662, or 42%, of new items added to the system.  There were a large number of reports, photographs, and articles added as well.  Videos come in as the fifth most added type, with 12,238 new video items.

Items by Partner

Because we work with a number of partners here at UNT, across Texas, and into Oklahoma, every item we upload into the system is associated with one partner. Throughout the year we added items to 154 different partner collections in the UNT Libraries’ Digital Collections.  I’ve presented the ten partners that contributed the most content to the collections in 2016.

| Partner | Partner Code | Item Count | Item Percentage |
|---|---|---|---|
| UNT Libraries Government Documents Department | UNTGD | 90,393 | 30.63% |
| UNT Libraries’ Special Collections | UNTA | 32,263 | 10.93% |
| Oklahoma Historical Society | OKHS | 25,786 | 8.74% |
| Texas Historical Commission | THC | 25,222 | 8.55% |
| UNT Libraries | UNT | 15,319 | 5.19% |
| Cuero Public Library | CUERPU | 5,901 | 2.00% |
| Nellie Pederson Civic Library | CLIFNE | 5,881 | 1.99% |
| Coleman Public Library | CLMNPL | 5,729 | 1.94% |
| Gladys Johnson Ritchie Library | GJRL | 4,850 | 1.64% |
| Abilene Christian University Library | ACUL | 4,359 | 1.48% |

You can see that we had a strong year for the UNT Libraries’ Government Documents Department, which added over 90,000 items to the system.  We have been ramping up the digitization activities for the UNT Libraries’ Special Collections, and you can see the results with over 32,000 new items being added to the UNT Digital Library.

Closing

I think that’s just about it for the year overview of new content added to the UNT Libraries’ Digital Collections.  Next up I’m going to dig into some usage data that was collected from 2016 and see what that can tell us about last year.

I’m quite impressed with the amount of content that we added in 2016.  Adding 295,077 items to the Digital Collections brought us to 1,751,015 items and 26,326,187 files (pages) of content in the systems.  I’m looking forward to 2017 and what it has in store for us.  At the rate we added content in 2016 I have a strong feeling that we will be passing the 2 million item mark.

If you have questions or comments about this post,  please let me know via Twitter.

LC Name Authority File Analysis: Where are the Commas?

This is the second in a series of blog posts on some analysis of the Name Authority File dataset from the Library of Congress. If you are interested in the setup of this work and a bit more background, take a look at the previous post.

The goal of this work is to better understand how personal and corporate names are formatted so that I can hopefully train a classifier to automatically assign a new name to one of the two categories.

In the last post we saw that commas seem to be important in differentiating between corporate and personal names.  Here is a graphic from the previous post.

Distribution of Commas in Name Strings

You can see that the vast majority of personal names (99%) contain a comma, while a much smaller share of corporate names (14%) have a comma present.

The next thing that I was curious about is whether the placement of the comma in the name string reveals anything about the kind of name that it is.

How Many?

The first thing to look at is just counting the number of commas per name string.  My initial thought is that there are going to be more commas in the Corporate Names than in the Personal Names.  Let’s take a look.

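| Name Type | Total Name Strings | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|---|
| Personal | 6,362,262 | 6,280,219 | 1 | 1 | 1 | 2 | 8 | 1.309 | 0.471 |
| Corporate | 1,499,459 | 213,580 | 1 | 1 | 1 | 1 | 11 | 1.123 | 0.389 |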

The overall statistics for the number of commas in the name strings indicate that Personal Names tend to have more commas than Corporate Names.  The Corporate Name with the most commas, eleven in this case, is “International Monetary Fund. Office of the Executive Director for Antigua and Barbuda, the Bahamas, Barbados, Belize, Canada, Dominica, Granada, Ireland, Jamaica, St. Kitts and Nevis, St. Lucia, and St. Vincent and the Grenadines”; you can view the name record here.

The Personal Name with the most commas has eight of them.  The name string is “Seu constante leitor, hum homem nem alto, nem baixo, nem gordo, nem magro, nem corcunda, nem ultra-liberal, que assistio no Beco do Proposito, e mora hoje no Cosme-Velho”, and you can view the name record here.

I can figure out the Corporate Name but needed a little help with the Personal Name, so Google Translate to the rescue. From what I can tell, that translates to “His constant reader, a man neither tall, nor short, nor fat, nor thin, nor hunchback nor ultra-liberal, who attended in the Alley of the Purpose, and lives today in Cosme-Velho”, which I think is a pretty cool sounding Personal Name.
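Counting commas per name string is straightforward. The sketch below applies Python's built-in `str.count` to the two record-holding names quoted above; the helper name `count_commas` is hypothetical and not taken from the original analysis.

```python
def count_commas(name: str) -> int:
    """Return the number of comma characters in a name string."""
    return name.count(",")


# The two name strings with the most commas, as quoted in the post.
corporate = (
    "International Monetary Fund. Office of the Executive Director for "
    "Antigua and Barbuda, the Bahamas, Barbados, Belize, Canada, Dominica, "
    "Granada, Ireland, Jamaica, St. Kitts and Nevis, St. Lucia, and "
    "St. Vincent and the Grenadines"
)
personal = (
    "Seu constante leitor, hum homem nem alto, nem baixo, nem gordo, "
    "nem magro, nem corcunda, nem ultra-liberal, que assistio no Beco do "
    "Proposito, e mora hoje no Cosme-Velho"
)

print(count_commas(corporate))
print(count_commas(personal))
```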

I was surprised when I made a histogram of the values and saw that it was actually pretty common for Personal Names to have more than one comma.   Very common actually.

Number of Commas in Personal Names

And while the highest comma counts overall occur in Corporate Names, you are generally only going to see one comma per Corporate Name string.

Number of Commas in Corporate Names

Which Half?

The next thing that I wanted to look at is the placement of the first comma in the name string.

The numbers below represent the stats for just the name strings that contain a comma. Each value is the position of the first comma as a percentage of the overall number of characters in the name string.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1.9% | 26.7% | 36.4% | 46.7% | 95.7% | 37.3% | 13.8% |
| Corporate | 213,580 | 2.2% | 60.5% | 76.9% | 83.3% | 99.0% | 69.6% | 19.3% |
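A minimal sketch of how such a percentage could be computed for a single name string follows. It assumes a zero-based character index divided by the string length; the post does not say whether the original analysis counted from zero or one, and the function name is hypothetical.

```python
from typing import Optional


def first_comma_position_pct(name: str) -> Optional[float]:
    """Position of the first comma as a percentage of the string length.

    Assumption: zero-based index divided by total length.
    Returns None when the string contains no comma.
    """
    index = name.find(",")
    if index == -1:
        return None
    return 100 * index / len(name)


for example in ("Phillips, Mark Edward", "Worldwide Documentaries, Inc."):
    print(f"{example!r}: {first_comma_position_pct(example):.1f}%")
```

With this definition the personal name example falls in the first half of the string and the corporate name example falls in the second half.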

If we look at these as graphics we can see some trends a bit better.  Here is a histogram of the placement of the first comma in the Personal Name strings.

Comma Percentage Placement for Personal Name

It shows the bulk of the names with a comma have that comma occurring in the first half (50%) of the string.

This looks a bit different with the Corporate Names as you can see below.

Comma Percentage Placement for Corporate Name

You will see that the placement of that first comma trends very strongly to the right side of the graph, definitely over 50%.

Let’s be Absolute

Next up I wanted to take a look at the absolute distance from the first comma to the first space character in the name string.

My thought is that a Personal Name is going to have an overall lower absolute distance than the Corporate Names.  Two examples will hopefully help you see why.

For a Personal Name string like “Phillips, Mark Edward” the absolute distance from the first comma to the first space is going to be one.

For a Corporate Name string like “Worldwide Documentaries, Inc.” the absolute distance from the first comma to the first space is fourteen.
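The two examples above can be reproduced with a few lines of Python. This is a sketch rather than the code used for the analysis: the function name is hypothetical, and returning `None` when the comma or the space is missing is an assumption.

```python
from typing import Optional


def comma_space_distance(name: str) -> Optional[int]:
    """Absolute distance, in characters, between the first comma and the first space.

    Returns None if the string lacks a comma or a space (an assumption;
    the post only reports stats for strings that contain a comma).
    """
    comma = name.find(",")
    space = name.find(" ")
    if comma == -1 or space == -1:
        return None
    return abs(comma - space)


print(comma_space_distance("Phillips, Mark Edward"))          # 1
print(comma_space_distance("Worldwide Documentaries, Inc."))  # 14
```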

I’ll jump right to the graphs here.  First is the histogram of the Personal Name strings.

Personal Name: Absolute Distance Between First Space and First Comma

You can see that the vast majority of the name strings have an absolute distance from the first comma to the first space of 1 (that’s the value for the really tall bar).

If you compare this to the Corporate Name strings in the graph below you will see some differences.

Corporate Name: Absolute Distance Between First Space and First Comma

Compared to the Personal Names, the Corporate Name graph has quite a bit more variety in the values.  Most of the values are higher than one.

If you are interested in the numbers, the data table below provides some additional detail.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1 | 1 | 1 | 1 | 131 | 1.4 | 1.8 |
| Corporate | 213,580 | 1 | 18 | 27 | 37 | 270 | 28.9 | 17.4 |

Absolute Tokens

This next section is very similar to the previous one, but this time I am interested in the placement of the first comma in relation to the first token in the string.  I have a feeling that it will be similar to what we saw above for the absolute first space distance, but it should normalize the data a bit because we are dealing with tokens instead of characters.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1 | 1 | 1 | 1 | 17 | 1.1 | 0.3 |
| Corporate | 213,580 | 1 | 3 | 4 | 6 | 35 | 4.8 | 2.4 |
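The post does not spell out how the token distance was calculated, so the sketch below is one plausible reading: split the string on whitespace and report the one-based position of the token that carries the first comma. Both the tokenization and the function name are assumptions.

```python
from typing import Optional


def first_comma_token_position(name: str) -> Optional[int]:
    """One-based position of the whitespace-delimited token holding the first comma.

    Assumption: tokens are produced by str.split(), and the comma stays
    attached to the token it follows. Returns None if there is no comma.
    """
    for position, token in enumerate(name.split(), start=1):
        if "," in token:
            return position
    return None


print(first_comma_token_position("Phillips, Mark Edward"))          # 1
print(first_comma_token_position("Worldwide Documentaries, Inc."))  # 2
```

Under this reading a typical inverted personal name gives a value of one, which is consistent with the minimum, median, and 75th percentile reported for Personal Names.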

And now to round things out with graphs of both of the datasets for the absolute distance from first comma to first token.

Personal Name: Absolute Distance Between First Token and First Comma

Just as we saw in the section above the Personal Name strings will have commas that are placed right next to the first token in the string.

Corporate Name: Absolute Distance Between First Token and First Comma

The Corporate Names are a bit more distributed away from the first token.

Conclusion

Here are some observations now that I’ve spent a little more time with the LC Name Authority File while working on this post and the previous one.

First, it appears that the presence of a comma in a name string is a very good indicator that it is going to be a Personal Name.  Second, if the first comma occurs in the first half of the name string it is most likely going to be a Personal Name, and if it occurs in the second half of the string it is most likely to be a Corporate Name. Finally, the absolute distance from the first comma to either the first space or the first token is a good indicator of whether the string is a Personal Name or a Corporate Name.
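To make the first two observations concrete, here is a deliberately naive rule-of-thumb sketch. It is not the classifier this series is working toward, only a hypothetical baseline built on the comma observations; the function name and the 50% cutoff are assumptions.

```python
def guess_name_type(name: str) -> str:
    """Guess 'personal' or 'corporate' from the first comma alone.

    Hypothetical baseline: no comma -> corporate; first comma in the
    first half of the string -> personal; otherwise -> corporate.
    """
    index = name.find(",")
    if index == -1:
        return "corporate"
    return "personal" if index / len(name) < 0.5 else "corporate"


for example in (
    "Phillips, Mark Edward",
    "Worldwide Documentaries, Inc.",
    "Library of Congress",
):
    print(f"{example!r}: {guess_name_type(example)}")
```

A rule like this will misclassify the exceptions visible in the tables above, such as the small share of Personal Names without a comma, which is why a trained classifier using several of these features is the more promising route.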

If you have questions or comments about this post,  please let me know via Twitter.