LC Name Authority File Analysis: Where are the Commas?

This is the second in a series of blog posts analyzing the Name Authority File dataset from the Library of Congress. If you are interested in the setup of this work and a bit more background, take a look at the previous post.

The goal of this work is to better understand how personal and corporate names are formatted so that I can hopefully train a classifier to automatically sort a new name into one of the two categories.

In the last post we saw that commas seem to be important in differentiating between corporate and personal names.  Here is a graphic from the previous post.

Distribution of Commas in Name Strings

You can see that the vast majority of personal names (99%) contain a comma, while a much smaller share of corporate names (14%) do.

The next thing I was curious about is whether the placement of the comma in the name string reveals anything about the kind of name it is.

How Many?

The first thing to look at is just counting the number of commas per name string.  My initial thought is that there are going to be more commas in the Corporate Names than in the Personal Names.  Let’s take a look.

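| Name Type | Total Name Strings | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|---|
| Personal | 6,362,262 | 6,280,219 | 1 | 1 | 1 | 2 | 8 | 1.309 | 0.471 |
| Corporate | 1,499,459 | 213,580 | 1 | 1 | 1 | 1 | 11 | 1.123 | 0.389 |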
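A minimal sketch of how per-string comma counts like the ones in the table above could be computed. The function names and the three-item sample list are hypothetical and only for illustration; the post does not show the code used for the analysis, which ran over the full dataset.

```python
from statistics import mean, median


def comma_count(name: str) -> int:
    """Return how many commas appear in a name string."""
    return name.count(",")


def summarize_comma_counts(names):
    """Summarize comma counts for only those names that contain a comma."""
    counts = [comma_count(n) for n in names if "," in n]
    if not counts:
        return None
    return {
        "names_with_comma": len(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "mean": mean(counts),
    }


# Hypothetical sample, not the real dataset.
sample = [
    "Phillips, Mark Edward",
    "Worldwide Documentaries, Inc.",
    "International Monetary Fund",
]
summary = summarize_comma_counts(sample)
```

Names without a comma are filtered out before the summary, which matches the table's minimum of 1 for both name types.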

The overall statistics for the number of commas in the name strings indicate that, among names that contain a comma, Personal Names have slightly more commas on average than Corporate Names (a mean of 1.309 versus 1.123). The Corporate Name with the most commas, in this case eleven, is “International Monetary Fund. Office of the Executive Director for Antigua and Barbuda, the Bahamas, Barbados, Belize, Canada, Dominica, Granada, Ireland, Jamaica, St. Kitts and Nevis, St. Lucia, and St. Vincent and the Grenadines”; you can view the name record here.

The Personal Name with the most commas has eight of them. It is the name string “Seu constante leitor, hum homem nem alto, nem baixo, nem gordo, nem magro, nem corcunda, nem ultra-liberal, que assistio no Beco do Proposito, e mora hoje no Cosme-Velho”, and you can view the name record here.

I can figure out the Corporate Name but needed a little help with the Personal Name, so Google Translate to the rescue. From what I can tell, that translates to “His constant reader, a man neither tall, nor short, nor fat, nor thin, nor hunchback nor ultra-liberal, who attended in the Alley of the Purpose, and lives today in Cosme-Velho”, which I think is a pretty cool sounding Personal Name.

I was surprised when I made a histogram of the values and saw that it was actually pretty common for Personal Names to have more than one comma.   Very common actually.

Number of Commas in Personal Names

And while the Corporate Names include the string with the most commas overall, you are generally only going to see one comma per string.

Number of Commas in Corporate Names

Which Half?

The next thing that I wanted to look at is the placement of the first comma in the name string.

The numbers below are the stats for just the name strings that contain a comma. Each value is the position of the first comma as a percentage of the total number of characters in the name string.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1.9% | 26.7% | 36.4% | 46.7% | 95.7% | 37.3% | 13.8% |
| Corporate | 213,580 | 2.2% | 60.5% | 76.9% | 83.3% | 99.0% | 69.6% | 19.3% |
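The relative placement described above can be sketched as follows. This assumes the position is the zero-based index of the first comma divided by the string length; the post does not say whether the original calculation was zero-based or one-based, so treat that detail (and the function name) as an assumption.

```python
def first_comma_position_pct(name: str):
    """Position of the first comma as a percentage of the string length.

    Returns None when the name has no comma.
    Assumption: the comma position is a zero-based character index.
    """
    idx = name.find(",")
    if idx == -1:
        return None
    return 100.0 * idx / len(name)


personal_pct = first_comma_position_pct("Phillips, Mark Edward")
corporate_pct = first_comma_position_pct("Worldwide Documentaries, Inc.")
```

Under that assumption, “Phillips, Mark Edward” has its comma at index 8 of 21 characters (about 38%), and “Worldwide Documentaries, Inc.” has its comma at index 23 of 29 characters (about 79%), so the two examples fall on opposite sides of the halfway mark.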

If we look at these as graphics we can see some trends a bit better.  Here is a histogram of the placement of the first comma in the Personal Name strings.

Comma Percentage Placement for Personal Name

It shows that the bulk of the names with a comma have that comma occurring in the first half (under 50%) of the string.

This looks a bit different with the Corporate Names as you can see below.

Comma Percentage Placement for Corporate Name

You will see that the placement of that first comma trends very strongly to the right side of the graph, definitely over 50%.

Let’s be Absolute

Next up I wanted to take a look at the absolute distance from the first comma to the first space character in the name string.

My thought is that a Personal Name is going to have an overall lower absolute distance than the Corporate Names.  Two examples will hopefully help you see why.

For a Personal Name string like “Phillips, Mark Edward” the absolute distance from the first comma to the first space is going to be one.

For a Corporate Name string like “Worldwide Documentaries, Inc.” the absolute distance from the first comma to the first space is fourteen.
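The two examples above can be checked with a short sketch. The function name is hypothetical, and returning None for strings that lack a comma or a space is an assumption about how such cases would be handled.

```python
def comma_space_distance(name: str):
    """Absolute character distance between the first comma and the first space.

    Returns None if the string has no comma or no space (an assumed convention).
    """
    comma = name.find(",")
    space = name.find(" ")
    if comma == -1 or space == -1:
        return None
    return abs(comma - space)


personal_distance = comma_space_distance("Phillips, Mark Edward")
corporate_distance = comma_space_distance("Worldwide Documentaries, Inc.")
```

In the personal name the space directly follows the comma, giving a distance of 1. In the corporate name the first space comes after “Worldwide” (index 9) and the comma comes after “Documentaries” (index 23), giving 14.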

I’ll jump right to the graphs here.  First is the histogram of the Personal Name strings.

Personal Name: Absolute Distance Between First Space and First Comma

You can see that the vast majority of the name strings have an absolute distance from the first comma to the first space of 1 (that’s the value for the really tall bar).

If you compare this to the Corporate Name strings in the graph below you will see some differences.

Corporate Name: Absolute Distance Between First Space and First Comma

Compared to the Personal Names, the Corporate Name graph has quite a bit more variety in the values.  Most of the values are higher than one.

If you are interested in the data tables they can provide some additional information.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1 | 1 | 1 | 1 | 131 | 1.4 | 1.8 |
| Corporate | 213,580 | 1 | 18 | 27 | 37 | 270 | 28.9 | 17.4 |

Absolute Tokens

This next section is very similar to the previous one, but this time I am interested in the placement of the first comma in relation to the first token in the string. I have a feeling that it will be similar to what we saw above for the absolute first-space distance, but it should normalize the data a bit because we are dealing with tokens instead of characters.

| Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std |
|---|---|---|---|---|---|---|---|---|
| Personal | 6,280,219 | 1 | 1 | 1 | 1 | 17 | 1.1 | 0.3 |
| Corporate | 213,580 | 1 | 3 | 4 | 6 | 35 | 4.8 | 2.4 |
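The post does not spell out how the token distance was calculated, so here is one plausible reading as a sketch: split the string on whitespace and report the one-based position of the token that contains the first comma. Both the tokenization and the function name are assumptions, chosen because they give a minimum of 1, as in the table.

```python
def first_comma_token_index(name: str):
    """One-based index of the whitespace-delimited token holding the first comma.

    Returns None when no token contains a comma.
    Assumption: tokens are produced by splitting on whitespace.
    """
    for i, token in enumerate(name.split(), start=1):
        if "," in token:
            return i
    return None


personal_tokens = first_comma_token_index("Phillips, Mark Edward")
corporate_tokens = first_comma_token_index("Worldwide Documentaries, Inc.")
```

With this definition “Phillips, Mark Edward” gives 1, since the comma is attached to the first token, and “Worldwide Documentaries, Inc.” gives 2.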

And now to round things out with graphs of both of the datasets for the absolute distance from first comma to first token.

Personal Name: Absolute Distance Between First Token and First Comma

Just as we saw in the section above, the Personal Name strings have commas that are placed right next to the first token in the string.

Corporate Name: Absolute Distance Between First Token and First Comma

The Corporate Names are a bit more distributed away from the first token.

Conclusion

Here are some observations now that I’ve spent a little more time with the LC Name Authority File while working on this post and the previous one.

First, it appears that the presence of a comma in a name string is a very good indicator that it is going to be a Personal Name. Second, if the first comma occurs in the first half of the name string it is most likely going to be a Personal Name, and if it occurs in the second half of the string it is most likely going to be a Corporate Name. Finally, the absolute distance from the first comma to either the first space or the first token is a good indicator of whether the string is a Personal Name or a Corporate Name.

If you have questions or comments about this post, please let me know via Twitter.
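To make the first two observations concrete, here is a rough rule-of-thumb sketch. It is not a trained classifier and not code from this analysis; the function name, the 50% cutoff, and the choice to label comma-free strings as corporate are all assumptions for illustration.

```python
def guess_name_type(name: str) -> str:
    """Guess 'personal' or 'corporate' from the first comma's placement.

    Assumed rules: no comma means corporate; a first comma in the first
    half of the string means personal; otherwise corporate.
    """
    idx = name.find(",")
    if idx == -1:
        return "corporate"
    return "personal" if idx / len(name) < 0.5 else "corporate"


guesses = {
    name: guess_name_type(name)
    for name in (
        "Phillips, Mark Edward",
        "Worldwide Documentaries, Inc.",
        "International Monetary Fund",
    )
}
```

A rule like this will mislabel the roughly 1% of Personal Names that have no comma, along with any name whose first comma falls on the unexpected side of the midpoint, so it is better read as a baseline than as a finished approach.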