How we assign unique identifiers

The UNT Libraries have used the ARK identifier specification for a number of years, and these identifiers appear throughout our infrastructure on a number of levels.  This post gives a little background about where, when, and why we assign our ARK identifiers, and a little about how.

Terminology

The first thing we need to do is get some terminology out of the way so that we can talk about the parts consistently.  This is taken from the ARK documentation:

  http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
  \________________/ \__/ \___/ \______/ \____________/
     (replaceable)     |     |      |       Qualifier
          |       ARK Label  |      |    (NMA-supported)
          |                  |      |
Name Mapping Authority       |    Name (NAA-assigned)
         (NMA)               |
                  Name Assigning Authority Number (NAAN)

The ARK syntax can be summarized as:

  [http://NMA/]ark:/NAAN/Name[Qualifier]

For the UNT Libraries we were assigned a Name Assigning Authority Number (NAAN) of 67531, so all of our identifiers start like this: ark:/67531/
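To make the parts concrete, here is a minimal sketch of a parser that splits an ARK into the pieces named in the diagram above.  The names `ARK_PATTERN`, `Ark` and `parse_ark` are made up for this illustration and are not part of any of our systems; the pattern also assumes the NMA is a plain `http://` or `https://` host and that the Qualifier starts at the first `/` or `.` after the Name.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper for illustration only.
ARK_PATTERN = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?"   # optional, replaceable NMA
    r"ark:/(?P<naan>[^/]+)"             # ARK label followed by the NAAN
    r"/(?P<name>[^/.]+)"                # NAA-assigned Name
    r"(?P<qualifier>[/.].*)?$"          # optional NMA-supported Qualifier
)


class Ark(NamedTuple):
    nma: Optional[str]
    naan: str
    name: str
    qualifier: str


def parse_ark(value: str) -> Ark:
    """Split an ARK, with or without its NMA, into its named parts."""
    match = ARK_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not an ARK: {value!r}")
    return Ark(
        nma=match.group("nma"),
        naan=match.group("naan"),
        name=match.group("name"),
        qualifier=match.group("qualifier") or "",
    )


example = parse_ark("http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff")
ours = parse_ark("ark:/67531/metapth123")
```

Here `example.name` is the Name from the documentation example, and `ours.naan` is our NAAN; the NMA is simply absent when the ARK is written without a hostname.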

We mint Names for our ARKs locally with a home-grown system called a “Number Server”.  This Python Web service receives a request for a new number, assigns that number a prefix based on which instance we pull from, and returns the new Name.

Namespaces

We have four different namespaces that we use for minting identifiers: metapth, metadc, metarkv, and coda.  Additionally we have a metatest namespace, which we use when we need to test things out, but it isn’t used that often.  Finally we have a historic namespace, metacrs, that is no longer used.  Here is the breakdown of how we use these namespaces.

We try to assign all items that end up on The Portal to Texas History Names from the metapth namespace whenever possible.  We assign all other public-facing digital objects Names from the metadc namespace.  This means that the UNT Digital Library and The Gateway to Oklahoma History both share Names from the metadc namespace.  The metarkv namespace is used for “archive only” objects that go directly into our archival repository system; these include large Web archiving datasets.  The coda namespace is used within our archival repository, called Coda.  As was stated earlier, the metatest namespace is only used for testing, and these items are thrown away after processing.

Name assignment

We assign Names in our systems programmatically, always as part of our digital item ingest process.  We tend to process items in batches: most often several hundred items at a time, and sometimes several thousand.  Items are processed in parallel, so there is no logical order to how the Names are assigned to objects.  They are in the order that they were processed but may have no logical order past that.

We also don’t assume that our Names are continuous.  If you have the identifiers metapth123 and metapth125, we don’t assume that there is an item metapth124.  Sure, it may be there, but it also may never have been assigned.  When we first started with these systems we would get worked up if we assigned several hundred or a few thousand identifiers and then had to delete those items.  Now this isn’t an issue at all, but that took some time to get over.

Another assumption that can’t be made in our systems is that related items get adjacent Names.  If Newspaper Vol. 1 Issue 2 has an identifier of metapth333, there is no guarantee that Newspaper Vol. 1 Issue 3 will have metapth334.  It might, but it isn’t guaranteed.  Items can also be shared between systems, and membership in the Portal, the UNT Digital Library, or the Gateway is noted in the descriptive metadata.  Therefore you can’t say that all metapth* identifiers are in the Portal or that no metadc* identifiers are; you have to look them up based on the metadata.

Once a number is assigned it is never assigned again.  This sounds like a silly thing to say, but it is important to remember: we don’t try to save identifiers or reuse them as if we will run out of them.
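The rules in this section can be sketched in a few lines of Python.  This is not the code of our Number Server, just an in-memory stand-in under some assumptions: the class name `NumberServer`, the helper `to_ark`, and the use of a lock are all made up for the example, and a real service would also need to persist its counter so that numbers are never handed out twice after a restart.

```python
import threading

NAAN = "67531"  # our NAAN, as described above


class NumberServer:
    """Hypothetical stand-in for one Number Server instance (one namespace)."""

    def __init__(self, prefix: str, start: int = 1) -> None:
        self.prefix = prefix
        self._next = start
        self._lock = threading.Lock()

    def mint(self) -> str:
        # The lock keeps parallel ingest workers from receiving the same number.
        with self._lock:
            number = self._next
            self._next += 1
        return f"{self.prefix}{number}"

    # There is deliberately no "release" method: a number that was minted
    # for an item that is later deleted is never handed out again, which is
    # why gaps in the sequence are normal.


def to_ark(name: str) -> str:
    """Attach the ARK label and NAAN to a freshly minted Name."""
    return f"ark:/{NAAN}/{name}"


test_server = NumberServer("metatest")
first = test_server.mint()
second = test_server.mint()
first_ark = to_ark(first)
```

Each namespace would get its own instance with its own prefix, so the prefix of a Name only says which instance minted it and nothing about the item itself.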

Level of assignment

We currently assign an ARK identifier at the level of the intellectual object.  So for example, a newspaper issue gets an ARK, a photograph gets an ARK, and a book, a map, a report, an audio recording, or a video recording each gets an ARK.  The sub-parts of an item are not given further unique identifiers, because we tend to interface with them through formatted URLs such as those described here, or through other URL-based patterns such as the URLs we use to retrieve items from Coda:

http://coda.library.unt.edu/bag/ark:/67531/codanaf8/manifest-md5.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/coda_directives.py
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bagit.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/bag-info.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/0=untl_aip_1.0
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/queries.xlsx
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/data/01_data/README.txt
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/ba3ce7a1-0e3b-44cb-8b41-5d9d1b0438fe.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadata/7fe68777-54a2-4c71-95b2-aa33204ae84b.jhove.xml
http://coda.library.unt.edu/bag/ark:/67531/codanaf8/data/metadc498968.aip.mets.xml
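Since the sub-parts are addressed by URL pattern rather than by their own identifiers, building one of these URLs is just string assembly.  The sketch below infers the pattern only from the example URLs listed above; the function name `coda_file_url` and the constant `CODA_BAG_BASE` are made up for the illustration and are not an official Coda API.

```python
from urllib.parse import quote

# Assumption: base URL and layout are inferred from the example URLs only.
CODA_BAG_BASE = "http://coda.library.unt.edu/bag"
NAAN = "67531"


def coda_file_url(name: str, path_in_bag: str) -> str:
    """Build the URL for one file inside the bag identified by a coda Name."""
    # "/" and "=" are left unescaped so that paths such as "0=untl_aip_1.0"
    # come out the same way they appear in the examples.
    return f"{CODA_BAG_BASE}/ark:/{NAAN}/{name}/{quote(path_in_bag, safe='/=')}"


metadata_url = coda_file_url("codanaf8", "data/metadata.xml")
marker_url = coda_file_url("codanaf8", "0=untl_aip_1.0")
```

The ARK identifies the bag as a whole, and everything after the Name is just a path inside that bag.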

Lessons Learned

Things I would do again.

  • I would most likely use just an incrementing counter for assigning identifiers.  Name minters such as Noid are also an option but I like the numbers with a short prefix.
  • I would not use a prefix such as UNT, to stay away from branding as much as possible.  Even metapth is way too branded (see below).

Things I would change in our implementation.

  • I would only have one namespace for non-archival items.  Two namespaces for production data just invite someone to screw up (usually me) and then suddenly the reason for having one namespace over the other is meaningless.  Just manage one namespace and move on.
  • I would not have a six or seven character prefix.  metapth and metadc came as baggage from our first system; we decided that the 30k identifiers we had already minted had set our path.  Now, after 1,077,975 identifiers in those namespaces, it seems a little silly that the first 3% of our items would still have such an effect on us today.
  • I would not brand our namespaces so closely to our system names, as with metapth, metadc, and the legacy metacrs.  People read too much into the naming convention.  Avoiding that is a big reason for opaque Names in the first place, and is pretty important.

Things I might change in a future implementation.

  • I would probably pad my identifiers out to eight digits.  While you can’t rely on the ARKs to be generated in a given order, once they are assigned it is helpful to be able to sort by them and have a consistent order.  metapth1, metapth100, metapth100000 don’t always sort nicely like metapth00000001, metapth00000100, metapth00100000 do.  But then again, longer runs of zeros are harder to transcribe, and I had a tough time just writing this example.  Maybe I wouldn’t do this.
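To show what the sorting problem looks like, here is a small sketch.  The helper names `pad_name` and `numeric_key` are made up for the example, the sample Names are arbitrary, and the pattern assumes Names shaped like metapth123 (lowercase prefix followed by digits), so it would not fit every Name we mint.

```python
import re

# Assumption: Names look like a lowercase prefix followed by digits.
NAME_PATTERN = re.compile(r"([a-z]+)([0-9]+)")


def split_name(name: str) -> tuple:
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected Name format: {name!r}")
    prefix, number = match.groups()
    return prefix, int(number)


def pad_name(name: str, width: int = 8) -> str:
    """Zero-pad the numeric part of a Name to a fixed width."""
    prefix, number = split_name(name)
    return f"{prefix}{number:0{width}d}"


def numeric_key(name: str) -> tuple:
    """Sort key that orders unpadded Names by prefix, then by number."""
    return split_name(name)


names = ["metapth2", "metapth100", "metapth10"]

plain_order = sorted(names)                       # plain string sort
numeric_order = sorted(names, key=numeric_key)    # sort without padding
padded_order = sorted(pad_name(n) for n in names) # sort after padding
```

A plain string sort puts metapth10 and metapth100 before metapth2, while the padded Names sort correctly with no special handling.  The `numeric_key` approach is the alternative: keep the short Names and pay for it with a custom sort key wherever order matters.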

I don’t think any of this post applies only to ARK identifiers, as most identifier schemes at some level require a decision about how you are going to mint unique names for things.  So hopefully this is useful to others.

If you have any specific questions for me, let me know on Twitter.