One of the things that I keep coming back to in our digital library system is the set of states that an object can be in and how those states affect various aspects of our system. Hopefully this post can explain some of them and how they are currently implemented locally.
Hidden vs Non-Hidden
Our main distinction once an item is in our system is whether it is hidden or not.
Hidden means that it is not viewable by any of our users and that it is only available in our internal Edit system, where the metadata record and basic access to the item exist. If a request for this item comes in through our public-facing digital library interfaces, the user will receive a “404 Not Found” response from our system.
If a record is not hidden then it is viewable and discoverable in one of our digital library interfaces. If an end user tries to access this item there may be limitations based on the level of access, or any embargoes on the item that might be present.
In our metadata scheme, UNTL, we note whether an item is hidden in the following way. If there is a value of <meta qualifier="hidden">True</meta> then the item is considered hidden. If there is a value of <meta qualifier="hidden">False</meta> then the item is considered not hidden. If there is no element with a qualifier of hidden then the system defaults to False and the item is considered not hidden.
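To make the rule concrete, here is a minimal sketch of how the hidden check could be expressed in Python. It is not our production code: the function name `is_hidden` and the `<metadata>` wrapper element in the usage note are assumptions for illustration, and only the `<meta qualifier="hidden">` element comes from the description above.

```python
import xml.etree.ElementTree as ET


def is_hidden(record_xml: str) -> bool:
    """Return True only when the record says <meta qualifier="hidden">True</meta>.

    Hypothetical helper: a record with no hidden element is treated as
    not hidden, matching the default described in the post.
    """
    root = ET.fromstring(record_xml)
    for meta in root.iter("meta"):
        if meta.get("qualifier") == "hidden":
            # Compare case-insensitively so "True" and "true" both count.
            return (meta.text or "").strip().lower() == "true"
    # No hidden element at all: fall back to the default of not hidden.
    return False
```

Calling `is_hidden('<metadata><meta qualifier="hidden">True</meta></metadata>')` returns True, while a record with no hidden element at all returns False.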
This works pretty well for basic situations and with the assumption that nobody will ever make a mistake.
But… People make mistakes.
Deleted Items
The first issue we ran into when we started to scale up our systems is that from time to time we would accidentally load the same resource into the system twice. This happens for a variety of reasons. User error on the part of the ingest technician (me) is the major cause of this. The same item will also sometimes be sent through the digitization/processing queue more than once because of the amount of time that passes before some projects are completed. There are other situations where the same item will be digitized again because the first instance was poorly scanned, and instead of updating the existing record it is added a second time. For all of these situations we needed to have a way of suppressing these records.
Right now we add an element to the metadata record, <meta qualifier="recordStatus">deleted</meta>, which designates that this item has been suppressed in the system and that it should be effectively forgotten. On the technical side this triggers a delete from the Solr index, which holds our metadata indexes, and the item is then gone.
When a user requests an item that is deleted she will currently receive a “404 Not Found”, though we have an open ticket to change this behavior to return a “410 Gone” status code for these items. Another limitation of our current process of just deleting these from our Solr index is that we are not able to mark them as “deleted” in our OAI-PMH repositories, which isn’t ideal. Finally, by purging these items completely from the index we have no way of knowing how many have been suppressed/deleted, and no easy way of making the items visible again.
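The response rules described so far can be summarized in a small sketch. The function `public_status` and its `use_gone` switch are hypothetical names made up for this example; the switch stands in for the behavior proposed in the open ticket, which is not what the system does today.

```python
from http import HTTPStatus
from typing import Optional


def public_status(hidden: bool, record_status: Optional[str],
                  use_gone: bool = False) -> HTTPStatus:
    """Pick the status code a public interface would return for an item.

    Hypothetical sketch: `use_gone` models the proposed "410 Gone"
    behavior for deleted items; the default mirrors the current 404.
    """
    if record_status == "deleted":
        return HTTPStatus.GONE if use_gone else HTTPStatus.NOT_FOUND
    if hidden:
        # Hidden items are indistinguishable from missing ones to the public.
        return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK
```

Keeping the decision in one place like this would make the eventual switch from 404 to 410 a single change rather than something scattered across the public interfaces.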
These suppressed records are only deleted from the Solr index; all of their edit history and the records themselves are retained. In fact if you know that an item used to be in a non-suppressed state, and remember the ARK identifier, you can still access the full record, remove the recordStatus flag, and un-suppress the item. Assuming you remember the identifier.
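As a rough illustration of what removing the flag amounts to, here is a sketch that strips the recordStatus element from a record. The function name `unsuppress`, the `<metadata>` wrapper in the usage note, and the assumption that `meta` elements are direct children of the root and use the same `qualifier` attribute as the hidden flag are all mine; re-indexing the item in Solr afterwards is out of scope here.

```python
import xml.etree.ElementTree as ET


def unsuppress(record_xml: str) -> str:
    """Return the record with any recordStatus "deleted" element removed.

    Hypothetical helper: assumes meta elements sit directly under the
    root element of the record.
    """
    root = ET.fromstring(record_xml)
    # findall returns a list, so removing while looping over it is safe.
    for meta in root.findall("meta"):
        is_status = meta.get("qualifier") == "recordStatus"
        if is_status and (meta.text or "").strip() == "deleted":
            root.remove(meta)
    return ET.tostring(root, encoding="unicode")
```

Passing in `<metadata><title>A</title><meta qualifier="recordStatus">deleted</meta></metadata>` gives back a record that still has its title but no recordStatus element.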
What does hidden really mean?
So right now we have hidden and non-hidden, and deleted and non-deleted. The deleted items are effectively forgotten about, but what about those hidden items? What do they mean?
Here are some of the reasons that we have hidden records vs non-hidden records.
Metadata Missing
We have a workflow for our system that allows us to ingest stub records which have minimal descriptive metadata in place for items so that they can be edited in our online editing environment by metadata editors around the library, university, and state. These are loaded with minimal title information (usually just the institution’s unique identifier for the item), the partner and collection that the item belongs to, and any metadata that makes sense to set across a large set of records. Once in the editing system these items will have metadata created for them over time and be made available to the end user.
Hard Embargoes
While our system has built-in functionality for embargoing an item, this functionality will always make the descriptive metadata for the item available to the public. In our UNT Scholarly Works Repository, we work to make the contact information for the creators of the item known so that you can “request a copy” of the item if you discover it while it is still under an embargo. Here is an example item that won’t become available until later this year.
Sometimes this is not the desired way of presenting the embargoed items to the public. For example we work with a number of newspaper publishers around Texas who make their PDF print masters available to UNT for archiving and presentation via The Portal to Texas History. They do so with the agreement that we will not make their items available until one, two, or three years after publication. Instead of presenting the end user with an item they aren’t able to access in the Portal, we just keep these items hidden until they are ready to be made available. I have a feeling that this method will change in the near future because it becomes a large metadata management problem.
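Part of the management problem is that the release date lives outside the hidden flag. A small sketch of the date arithmetic, assuming an item becomes available on the anniversary of its publication date, might look like the following. The function names and the choice to roll a February 29 publication date forward to March 1 are assumptions for illustration, not how our system works.

```python
from datetime import date


def release_date(published: date, embargo_years: int) -> date:
    """Date on which a hard-embargoed item could be un-hidden (assumed rule)."""
    try:
        return published.replace(year=published.year + embargo_years)
    except ValueError:
        # Published on Feb 29 and the target year is not a leap year:
        # roll forward to Mar 1 (an arbitrary choice for this sketch).
        return date(published.year + embargo_years, 3, 1)


def should_stay_hidden(published: date, embargo_years: int, today: date) -> bool:
    """True while the embargo period has not yet elapsed."""
    return today < release_date(published, embargo_years)
```

With something like this, a nightly job could compare each hidden newspaper issue against today's date instead of relying on someone remembering to flip the flag.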
Finally there are items that we are either digitizing or capturing which we do not have the ability to provide access to because of current copyright restrictions. We have these items in a hidden state in the system until either an agreement can be reached with the rights holder, or until the item falls into the public domain.
Right now it is impossible for us to identify how many of these items are being held as “embargoed” by the use of a hidden item flag.
Copyright Challenge, or Personally Identifiable Information
We have another small set of items (less than a dozen… I think) that are hidden because there is an active copyright challenge we are working through for the item, or because the item contains personally identifiable information. Our first step in these situations is to mark the item as hidden until the situation can be resolved. If the situation with the item has been successfully resolved and access can be restored, the item is marked as un-hidden.
Others?
I’m sure there are other reasons that an item can be hidden within a system. I would be interested in hearing your reasons within your collections, especially if they are different from the ones listed above. I’m blissfully unaware of any controlled vocabularies for these kinds of states that a record might be in within digital library systems, so if there is prior work in this area I’d love to hear about it.
As always feel free to contact me via Twitter if you have questions or comments.