Preservation Event and Workflow Recommendations

Overview

This document outlines recommendations for repository Events and Workflows as they relate to preservation. For the purposes of this document, an Event is defined as a discrete action, or a series of actions, performed on a digital object in preparation for ingest into the repository. A Workflow is defined as a series of Events performed on a digital object.

In the first section of this document, the Digital Preservation Functional Requirements Group (DP-FRG) identifies Events the repository will perform on digital objects, provides the PREMIS definition of those events, and specifies additional information the DP-FRG deemed important to provide to the DLP implementation team.

As stated, the DP-FRG has supplied notes on each Event to provide recommendations to the implementation team in advance of their work to design and develop Emory’s preservation repository.  It is noted that the implementation team may, during the course of developing the repository, identify additional requirements or change some of the recommendations based on technical feasibility and/or limitations. It is assumed that all software developed to perform an Event will be idempotent, meaning if the software performs an action on the same object multiple times, the software will achieve the same result each time.

In addition to identifying requirements for the DLP implementation team, the DP-FRG notes will help inform the Metadata Implementation Working Group in developing a specification for preservation metadata.  For all Events, it is assumed that the date and time the event is performed is recorded, as is the agent (human or machine) performing the Event. When recording that a human has performed an Event, the repository should record which human has initiated the work.
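As an illustration only, the standard event information described above (date and time, agent, and whether the agent is a human or a machine) could be modeled as a small record. This is a minimal sketch, not a metadata specification: the class name `EventRecord` and all field names are hypothetical and do not come from PREMIS or from the DP-FRG.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EventRecord:
    """Standard information recorded for every Event (hypothetical field names)."""
    event_name: str                 # e.g. "Fixity Check"
    agent: str                      # software name, or the user who initiated the work
    agent_type: str                 # "human" or "machine"
    performed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    outcome: Optional[str] = None   # e.g. "pass" or "fail", for Events that have one

    def __post_init__(self) -> None:
        # Reject records that omit the agent or use an unknown agent type.
        if self.agent_type not in ("human", "machine"):
            raise ValueError("agent_type must be 'human' or 'machine'")
        if not self.agent:
            raise ValueError("the agent performing the Event must be recorded")


# Example: a depositor (assumed user id "jdoe") edits metadata.
record = EventRecord("Metadata Modification", agent="jdoe", agent_type="human")
print(record.event_name, record.agent_type, record.performed_at.isoformat())
```

Recording the timestamp in UTC is a design choice made for this sketch; the source text only requires that the date and time be recorded.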

In the second section of this document, the DP-FRG identifies four types of Workflows, defines those Workflows, and identifies which Events would be performed within each Workflow. Not all digital objects will pass through every Event within a Workflow; the DP-FRG has noted those exceptions.

It should be noted that this is not a complete list of all the Events and Workflows provided by the repository. Instead, this document outlines the Events and Workflows the DP-FRG deems necessary for appropriately preserving content for long-term access. This list may change over time as technology and the field of digital preservation advance.

Repository Events

The Digital Preservation Functional Requirements Group (DP-FRG) has identified Events the repository may record as having been performed on digital objects.  The DP-FRG used the Library of Congress Preservation Event Type Vocabulary, and selected Events it felt would be most useful for staff responsible for providing long-term access to content.

Events by Event Type

There are two types of events the repository may perform: Automated Events and Human Initiated Events.

Automated Events are Events the repository performs on digital objects automatically, either as part of normal operations or automatically as part of its Workflows. 

Human Initiated Events are those events initiated by a user through features implemented by the repository’s applications.  For example, in a self-deposit application, a user may choose to create a new version of a digital object and then modify its descriptive metadata.  In this example, the repository has not automatically edited the descriptive metadata, but instead the user has chosen to do so via the application.

Additionally, Human Initiated Events may be performed outside the repository: for instance, prior to the deposit of a new digital object; or in the process of creating a new version of an existing digital object (which is then re-deposited into the repository).

List of Repository Events

Below is a list of Events the DP-FRG has identified as those the repository may perform.

Each entry below gives, in order: Event Name (PREMIS or Local); Definition (PREMIS or Local); Event Type; DP-FRG Notes.

Fixity Check

The process of verifying that an object has not been changed in a given period. This event will most likely utilize the results of the "message digest calculation" event.


Automated/Human Initiated

This event is used to determine whether the fixity of the components of an object has changed. It is different from the message digest calculation, which generates the first hash for an object.

The repository should record whether the digital object has passed or failed its fixity check. If the digital object has failed its fixity check, the repository should note which files within the object failed the fixity check.

The repository should record both Automated and Human Initiated fixity checks.

Format Identification

The process of determining the object's file format and version. Note that this event is different from validation, which compares the object to known format specifications.


Automated

In addition to capturing the standard event information, the repository should capture the format of each of the files associated with a digital object. 

If the repository is unable to identify a file format for a component of a digital object, the repository should record that no format was identified and which file could not be identified, and should then notify staff depositors that no format has been identified. Note that self-depositors should not be notified that a digital object has failed format identification.

In addition to the above information, the repository should also record what tool was used to identify the file format (FITS, JHOVE, etc.).


[Workflow] End

The process of completing a workflow.


Note this is not a PREMIS defined event.


Automated

This would be the last step of any workflow. It is essentially a strawman event, a signpost to indicate that something has finished. It would automatically complete once all other events in the workflow complete.

[Workflow] Start

The process of starting a workflow.


Note this is not a PREMIS defined event.


Automated

This would be the first step of any workflow. It is essentially a strawman event, a signpost to indicate that something has started. It would automatically complete once the workflow starts.

Message digest calculation

The process by which a message digest ("hash") is created.


Automated

The repository will calculate SHA-1, MD5, and SHA-256 hashes for all of the components of a digital object.

The repository can calculate the initial hashes for an object when it first takes the files in for accessioning.

However, users may wish to supply a hash at the beginning of the accession workflow. The repository should allow for this use case.  If the user provides hashes but does not supply the date, time, and agent (i.e. the standard event information), the message digest calculation date, time, and agent will be recorded as the date and time of the deposit and as the depositor performing the deposit.

If the depositor supplies hashes, the repository will still generate its own set of hashes and then perform a fixity check to ensure the hashes match.  


Metadata Extraction

The process of extracting metadata from an object.  This includes technical, administrative, and descriptive metadata.

Automated

Noting there are multiple types of metadata, the repository should record which type of metadata is extracted (descriptive, technical, rights, etc.) alongside the standard date, time, agent information. 

Additionally, the repository should record where the metadata is extracted from (e.g. Alma, ArchivesSpace, FITS, JHOVE, etc.) and the standard of the originating metadata (e.g. the metadata extracted comes from MARC, EAD, Dublin Core, etc.).


Quarantine

The process of segregating objects for designated periods of time.


Automated

Not all digital objects will pass through a virus check. For example, files coming off of the scanner may not need to pass through a virus check.

When a digital object passes through the virus checking software, the software should attempt to clean the digital object’s files.  If the virus check software is unable to clean the file, the repository should place the object in quarantine and notify the depositor of the failure. The depositor can then determine next steps.

The repository should record which files of the digital object were the cause of placing the object in quarantine.


Replication

The process of creating a copy of an object that is, bit-wise, identical to the original.

Automated

The repository should create at least three copies of a digital object. When the repository creates a copy of a digital object, it should record the location of the additional copies and, if the additional copies are sent to a third-party preservation service, their identifiers.


Validation

The process of comparing an object with a standard and noting compliance or exceptions. The object being validated may be a file or an information package.


Automated

There are multiple types of validation the repository may perform. For example, the repository may validate that appropriate metadata has been supplied, validate that rights have been assigned, or validate the file format.

Regardless of the type of validation, the repository should record the type of validation performed and whether or not the validation was successful. If possible, the repository should collect any error messages to aid staff in determining why the digital object or one of its components failed validation.

It should be noted that failing file validation should not preclude content from being ingested into the repository.


Virus check

The process of scanning a file for malicious programs.


Automated

See notes associated with quarantine. 

The repository should record whether or not the files of a digital object pass the virus check. If a file does not pass the virus check, the repository should also record which file failed.

The repository should also record which virus checking software and version was used to perform the virus check.


Normalization

An act of transforming an object into an institutionally supported preservation format.

Automated/Human Initiated

Normalization is performed to ensure long-term access to an object or to make it easier to migrate an object to a different format at a later date. Note that not all normalization will happen automatically in the repository. It will be easier to automatically perform normalization for common formats that Emory generates (e.g. TIFFs, OCR, and similar files used to automatically generate a PDF) and for which there are widely available tools to help generate those formats (FFmpeg, ImageMagick, tiff2pdf, etc.). However, the repository will not be able to normalize every file that is deposited.

When normalization is performed automatically, the repository should record which tools were used in the normalization process and the outcome of the normalization (i.e. success or failure).

When it is not possible for the repository to automate normalization, depositors should be able to deposit a normalized file as part of the digital object.  In these situations, the repository may only record the standard event information, although staff depositors may choose to provide the additional Automated Event information at the time of deposit.

Capture

The process whereby a repository actively obtains an object through means other than a transfer from the creator/donor.


Automated/Human Initiated

This process is distinct from the metadata extraction process, which only extracts a component of a digital object (i.e. descriptive metadata, technical metadata, etc.). Instead, this event focuses on the process of harvesting an entire digital object from a third-party source. Existing processes that utilize this event include the OpenEmory application developed for Scholarly Communications and Internet Archive’s web archiving service utilized by Emory’s University Archives.

In addition to capturing the standard event information, the repository should capture which service was utilized to perform the capture.

Filename Change

The process of modifying a filename.

Human Initiated

The repository should record the original filename and the filename it was changed to.


Information Package Merging

The process of merging two or more Information Packages (SIP, AIP, or DIP) into one Information Package of the same type. This event does not cover moving multiple information packages across types.


Human Initiated

[Place holder]

Metadata Modification

The process of making changes to the metadata of an object.


Human Initiated

For the purposes of the repository, all types of metadata, not just descriptive metadata, are considered.

For the purposes of preservation, metadata is deemed modified when a depositor creates a new version of a digital object and then re-submits the digital object for ingest. Metadata edits should not automatically be versioned and re-ingested into the repository; instead, it should be the responsibility of the depositor to determine when a metadata edit requires resubmitting to the repository for ingest.

If metadata is modified and ingest doesn’t complete for whatever reason, when the object goes through the workflow again, any additional metadata updates will be captured.


Policy Assignment

The process of assigning a policy to an object. Policies can be related to rights, preservation, access, etc.


Human Initiated

The discovery phase of Emory’s preservation repository has only identified requirements for rights and access policies. Initially there was discussion of creating preservation levels, but those levels were discarded in favor of a single preservation policy for all content deposited into the repository.

The repository should record which policy has been assigned to a digital object.


Recovery

The act of regaining one or more files after a disaster. Usually occurs as part of a disaster recovery process.


Human Initiated

The repository is unable to automate recovery of an object, files, or metadata; however, if recovery is necessary, it would be the responsibility of repository administrators to record in the repository the date of the failure, a note about what the failure was, and the date the failure was rectified.


Redaction

The process of modifying the content of a digital object to remove or mask information considered to be sensitive in nature (that is, the information cannot be viewed by non-authorized users of the repository). Redaction usually takes place on a copy of the object.


Human Initiated

When content is redacted, users should supply a note containing information about what has been redacted and/or why information has been redacted.

Un-quarantine

The process of releasing a file from quarantine.


Human Initiated

When a file is un-quarantined, staff should be required to enter a note about how the virus was cleaned from the file.

It should be noted that files that we are unable to un-quarantine should not be deposited into the repository.
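The Message digest calculation and Fixity Check notes in the list above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib` module, not the repository's implementation: the function names, the shape of the stored digest record, and the sample filename are all assumptions made for the example. It computes SHA-1, MD5, and SHA-256 for each file, then reports pass or fail along with the names of any files whose digests no longer match.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List

ALGORITHMS = ("sha1", "md5", "sha256")


def calculate_digests(path: Path, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Message digest calculation: hash one file with every algorithm in one pass."""
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def fixity_check(files: List[Path], stored: Dict[str, Dict[str, str]]) -> Dict[str, object]:
    """Fixity Check: recompute digests, compare to stored ones, list failing files."""
    failed = []
    for path in files:
        current = calculate_digests(path)
        expected = stored[path.name]
        # Compare only the algorithms present in the stored record, so a
        # depositor-supplied subset (for example MD5 only) can still be verified.
        if any(current[alg] != value.lower() for alg, value in expected.items()):
            failed.append(path.name)
    return {"outcome": "fail" if failed else "pass", "failed_files": failed}


# Demonstration with a temporary file (hypothetical filename).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page_0001.tif"
    sample.write_bytes(b"example content")
    stored = {sample.name: calculate_digests(sample)}
    print(fixity_check([sample], stored))   # file unchanged since digests were stored
    sample.write_bytes(b"altered content")
    print(fixity_check([sample], stored))   # file changed, so it is listed as failed
```

Because the digests depend only on the file contents, running `calculate_digests` repeatedly on an unchanged file gives the same result, which is consistent with the idempotency assumption stated in the Overview. The same `fixity_check` function could be used to compare depositor-supplied hashes against the hashes the repository generates itself.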



Other Events

Events that are not listed in the List of Repository Events section above may still be performed outside the repository. Events that fall within this category are often performed through programmatic means; that is to say, one or more digital objects are exported for modification, scripts are executed to modify all or some portion of the objects, and the modified objects are imported back into the repository.

These types of Events often occur when a human creates, develops, or instigates the Event. For example:

  • a human requests a developer write a script to modify metadata en masse
  • a human migrates a file from format A to format B or a human requests a developer write a script to perform the migration
  • a human performs actions on the objects prior to ingest

For Events that fall within this category, depositors may deposit supplemental files with any information they deem relevant about the actions performed on the digital objects. If the depositor is creating a new version of an object, the depositor will be given the opportunity to add a note indicating why they are creating the new version.

Repository Workflows

The Events outlined above will be performed within the repository’s Workflows. Below are four types of Workflows the DP-FRG has identified for inclusion in the repository. Note that the Accession and Dissemination Workflows may have multiple instantiations based on the context from which they are derived (e.g. a self-deposit accession workflow, a book scanner accession workflow, a HathiTrust dissemination workflow, a DPN dissemination workflow, etc.).

Accession Workflow

An Accession Workflow is the process by which depositors gather together the components of a digital object for submission to Emory’s preservation repository (i.e. developing a submission information package or SIP). Note that the Accession Workflow may have multiple instantiations depending on the method of deposit (self-deposit v. staff-deposit) or the stream from which the content originates (born-digital v. reformatted).

List of Accession Events

Below is a list of events that may occur within an Accession Workflow.


Event Name

[Workflow] End

[Workflow] Start

Fixity Check

Message digest calculation

Metadata Extraction

Quarantine

Virus check

Normalization

Capture

Information Package Merging

Policy Assignment

Un-quarantine

Dissemination Workflow

A Dissemination Workflow is the process by which the repository generates the components of a digital object to be distributed to Emory’s designated dissemination services[1] (i.e. developing a dissemination information package or DIP). Note that the Dissemination Workflow may have multiple instantiations depending on the method of dissemination (preservation dissemination v. access dissemination) or the service the digital object is being disseminated to (HathiTrust v. Internet Archive). Also note that most, if not all, Dissemination Workflows should follow the Ingestion Workflow. This will ensure that the object being disseminated is one the repository has control over.

List of Dissemination Events

Below is a list of events that should occur within the Dissemination Workflow.


Event Name

[Workflow] End

[Workflow] Start

Fixity Check

Metadata Extraction

Replication

Normalization

Ingestion Workflow

The Ingestion Workflow is the process by which the repository gathers together and/or generates the components of a digital object that will be preserved (i.e. developing an archival information package or AIP). Ideally, there will be one single Ingestion Workflow using the components of a SIP to develop the AIP. If it is not possible to have a single Ingestion Workflow, the resulting workflows should conform as closely as possible to the same Events. Note the Ingestion Workflow should follow the Accession Workflow and precede the Dissemination Workflow.

List of Ingestion Events

Below is a list of events that should occur within the Ingestion Workflow.


Event Name

[Workflow] End

[Workflow] Start

Fixity Check

Format Identification

Metadata Extraction

Validation

Version Workflow

The Version Workflow is the process by which a user of the repository modifies an existing digital object. This work may be performed outside the repository, with the object versioned to capture the changes and then re-deposited. Ideally there will be a single Version Workflow that will allow users to pick why they are creating a new version and/or provide notes on the new version. The Version Workflow should only be performed on those objects that have been fully ingested by the repository (i.e. have completed the Ingestion Workflow). Once the Version Workflow is completed, the object should be sent back through some combination of the Accession, Ingestion, and Dissemination Workflows.
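The ordering constraints described for the four Workflows can be sketched as a simple prerequisite check. This is an illustration under stated assumptions, not a design for the repository: the `PREREQUISITES` table, `can_start`, and `run_workflow` are hypothetical names, the sketch treats "Dissemination follows Ingestion" as a strict rule although the text says "most, if not all", and it only builds an event log rather than performing any Event.

```python
from typing import Dict, List, Set

# Prerequisite workflows as read from the workflow descriptions (simplified).
PREREQUISITES: Dict[str, Set[str]] = {
    "Accession": set(),
    "Ingestion": {"Accession"},
    "Dissemination": {"Ingestion"},
    "Version": {"Ingestion"},
}


def can_start(workflow: str, completed: Set[str]) -> bool:
    """Return True when every prerequisite workflow has completed for the object."""
    return PREREQUISITES[workflow] <= completed


def run_workflow(workflow: str, events: List[str], completed: Set[str]) -> List[str]:
    """Return the ordered event log, bracketed by the Start and End signpost events."""
    if not can_start(workflow, completed):
        missing = sorted(PREREQUISITES[workflow] - completed)
        raise RuntimeError(f"{workflow} Workflow requires: {missing}")
    log = ["[Workflow] Start", *events, "[Workflow] End"]
    completed.add(workflow)  # mark the workflow as completed for this object
    return log


completed: Set[str] = set()
print(run_workflow("Accession", ["Virus check", "Message digest calculation"], completed))
print(can_start("Version", completed))    # Ingestion has not completed yet
print(run_workflow("Ingestion", ["Fixity Check", "Format Identification"], completed))
print(can_start("Version", completed))    # the object is now fully ingested
```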

List of Version Events

Below is a list of events that should occur within the Version Workflow.

Note that Fixity Check and Message digest calculation are listed here not because they are actions that constitute the creation of a new version, but because a user may upload files when creating a new version of an object, and these events should happen whenever files are uploaded.


Event Name

[Workflow] End

[Workflow] Start

Fixity Check

Message digest calculation

Capture

Filename Change

Information Package Merging

Metadata Modification

Policy Assignment

Redaction




[1] Designated dissemination services are outlined in the Third Party Dissemination Policy of the Digital Collections Steering Committee Policy Suite.

