Archival Information Package Recommendations

Overview

This document seeks to define the components of Emory’s Archival Information Package (AIP) for Emory’s forthcoming preservation repository.  Emory’s Digital Preservation Functional Requirements Group (DP-FRG) was influenced throughout its tenure by ISO 14721, also known as the OAIS Reference Model.  This document pulls from the concepts identified in the OAIS Reference Model in order to develop recommendations for components of the AIP.

The OAIS Reference Model states: “The AIP is defined to provide a concise way of referring to a set of information that has, in principle, all the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object.”[1]  It then goes on to say that the AIP contains the Content Information (i.e. the content the OAIS seeks to preserve) and the Preservation Description Information (i.e. that information which is used to provide trust, access, and context by the OAIS over an indefinite period of time). Finally, the standard states that “The contents of each type of PDI are left to the discretion of the individual Archive.”[2] 

In defining the components of Emory’s Archival Information Package, the document will also define the components of the Content Information and the Preservation Description Information. This document is therefore broken up into two sections: Content Information Components and Preservation Description Information Components. In the Content Information Components section of the document, the DP-FRG enumerates which content files to include in the AIP and which content files to exclude.  In the Preservation Description Information Components section of the document, the DP-FRG enumerates what metadata about the Content Information should be preserved by the forthcoming preservation repository.

This document does not presume to make recommendations about the structure of the AIP itself (i.e. how the AIP will be stored on disk or what the file layout for the AIP should look like), as we are aware of the ongoing conversations in the digital preservation community regarding this topic (OCFL, IMLS’s Beyond the Repository Grant and its subsequent grants, PCDM, etc.).  Our hope is that the implementation team will follow these ongoing conversations and choose a direction that is in line with both the larger community’s goals and Emory’s preservation goals.

Content Information Components

Content Information is “A set of information that is the original target of preservation or that includes part or all of that information. It is an Information Object composed of its Content Data Object and its Representation Information.”[3]  Essentially, Content Information is the set of files to which the repository seeks to provide long-term access.

Files Included

Files included in the Content Information are those files deposited by the depositor or their designate (i.e. a content steward, an archivist, a librarian, or other staff member in Emory Libraries). Files may be born-digital files (e.g. disk images) or files that were created by digitizing analog content (e.g. image files of a book). Additionally, files may be reformatted[4], migrated[5], or normalized[6] versions of born-digital or digitized files.

Files Excluded

Files excluded from the Content Information are those files created by the repository for the purposes of display (e.g. lower resolution images or lower resolution video files) or for dissemination to third-party services.

Files generated for display

The DP-FRG recognizes that it may be easier for the preservation repository to keep mezzanine[7] copies of audio and video files; however, we will leave the decision of whether or not to keep those files as a component of the Content Information to the discretion of the implementation team so they may sort out issues around feasibility, compute power, and storage costs.

Files generated for dissemination to third-party services

Files created by the repository for the purposes of dissemination to third-party services should not be preserved as part of the Content Information Components. Third-party dissemination services are defined in Emory’s Dissemination Policy.

Preservation Description Information Components

Preservation Description Information (PDI) is “The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information.”[8]  The OAIS Reference Model goes on to state that a repository should define its own components of a PDI.  This section outlines Emory’s components of a PDI.

Descriptive metadata

Descriptive metadata “Describes content for search and discovery contexts -- it helps connect users to resources, and provides important context about a resource once it is discovered.”[9]  In this way, descriptive metadata helps both end users and digital preservationists understand the context of a digital object.  Therefore, the descriptive metadata schema developed by the Metadata Implementation Working Group should be included as a component of the AIP’s PDI.

Relationship information

Relationship information identifies how digital objects relate to one another within Emory’s Preservation Repository.  Relationship metadata may include which collections digital objects belong to, which digital objects are sub-objects of other digital objects, etc. Information generated by the Preservation Repository to define the relationships between digital objects should be included as a component of the AIP’s PDI.  Relationship information places the object within the Context of the other digital objects in the repository.
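
As a minimal illustration only, relationship information of the kind described above could be modeled as simple subject/predicate/object statements. The class name, the predicate names (`member_of_collection`, `sub_object_of`), and the identifiers are hypothetical and are not part of these recommendations or of any existing Emory schema.

```python
from collections import defaultdict


class RelationshipIndex:
    """Hypothetical in-memory store of relationships between digital objects."""

    def __init__(self):
        # Maps (subject, predicate) to the set of related object identifiers.
        self._edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record that `subject` is related to `obj` by `predicate`."""
        self._edges[(subject, predicate)].add(obj)

    def objects(self, subject, predicate):
        """Return the sorted identifiers related to `subject` by `predicate`."""
        return sorted(self._edges.get((subject, predicate), set()))


index = RelationshipIndex()
# Assumed identifiers, for illustration only.
index.add("object-0001", "member_of_collection", "collection-A")
index.add("object-0002", "sub_object_of", "object-0001")
```

A real repository would more likely express these relationships in whatever data model the implementation team selects; the sketch only shows the kind of information that would need to be preserved.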

Structural information

Information about the structure of a digital object identifies how the files of a digital object relate to one another.  For example, a digital object representing a book might include information about what is the first page of the book, what is the last page, and which pages follow each other.  Information generated by the Preservation Repository to define the structure of a digital object should be included as a component of the AIP’s PDI.  Structural information provides Context for the digital object.
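
The book example above can be sketched as an ordered list of files. This is a hedged illustration, assuming page order is the only structure recorded; the class name and file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StructuralMap:
    """Hypothetical record of the page order of one digital object."""

    pages: Tuple[str, ...]

    def first(self) -> str:
        """Return the file that represents the first page."""
        return self.pages[0]

    def last(self) -> str:
        """Return the file that represents the last page."""
        return self.pages[-1]

    def following(self, page: str) -> Optional[str]:
        """Return the page after `page`, or None if `page` is the last one."""
        position = self.pages.index(page)
        if position + 1 < len(self.pages):
            return self.pages[position + 1]
        return None


# Assumed file names, for illustration only.
book = StructuralMap(pages=("page_001.tif", "page_002.tif", "page_003.tif"))
```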

Technical/Characterization Metadata

“Technical, aka characterization, metadata refers specifically to a digital asset and provides information about its file composition, such as mimetype, filesize, creating software, compression, etc.”[10]  Like the three components identified before it, Technical/Characterization Metadata is preserved as part of the PDI to provide Context for the digital object.
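
To make the quoted definition concrete, the following sketch gathers two of the named properties (mimetype and filesize) using only the Python standard library. It is an assumption-laden simplification: `mimetypes.guess_type` infers the type from the file extension alone, whereas a production characterization tool would inspect the file contents. The function name and dictionary keys are hypothetical.

```python
import mimetypes
import os


def basic_technical_metadata(path: str) -> dict:
    """Collect a small, illustrative subset of technical metadata for a file."""
    # guess_type returns (type, encoding); the guess is based on the extension.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "mimetype": mime_type,  # None when the extension is not recognized
        "filesize": os.path.getsize(path),  # size in bytes
    }
```

Properties such as creating software or compression cannot be obtained this way and would require a format-aware tool chosen by the implementation team.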

Source Technical Metadata

In the course of discussion, the DP-FRG also identified that it may be necessary in some cases to capture technical metadata about the analog/source object from which the content is extracted.  This may include, but is not limited to:

  • Original/source information for digitized material
  • Original environment for disk image materials
  • Metadata which applies to the creation of a digital surrogate
  • Optimal rendering environment for disk image material

For the purposes of this document, DP-FRG has defined this information as Source Technical Metadata.  Source Technical Metadata may be provided by a depositor or their designate in a supplemental file.  The Preservation Repository should do its best to identify these supplemental files deposited as PDI and preserve the information accordingly (see Supplemental PDI Files Section below for more information).

Rights Metadata

The copyright and license status of the digital object helps to identify the access restrictions placed on the Content Information.   This information should be a component of the PDI to ensure future generations can determine how the digital content can be accessed and re-used.

Once work on Rights Metadata is completed, the Digital Preservation Functional Requirements Group is available to consult with the implementation team on Rights Metadata’s inclusion in Emory’s AIP.

Identifier information

While the Descriptive Metadata provided by the M-IWG does provide for multiple types of identifiers, the DP-FRG feels it important to be explicit that all identifiers, both internal and external to Emory, should be preserved as part of the PDI since this information ensures that the Preservation Repository can unambiguously identify or Reference a digital object.

Preservation Events and Workflows

Information identified in the Preservation Events and Workflow Recommendations[11] document should be included as a component of the PDI.  This information documents the history of the digital object once the Preservation Repository takes custody and also includes information about the fixity of the digital object.

Depositors or their designates may also deposit supplementary files that provide further information about a digital object’s fixity and add to the evidence supporting the authenticity or Provenance of the digital object.
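
Fixity information is commonly recorded as a checksum that can be recomputed and compared later. The sketch below assumes SHA-256; this document does not specify an algorithm, and the function names are hypothetical. A checksum supplied by a depositor in a supplementary file could be verified in the same way.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: str, recorded_checksum: str) -> bool:
    """Compare the current checksum of a file with a previously recorded one."""
    return sha256_of(path) == recorded_checksum.strip().lower()
```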

Supplemental PDI files

While Emory’s preservation repository will provide or generate the PDI components outlined above, there may be situations in which depositors or their designates choose to provide additional supplemental information.  Therefore, Emory’s preservation repository should provide an accommodation for depositors to deposit additional files that will enhance or improve upon the information the repository collects.  Emory’s preservation repository should do its best to identify these files as supplemental PDI files, knowing that it may not be possible to do so in all situations (e.g. it may not be clear to users of a self-deposit application that their files are PDI files rather than content-bearing files).

Since the information contained within supplemental PDI files may vary from depositor to depositor or collection to collection, it is recommended that supplemental PDI files be grouped together to distinguish information generated by the repository from these additional enhancing files.
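
One way to picture this grouping is a logical manifest that keeps supplemental PDI files apart from content files and from repository-generated PDI. This is a sketch of the logical separation only, not a recommendation for on-disk layout; the class, field, and file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AipManifest:
    """Hypothetical logical grouping of the files that make up one AIP."""

    content_files: List[str] = field(default_factory=list)
    repository_pdi: List[str] = field(default_factory=list)
    supplemental_pdi: List[str] = field(default_factory=list)

    def add_deposit(self, filename: str, is_supplemental_pdi: bool = False) -> None:
        """File a deposited item as content unless it is flagged as supplemental PDI."""
        if is_supplemental_pdi:
            self.supplemental_pdi.append(filename)
        else:
            self.content_files.append(filename)

    def add_generated_pdi(self, filename: str) -> None:
        """Record a PDI file generated by the repository itself."""
        self.repository_pdi.append(filename)


manifest = AipManifest()
# Assumed file names, for illustration only.
manifest.add_deposit("disk_image.img")
manifest.add_deposit("imaging_notes.txt", is_supplemental_pdi=True)
manifest.add_generated_pdi("technical_metadata.xml")
```

Defaulting unflagged deposits to content reflects the caveat above that the repository may not always be able to tell supplemental PDI files from content-bearing files.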


[1] ISO 14721:2012 Page 4-36

[2] ISO 14721:2012 Page 4-37

[3] ISO 14721:2012 Page 1-10

[4] For example, a .mov file reformatted to a .wav file

[5] For example, a PDF file migrated to a PDF/A file

[6] For example, a group of .tif, OCR, and similar files that are packaged together to create a PDF/A

[7] Mezzanine copies are defined as higher quality display copies.  Mezzanine copies are often created for discerning users of time-based media (e.g. a professor of music who needs to hear tones that a streaming copy provided on the internet does not reproduce).  These copies are often larger files that require significant time and compute power to generate.

[8] ISO 14721:2012 Page 1-14

[9] Digital Library Program Project: Glossary of Terminology https://wiki.service.emory.edu/display/DLPP/Glossary+of+Terminology#GlossaryofTerminology-DescriptiveMetadata

[10] Digital Library Program Project: Glossary of Terminology https://wiki.service.emory.edu/display/DLPP/Glossary+of+Terminology#GlossaryofTerminology-TechnicalMetadata