Preservation Storage Recommendations

Overview

This document outlines preservation storage recommendations developed by the DLP Digital Preservation Functional Requirements Group (DP-FRG). It is intended to guide the Technology Implementation Working Group and future phases of the project when making storage decisions.

Preservation Storage Selection Criteria

To help outline the DLP's criteria for preservation storage, the DP-FRG drew on resources available through the Preservation and Archiving Special Interest Group (PASIG), chief among them a document entitled Preservation Storage Criteria Version 2[1].

Ranking Selection Criteria

The DP-FRG reviewed the Preservation Storage Criteria Version 2 document and identified the thirteen most important criteria based on their knowledge of digital preservation. They then ranked these criteria by importance, with a ranking of one being the most important. The DP-FRG recommends taking these criteria into consideration when choosing a storage device or service.

Ranking    Criteria
1          Provides integrity checks
2          Secure
3          Supports expansion
4          Monitoring
5          Cost-efficient
6          Self-healing transparency
7          High resilience
8          Non-disruptive storage migrations
9          Geographic separation
10         File system limits
11         Designed for zero data loss
12         Scalable to large data sizes
13         Supports a global namespace

Definitions of the Criteria

Below are the definitions of the above terms as provided by the Preservation Storage Criteria Version 2 document. 

Provides integrity checks. Performs verifiable and/or auditable integrity checking as part of the preservation storage

Secure. Includes safeguards, data security and documented procedures to prevent security incidents related to hardware, software, personnel, and physical structures, areas and devices.

Supports expansion. Can increase storage over time as needed

Monitoring. Supports ability to observe or check activity in the storage infrastructure (e.g. see activity in real-time, examine logs, observe the performance status, determine the overall status or drill-down into activities)

Cost-efficient. Costs relatively less than other more expensive solutions per GB, by being designed with cost efficiencies, for example, has resource pooling and sharing, multi-tenancy (multiple users share the same applications)

Self-healing transparency. Systems that use mechanisms to correct altered data (like bit corruptions) do so in a transparent, documented manner.

High resilience. Has high resilience, which is the ability to adapt under stress or faults (e.g. resilient to equipment failures, power outages, attacks, surges in user demand)

Non-disruptive storage migrations. Allows for storage tier changes over time (without disruption to availability)

Geographic separation. Ensures multiple redundant copies in geographically-separate locations for protection from catastrophic loss.

File system limits. Able to support long file, path or directory names; large amount of files in a directory, diverse character encodings.

Designed for zero data loss. Error detection and correction 24/7/365 (e.g. using RAID, Erasure coding, ZFS, triple copies/rebuild)

Scalable to large data sizes. Able to support very large amounts of content, e.g. multiple PBs of data, hundreds of millions of files and directories, terabyte size files

Supports a global namespace. Nothing limits ability to have a global (i.e. consolidated) view of files.

Preservation System Selection Criteria

While reviewing the Preservation Storage Criteria Version 2 document, the DP-FRG felt that some of the criteria outlined should be fulfilled by the preservation repository itself, rather than by the storage devices and services the preservation repository utilizes. Those criteria are:

Provides preservation actions. Provides tools and/or services to support digital preservation actions (e.g. fixity checking, migration, auditing processes) as part of the preservation storage

Recovery. Has documented ability to replace any corrupt/bad file, file system, or large-scale set of files in reasonable/expected/negotiated timeframes

Access controls. Provides role-based, access controls for storage infrastructure, e.g. user, staff, admin, to ensure only the appropriate people have the appropriate levels of access

Complete exports. Supports the bulk exporting of content and metadata for any reason, at an acceptable rate, for example, as part of an exit strategy
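As an illustration of the fixity checking named under "Provides preservation actions", the following is a minimal sketch of how a repository might verify a stored file against a previously recorded checksum. The function names (sha256_of, fixity_ok) and the choice of SHA-256 are assumptions made for this example; the DP-FRG does not prescribe a particular algorithm or implementation.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that very large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected_digest):
    """Compare the file's current digest with the digest recorded
    when the preservation copy was created (hypothetical workflow)."""
    return sha256_of(path) == expected_digest.strip().lower()
```

A repository would typically record the digest at ingest and rerun a check like this on a schedule, logging each result so that the checks remain auditable.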

Preservation Copy Recommendations

One of the main preservation methods a repository employs to ensure integrity over time is bit-stream copying; by keeping multiple preservation copies of a digital object, a repository helps ensure content remains unchanged over time. Furthermore, bit-stream copying ensures that if a digital object does become corrupt, the repository has another preservation copy to guarantee the bit-stream’s safety.

For the purposes of the forthcoming preservation repository, a preservation copy is defined as:

A single instance of a bit-stream created by the preservation repository and stored on a storage device or with a storage service. 

For example, the preservation repository may generate a preservation copy and store it on a local storage device within the LITS data center. The storage device may have a service that backs up the preservation copy to another location; however, this backup does not count as a preservation copy. By the same logic, mirroring storage devices for fail-safe purposes does not count as a second preservation copy.

Likewise, if the preservation repository generates a preservation copy and sends that copy to a service like the Digital Preservation Network (DPN)[2], the preservation repository has only generated a single preservation copy of the digital object. Even though DPN generates multiple copies as part of its service, those copies are not considered to be preservation copies by the preservation repository.

Number of copies

As mentioned above, bit-stream copying is deemed to be one of the most basic ways a repository meets preservation obligations. The consensus among the preservation community is to keep at least three preservation copies of a digital object. If only one preservation copy is kept, the preservation repository would risk corruption of the digital object. If only two preservation copies are kept, it might become difficult for the preservation repository to determine which preservation copy is the “true copy” if one of them becomes corrupt. By keeping at least three preservation copies, the preservation repository ensures there are always two preservation copies to poll should one copy become corrupt.
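The polling argument above can be sketched as a simple majority vote over checksums. This is an illustration only: the function name poll_copies, the location labels, and the use of checksum strings as the comparison value are assumptions for this example, not part of the DP-FRG recommendation.

```python
from collections import Counter


def poll_copies(digests):
    """Given a mapping of copy location -> checksum, return a tuple of
    (agreed_checksum, locations_that_disagree).

    If no checksum is held by a strict majority of copies, the "true copy"
    cannot be determined and (None, all_locations) is returned.
    """
    if not digests:
        raise ValueError("at least one preservation copy is required")
    counts = Counter(digests.values())
    checksum, votes = counts.most_common(1)[0]
    if votes * 2 <= len(digests):
        # No strict majority, e.g. two copies that disagree with each other.
        return None, sorted(digests)
    suspect = sorted(name for name, value in digests.items() if value != checksum)
    return checksum, suspect
```

With three copies where one has drifted, the two matching checksums outvote the third and identify it as the copy to repair. With only two copies that disagree, the function returns no agreed checksum, which is the ambiguity the paragraph above describes.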

In addition to keeping three preservation copies, the preservation community agrees that utilizing diverse storage infrastructure for each preservation copy is important. This means that the preservation copies cannot all be stored on local Isilon storage, nor can they all be stored in Amazon. Instead, the preservation repository should choose storage devices and services based on their technology diversity.
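As a rough sketch of how these two recommendations might be checked together, the example below flags a set of preservation copies that has fewer than three members or that relies on a single storage technology. The PreservationCopy class, the check_copies function, the technology labels, and the threshold of two distinct technologies are all hypothetical; the threshold reflects the minimal reading that the copies must not all share one technology and could be raised if each copy is expected to use a different one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationCopy:
    location: str     # hypothetical label, e.g. "LITS data center"
    technology: str   # hypothetical label, e.g. "isilon" or "amazon"


def check_copies(copies, minimum_copies=3, minimum_technologies=2):
    """Return a list of human-readable problems; an empty list means the
    copies satisfy the assumed thresholds."""
    problems = []
    if len(copies) < minimum_copies:
        problems.append(
            f"{len(copies)} preservation copies found; "
            f"at least {minimum_copies} recommended"
        )
    technologies = {copy.technology for copy in copies}
    if len(technologies) < minimum_technologies:
        problems.append(
            f"{len(technologies)} storage technologies in use; "
            f"at least {minimum_technologies} recommended"
        )
    return problems
```

Note that backups or mirrors made by a storage device or service would not be listed as separate PreservationCopy entries, in keeping with the definition of a preservation copy given earlier.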

A Note on Sensitive Information

The DP-FRG discussed some of the concerns around sensitive information; however, policies at Emory around sensitive information were still unclear at the time of writing. The DP-FRG has asked that the Library Technology and Digital Strategies Division work with relevant staff in LITS and the wider University to identify a path forward for preserving sensitive information.

 



[1] Preservation Storage Criteria, Version 2. May 2017. https://docs.google.com/document/d/1CEkcWskAbph0gQ4ATK_SWYw-6-8zBxnGFfqZiwtfpEI/edit

[2] Digital Preservation Network (DPN). https://dpn.org/about