...
The DP-FRG reviewed the Preservation Storage Criteria Version 2 document and identified the thirteen most important criteria based on their knowledge of digital preservation. They then ranked these criteria by importance, with a ranking of one being the most important. The DP-FRG recommends taking these criteria into consideration when choosing a storage device or service.
Ranking | Criteria |
1 | Provides integrity checks |
2 | Secure |
3 | Supports expansion |
4 | Monitoring |
5 | Cost-efficient |
6 | Self-healing transparency |
7 | High resilience |
8 | Non-disruptive storage migrations |
9 | Geographic separation |
10 | File system limits |
11 | Designed for zero data loss |
12 | Scalable to large data sizes |
13 | Supports a global namespace |
...
Below are the definitions of the above terms as provided by the Preservation Storage Criteria Version 2 document.
Provides integrity checks. Performs verifiable and/or auditable integrity checking as part of the preservation storage.
Secure. Includes safeguards, data security and documented procedures to prevent security incidents related to hardware, software, personnel, and physical structures, areas and devices.
Supports expansion. Can increase storage over time as needed.
Monitoring. Supports ability to observe or check activity in the storage infrastructure (e.g. see activity in real-time, examine logs, observe the performance status, determine the overall status or drill-down into activities).
Cost-efficient. Costs relatively less than other more expensive solutions per GB, by being designed with cost efficiencies, for example, has resource pooling and sharing, multi-tenancy (multiple users share the same applications).
Self-healing transparency. Systems that use mechanisms to correct altered data (like bit corruptions) do so in a transparent, documented manner.
High resilience. Has high resilience, which is the ability to adapt under stress or faults (e.g. resilient to equipment failures, power outages, attacks, surges in user demand).
Non-disruptive storage migrations. Allows for storage tier changes over time (without disruption to availability).
Geographic separation. Ensures multiple redundant copies in geographically-separate locations for protection from catastrophic loss.
File system limits. Able to support long file, path or directory names; large numbers of files in a directory; and diverse character encodings.
Designed for zero data loss. Error detection and correction 24/7/365 (e.g. using RAID, Erasure coding, ZFS, triple copies/rebuild).
Scalable to large data sizes. Able to support very large amounts of content, e.g. multiple PBs of data, hundreds of millions of files and directories, terabyte size files.
Supports a global namespace. Nothing limits ability to have a global (i.e. consolidated) view of files.
Preservation System Selection Criteria
While reviewing the Preservation Storage Criteria Version 2 document, the DP-FRG felt some of the criteria outlined should be performed by the preservation repository itself, rather than by the storage devices and services the preservation repository utilizes. Those criteria are:
Provides preservation actions. Provides tools and/or services to support digital preservation actions (e.g. fixity checking, migration, auditing processes) as part of the preservation storage.
Recovery. Has documented ability to replace any corrupt/bad file, file system, or large-scale set of files in reasonable/expected/negotiated timeframes.
Access controls. Provides role-based access controls for storage infrastructure, e.g. user, staff, admin, to ensure only the appropriate people have the appropriate levels of access.
Complete exports. Supports the bulk exporting of content and metadata for any reason, at an acceptable rate, for example, as part of an exit strategy.
...
One of the main preservation methods a repository employs to ensure integrity over time is bit-stream copying: by keeping multiple preservation copies of a digital object, a repository can verify that content remains unchanged over time. Furthermore, bit-stream copying ensures that if one copy of a digital object does become corrupt, the repository has another preservation copy from which to restore the bit-stream.
For the purposes of the forthcoming preservation repository, a preservation copy is defined as:
A single instance of a bit-stream created by the preservation repository and stored on a storage device or with a storage service.
For example, the preservation repository may generate a preservation copy and store it on a local storage device within the LITS data center. The storage device may have a service that backs up the preservation copy to another location; however, this backup does not count as a preservation copy. By the same logic, mirroring of storage devices for fail-safe purposes does not count as a second preservation copy.
Likewise, if the preservation repository generates a preservation copy and sends that copy to a service like the Digital Preservation Network (DPN)[2], the preservation repository has only generated a single preservation copy of the digital object. Even though DPN generates multiple copies as part of its service, those copies are not considered to be preservation copies by the preservation repository.
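The counting rule in the definition above can be illustrated with a minimal Python sketch. The class name, field names, and location labels are hypothetical and chosen only for illustration; the single assumption encoded is that an instance counts as a preservation copy only when the preservation repository itself created it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoredInstance:
    """One stored instance of a bit-stream (hypothetical model)."""
    location: str
    # True only if the preservation repository itself created this instance.
    created_by_repository: bool


def count_preservation_copies(instances: List[StoredInstance]) -> int:
    """Count only the instances the preservation repository created."""
    return sum(1 for inst in instances if inst.created_by_repository)


# One copy written by the repository, plus a device-level backup and a mirror.
instances = [
    StoredInstance("local storage device", created_by_repository=True),
    StoredInstance("backup made by the storage device", created_by_repository=False),
    StoredInstance("fail-safe mirror", created_by_repository=False),
]

preservation_copy_count = count_preservation_copies(instances)
```

Under this model the backup and the mirror are ignored, so the example holds exactly one preservation copy.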
...
As mentioned above, bit-stream copying is deemed to be one of the most basic ways a repository meets preservation obligations. The consensus among the preservation community is to keep at least three preservation copies of a digital object. If only one preservation copy is kept, the preservation repository would risk losing the digital object should that copy become corrupt. If only two preservation copies are kept, it might become difficult for the preservation repository to determine which preservation copy is the “true copy” if one of them becomes corrupt. By keeping at least three preservation copies, the preservation repository ensures there are always two other preservation copies to poll should one copy become corrupt.
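The polling argument can be sketched in Python as a majority vote over checksums. This is an illustration only, not the repository's implementation: the function names are hypothetical, SHA-256 is an assumed choice of checksum, and the copies are passed in as in-memory bytes for simplicity.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a bit-stream."""
    return hashlib.sha256(data).hexdigest()


def poll_copies(copies: Dict[str, bytes]) -> Tuple[Optional[str], List[str]]:
    """Compare the copies by checksum and take a majority vote.

    Returns (agreed_digest, suspect_copy_names). If no digest is shared by
    a strict majority of copies, agreed_digest is None and every copy is
    reported as suspect, because the "true copy" cannot be determined.
    """
    digests = {name: sha256_of(data) for name, data in copies.items()}
    digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        return None, sorted(digests)
    return digest, sorted(n for n, d in digests.items() if d != digest)


original = b"example bit-stream"
damaged = b"example bit-strean"

# Three copies: the two intact copies outvote the damaged one.
agreed, suspects = poll_copies({"copy1": original, "copy2": original, "copy3": damaged})

# Two copies that disagree: there is no majority, so neither can be trusted.
agreed_two, suspects_two = poll_copies({"copy1": original, "copy2": damaged})
```

With three copies the vote identifies `copy3` as the corrupt one; with two disagreeing copies the function returns no agreed digest, which is the ambiguity described above.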
In addition to keeping three preservation copies, the preservation community agrees that utilizing diverse storage infrastructure for each preservation copy is important. This means that the preservation copies cannot all be stored on local Isilon storage, nor can they all be stored in Amazon. Instead, the preservation repository should choose storage devices and services based on their technology diversity.
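The two rules together (at least three copies, on diverse infrastructure) could be checked with a small Python sketch. The function name and the technology labels are hypothetical, and the sketch assumes the strict reading that every preservation copy sits on a different storage technology; a looser policy would only require that the copies do not all share one technology.

```python
from typing import List


def meets_copy_policy(copy_technologies: List[str], minimum_copies: int = 3) -> bool:
    """Check a list of storage technology labels, one per preservation copy.

    Assumes the strict reading: at least `minimum_copies` copies exist and
    no two copies share the same storage technology.
    """
    enough_copies = len(copy_technologies) >= minimum_copies
    all_distinct = len(set(copy_technologies)) == len(copy_technologies)
    return enough_copies and all_distinct


# "tape" is a hypothetical third technology used only for illustration.
diverse_plan = meets_copy_policy(["isilon", "amazon", "tape"])
all_local_plan = meets_copy_policy(["isilon", "isilon", "isilon"])
two_copy_plan = meets_copy_policy(["isilon", "amazon"])
```

Here only the first plan passes: the second keeps every copy on the same technology, and the third has too few copies.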
...