Metadata Management: Systems of Record for the DLP Repository

Prepared by: DLP Metadata Implementation Working Group

Revised: March 2018

Status: Approved





Overview

The charge for the Metadata Implementation Working Group (M-IWG) of the Digital Library Program’s Discovery Phase includes a deliverable for:

...develop [a] metadata management strategy identifying canonical systems of record for descriptive metadata workflows

In response to this charter task, this document analyzes current-state repository-related metadata workflows and system interactions for Descriptive metadata, as well as for other major categories of metadata in the digital repository context. Because of significant dependencies on current-state system integrations, active metadata platform migrations, and larger organizational policies under development, M-IWG has not provided specific system-level recommendations for Descriptive metadata. Broader recommendations for future-state interactions with the repository (including types of metadata beyond Descriptive metadata) are, however, identified in the Recommendations for Future State section of this document and are summarized below.

Summary of Recommendations

The DLP repository should serve as the system of record for creating/editing:

  • Preservation Events/Workflows Metadata

  • Technical Metadata

  • Administrative metadata for repository-specific workflows

  • Structural metadata for repository object structures

  • Descriptive Metadata for self-deposit by end users

  • Rights metadata for digital objects stored in the repository

Formalization of systems of record supplying the DLP repository should be determined for:

  • Descriptive Metadata created and managed by Library Staff

DLP implementation teams, in partnership with content stewards and metadata providers, should develop both short term and longer term strategies to assist with forthcoming migrations of data to the DLP environment. Development of longer-term strategies requires engagement with appropriate Emory Libraries committees such as the Metadata Strategy Group or other Emory Libraries initiatives supporting large-scale metadata management strategy.

Current State Analysis

Workflows and Systems

The following current state systems’ workflows were included in this analysis, based upon the original DLP business case and systems noted in the approved Digital Library Program Phase 1 Scope Statement: The Keep, OpenEmory, ETDs, Dataverse, DAMS, and Digitized Books. The following current state workflows were reviewed, based on their starting point of deposit, as key indicators for repository metadata management areas:

  1. Disk images and born digital files: staff deposited material

  2. Digitized books: staff deposited material

  3. Digitized audio: staff deposited material

  4. Digitized video: staff deposited material

  5. Digitized still images: staff deposited material

  6. ETDs: self deposited material

  7. Research data sets: self deposited material

  8. OpenEmory: staff deposited, self deposited and harvested materials

The following local and external systems/applications were identified as intersecting with Fedora repository workflows involving exchange of metadata, either automated or manual:

  1. The Keep

  2. Emory Finding Aids

  3. ETDs (new)

  4. OpenEmory

  5. Emory FIRST/Symplectic Elements

  6. PubMed

  7. Primo (receipt of OAI outputs)

  8. Emory Shared Data

  9. Additional custom scripts/integrations that extend the above systems but are not standalone applications

The following systems/applications are not actively integrated with the Fedora repository currently, but are either anticipated for potential future integration, or will contain metadata for content designated as in scope for the repository, based on the DLP Discovery Phase Scope Statement:

  1. Digitized Books

  2. LIMB

  3. DigWF app

  4. Alma

  5. DAMS

  6. Dataverse

  7. ArchivesSpace

Summary of Activity for Metadata Types

In the pre-DLP repository landscape, multiple systems are used to create and manage distinct types of metadata, with the greatest variety occurring in the management of Descriptive metadata. Visual diagrams of current-state metadata flows are available in a separate document.

Descriptive Metadata

Descriptive metadata creation and editing currently occurs both inside and outside the repository context, which can result in duplicative data entry and orphaned copies of records that are out of sync with one another. Descriptive metadata management in some cases occurs on a field-by-field basis: data for some fields within a record is imported from an external source, some is imported but can then be overridden, and some is created/populated only within the repository environment. Systems and data sources currently include Alma, The Keep, Emory Finding Aids, DAMS, ETDs, OpenEmory (including PubMed, Emory FIRST), Emory Shared Data, and Dataverse.

[Figure: Snapshot of current state creation environments for Descriptive metadata]
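The field-by-field pattern described in this section can be illustrated with a small merge routine. This is a minimal sketch under stated assumptions, not a description of any existing Emory system: the field names, the three policy labels, and the `merge_record` function are all hypothetical.

```python
from enum import Enum


class FieldPolicy(Enum):
    IMPORTED = "imported"        # value always comes from the external source
    OVERRIDABLE = "overridable"  # imported, but a local edit wins if present
    LOCAL_ONLY = "local_only"    # created/populated only in the repository


# Hypothetical field-level policies for a single workflow (illustration only).
POLICIES = {
    "title": FieldPolicy.IMPORTED,
    "abstract": FieldPolicy.OVERRIDABLE,
    "staff_note": FieldPolicy.LOCAL_ONLY,
}


def merge_record(external: dict, local: dict, policies: dict = POLICIES) -> dict:
    """Combine an external record with local repository edits, field by field."""
    merged = {}
    for field, policy in policies.items():
        if policy is FieldPolicy.IMPORTED:
            value = external.get(field)
        elif policy is FieldPolicy.OVERRIDABLE:
            # Prefer a non-empty local override; otherwise use the imported value.
            value = local.get(field) or external.get(field)
        else:  # LOCAL_ONLY
            value = local.get(field)
        if value is not None:
            merged[field] = value
    return merged


external = {"title": "Oral history interview", "abstract": "Imported abstract"}
local = {
    "title": "Locally retyped title",
    "abstract": "Edited abstract",
    "staff_note": "Reviewed at ingest",
}
print(merge_record(external, local))
```

In this sketch the locally retyped title is discarded because the field is import-only, which is one way to avoid the out-of-sync copies noted above.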

Preservation Metadata

Preservation metadata, such as PREMIS events/audit trails and other digital object properties, is created and managed exclusively by the current state Fedora 3 applications for which it is enabled: The Keep, OpenEmory, and ETDs. For legacy ETDs, PREMIS events are generated; in the new ETD application, minimal audits and logs are tracked. This type of metadata records preservation activities performed on digital objects, particularly when those activities are performed automatically by an application. Note: while the DAMS has placeholder fields for recording checksums, the system itself does not perform active preservation actions, so it is not considered to record this category of metadata.

Source Metadata

Source metadata, which records information about the original source material from which a digital surrogate is derived, is currently recorded in multiple locations (The Keep, the DAMS, Digitized Books/Alma). Because this metadata focuses on specialized digitization or reformatting activity, it is not recorded for self-deposit workflows such as ETDs, OpenEmory, or Dataverse, in which digital files are provided directly by the content creators. Source metadata is generally static by nature, in that it should not require changes over time unless an object is re-digitized or re-processed for the repository. In those cases, new source metadata would be created to describe the production of the new surrogate.

Technical/Characterization Metadata

The creation of Technical metadata, which describes technical characteristics of individual digital files, currently occurs in multiple applications, both outside the repository (DAMS and/or other standalone tools) and directly integrated with the repository (The Keep). This type of metadata is extracted automatically from a digital file via specialized software and is not intended to be manually edited, though it may periodically be re-generated or refreshed. In some cases, this metadata is stored as system-actionable metadata; in other cases it is treated as a binary file for downloading (not integrated/indexed within the repository application’s display interface). Use of this type of metadata varies across workflows: in some instances robust characterization metadata is provided (DAMS, The Keep); in others, only minimal digital file properties are identified, such as mimetype, filesize, and filename (ETDs, Dataverse, OpenEmory).
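As an illustration of the minimal end of this range, the following sketch extracts basic file properties using only the Python standard library. It is not the tooling used by any of the systems named above. The `basic_file_properties` function is hypothetical, the mimetype is guessed from the file extension rather than from file contents, and the SHA-256 checksum is an added assumption rather than a property listed in this document.

```python
import hashlib
import mimetypes
import os
import tempfile


def basic_file_properties(path: str) -> dict:
    """Extract minimal file properties automatically; nothing is hand-edited."""
    mimetype, _ = mimetypes.guess_type(path)  # based on the extension only
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)
    return {
        "filename": os.path.basename(path),
        "filesize": os.path.getsize(path),
        "mimetype": mimetype or "application/octet-stream",
        "sha256": sha256.hexdigest(),  # assumed addition, for fixity checks
    }


# Usage: characterize a throwaway file created for the demonstration.
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "sample.txt")
    with open(sample, "wb") as handle:
        handle.write(b"hello")
    print(basic_file_properties(sample))
```

Because the properties are derived entirely from the file, re-running the function is equivalent to the periodic re-generation or refresh described above.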

Administrative/Workflow Metadata

This metadata typically is created and managed relative to a specific application and workflow, and primarily supports staff needs, such as internal notes related to ingest or information about the metadata itself. It does not typically display to end users, nor is it typically migrated beyond its originating application, unless the administrative data is significant for long term preservation activities. This type of metadata is currently utilized in all of the in-scope systems identified in the DLP Discovery Phase.

Rights Metadata

For the DLP Discovery Phase, rights metadata has been delineated as information recording the current rights status of a digital object (such as copyright, license, or other status), information supporting rights determination activities, and information recording the rights for which repository activities are conducted. These different components of rights metadata are variably treated as Descriptive metadata, Administrative metadata, or Preservation metadata. Rights metadata is currently recorded in all repository-related systems.  

Structural Metadata

This metadata records information about the internal structure of a digital object and how its component parts such as attached files relate to each other, as well as how repository assets relate to each other within the preservation repository (i.e. relationships of objects to collections and other objects stored in the repository). Current examples of this include METS/ALTO files generated by LIMB software for digitized books, and PCDM (Portland Common Data Model) relationship metadata automatically generated by the repository for the new ETDs application. Note: the Fedora 3 repository environment currently records RELS-EXT metadata which tracks external relationships between objects and collections.
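The relationship metadata described here can be pictured as a set of subject-predicate-object statements. The sketch below uses the PCDM `hasMember` and `hasFile` predicates; the `structural_triples` helper and the example.org identifiers are hypothetical, and the repository's actual serialization is not specified in this document.

```python
PCDM = "http://pcdm.org/models#"  # Portland Common Data Model namespace


def structural_triples(collection: str, obj: str, files: list) -> list:
    """Return (subject, predicate, object) triples for one repository object."""
    # The collection aggregates the object...
    triples = [(collection, PCDM + "hasMember", obj)]
    # ...and the object is linked to each of its attached files.
    triples += [(obj, PCDM + "hasFile", file_uri) for file_uri in files]
    return triples


# Hypothetical identifiers, for illustration only.
triples = structural_triples(
    "https://example.org/collections/etds",
    "https://example.org/objects/thesis-1",
    [
        "https://example.org/files/thesis-1.pdf",
        "https://example.org/files/thesis-1-data.csv",
    ],
)
for triple in triples:
    print(triple)
```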

Recommendations for Future State

Systems of Record

For the long term, the M-IWG recommends formally designating selected metadata environments as canonical systems of record to supply metadata to the repository for specific purposes. A given system of record will hold the authoritative/canonical version of the data, and is characterized as providing metadata creation, editing, and standards-friendly data export capabilities. Systems of record are often used by more than one Library or Emory business unit.

By designating selected systems as canonical, we can prioritize building fewer system integrations and editing tools overall, reduce redundant data entry required by staff, and enhance the accuracy of metadata displayed in the repository. While the DLP repository environment will provide metadata editing capabilities as needed to support self-deposit workflows and repository-related administrative and preservation workflow information, external systems of record for Descriptive metadata will generally provide more optimized functionality for their particular metadata domain.

When the DLP repository itself is identified as the system of record, it is also important to note that in some cases, metadata will be assigned using URIs that point to content or authorities’ data which is externally hosted (e.g. RightsStatements.org, Creative Commons license definitions, or name authorities). In these situations, the repository will serve as the point at which references to that data are assigned and recorded, but it will not archive those external sources in their entirety.
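The distinction between recording a reference and archiving the referenced source can be shown with a small data structure. This sketch assumes a hypothetical `RightsAssignment` record and object identifier. The URI shown is the RightsStatements.org "In Copyright" statement.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class RightsAssignment:
    """A repository-side record that stores only a reference to external data."""

    object_id: str   # hypothetical repository identifier
    rights_uri: str  # points at the externally hosted statement

    def authority_host(self) -> str:
        """Return the host that actually maintains the referenced statement."""
        return urlparse(self.rights_uri).netloc


assignment = RightsAssignment(
    object_id="example-object-001",
    rights_uri="http://rightsstatements.org/vocab/InC/1.0/",
)
print(assignment.authority_host())
```

The repository is canonical for the assignment itself (which object carries which URI), while the text and definition of the statement remain with the external authority.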

Based on the complexity of our current state practices and platforms, multiple systems of record will continue to be required to address business needs. Descriptive metadata is a particular area of focus for long-term consideration, given the variety of current state activities and the general nature of Descriptive metadata to be more frequently updated than other types.

As of March 2018, the ArchivesSpace migration project is beginning its initial phase, and long term management of archival finding aids metadata is not yet finalized, nor are its capabilities with regard to systems of record fully explored and understood. Larger organizational decisions regarding the creation and management of item-level metadata for repository objects will also impact future usage of finding aids metadata as well as the DAMS. Additionally, the Symplectic Elements software which supplies metadata to OpenEmory does not yet provide support for depositing into Fedora 4, which will impact DLP migration and management strategies for metadata in that workflow.

Due to these dependencies, M-IWG has been unable to pursue detailed, individual system-level recommendations, but in the sections that follow, has outlined broader goals and recommendations for future work, some of which may extend beyond the scope of the Digital Library Program. M-IWG has also incorporated work produced by the Digital Preservation Functional Requirements Group and Digital Collections Steering Committee’s Policy Task Force regarding metadata requirements for long term preservation.

Canonical Environments: DLP Repository-Relative Context

The following table indicates recommendations for internal or external metadata management relative to the DLP repository itself, for major categories of repository metadata:

| Metadata Type | Canonical System for Creating/Editing | DLP Repository Context |
|---|---|---|
| Descriptive Metadata | DLP Repository; Multiple external | Canonical (for self-deposit workflows) [1]; Receives Actionable Data for Preservation |
| Preservation Events | DLP Repository | Canonical (for events generated in the DLP repository) [2] |
| Technical Metadata | DLP Repository | Canonical |
| Administrative Metadata | DLP Repository | Canonical (for DLP-managed workflows) |
| Structural Metadata | DLP Repository | Canonical for DLP-managed object structures (PCDM); Receives Static Copy for Preservation [3] |
| Source Metadata | N/A [4] | Receives Static Copy for Preservation [4] |
| Rights Metadata | DLP Repository [5] | Canonical for Repository Digital Objects |

[1] In self-deposit workflows where the content creator is supplying original metadata, the DLP Repository may serve as the canonical system for Descriptive metadata.

[2] Legacy and additional supplemental preservation metadata may be deposited as supplemental files.

[3] Structural metadata may also be deposited as supplemental files.

[4] Based on recommendations from the Digital Preservation FRG, Source metadata will be ingested as static supplemental files. These may be prepared by a variety of local methods not subject to large-scale management.

[5] Rights metadata may exist separately for physical objects; the repository itself is canonical for digital surrogates.

Canonical Systems: Near-Term Recommendations

While longer-term management strategies are determined, the migration of metadata into the repository will need to occur via more static methods for the near-term. The following options may be pursued for feasibility in the DLP implementation and migration phases:

  1. Explore integrations for established Library-wide metadata systems (e.g. Alma for digitized books).

  2. Prepare short-term, static ingest methods for staff-created metadata that does not have a system of record identified, while building placeholders to enable future harvest and synchronization.

  3. Prepare self-deposit tools for approved workflows (e.g. student/researcher submissions); extend to other campus submitter scenarios as approved.

  4. Record any staff-created metadata to be stored as Supplemental Files (e.g. Source Metadata) in a non-proprietary machine-readable structure such as XML or CSV whenever possible, in case the data may ever need to be extracted in the future.  
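As one illustration of option 4, Source metadata could be written to CSV with standard tooling so that it remains extractable later. The sketch below is an assumption-laden example: the column names in `FIELDS`, the `source_metadata_csv` function, and the sample values are hypothetical and do not reflect an approved local schema.

```python
import csv
import io

# Hypothetical column set for Source metadata (illustration only).
FIELDS = ["object_id", "source_format", "digitization_equipment", "operator_note"]


def source_metadata_csv(rows: list) -> str:
    """Serialize Source metadata rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "object_id": "example-object-001",
        "source_format": "open reel audio tape",
        "digitization_equipment": "example tape deck",
        "operator_note": "Side A only, per work order",
    }
]
text = source_metadata_csv(rows)
print(text)

# Reading the file back recovers the same values, showing it is extractable.
recovered = list(csv.DictReader(io.StringIO(text)))
```

The `csv` module quotes values containing commas automatically, so free-text notes survive the round trip.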

Canonical Systems: Long Term Recommendations

As the M-IWG sunsets in 2018, these longer-term recommendations may be pursued through future activities in appropriate chartered groups such as the Metadata Strategies Group or other Emory Libraries initiatives such as the Discovery Task Force.

M-IWG proposes the following long term recommendations for repository-related metadata management:

  1. Canonical systems for Descriptive metadata that supply the repository should be identified, utilized, and integrated with the repository for all major deposit workflows.

  2. When canonical systems are designated and integrated, the following best practices should be implemented:

    1. Metadata creation and editing should occur only in designated canonical systems/tools to remove redundant effort and prevent unsynchronized copies. Multiple working copies of metadata should not be created unless there is a critical business need to do so.

    2. For canonical systems external to the DLP repository, copies of metadata should be harvested and regularly refreshed.

    3. Identifiers will be a critical component for management across environments: when importing and exporting metadata, the DLP repository must record any canonical system identifiers for a given workflow to keep data in sync.

    4. De-duplication strategies across multiple systems of record must be identified as our larger discovery strategy is developed and the new repository begins to disseminate content (e.g. feeds of data into Primo).

    5. Strategies for enrichment/augmentation should be identified for repository workflows, to determine if repository-contextual metadata enrichments should be synchronized back to their originating source.
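The harvest, refresh, and identifier practices above can be sketched together as a single upsert routine. This is a minimal sketch, not a proposed implementation: the in-memory `repository` dictionary, the `refresh_from_canonical` function, and the record layout are hypothetical stand-ins for whatever storage and harvest format the repository eventually uses.

```python
def refresh_from_canonical(repository: dict, harvested: list, system: str) -> dict:
    """Refresh repository copies from a canonical system.

    Copies are keyed by (system name, canonical identifier), so a repeated
    harvest updates the existing copy instead of creating a duplicate.
    """
    summary = {"created": 0, "updated": 0, "unchanged": 0}
    for record in harvested:
        key = (system, record["canonical_id"])
        existing = repository.get(key)
        if existing is None:
            summary["created"] += 1
        elif existing == record["metadata"]:
            summary["unchanged"] += 1
            continue  # nothing to write
        else:
            summary["updated"] += 1
        repository[key] = dict(record["metadata"])  # store a copy
    return summary


repository = {}
first_harvest = [
    {"canonical_id": "a1", "metadata": {"title": "First title"}},
    {"canonical_id": "b2", "metadata": {"title": "Second title"}},
]
print(refresh_from_canonical(repository, first_harvest, "alma"))

# A later harvest in which only one record changed in the canonical system.
second_harvest = [
    {"canonical_id": "a1", "metadata": {"title": "First title (revised)"}},
    {"canonical_id": "b2", "metadata": {"title": "Second title"}},
]
print(refresh_from_canonical(repository, second_harvest, "alma"))
```

Including the system name in the key is a design choice made for this sketch: it keeps identifiers from different canonical systems from colliding, which is a precondition for the cross-system de-duplication noted in item 4.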

Identification of Non-canonical Systems

To clarify the repository’s relationship to the broader metadata landscape at Emory and beyond: additional platforms may harvest, read, repackage, or display metadata from the future DLP repository, but will not directly create, modify, or delete the values of the repository’s data. These systems are subject to change over time, but currently include:

  • Primo

  • LUNA

  • SharedShelf/JSTOR Forum

  • Readux

  • Additional upstream external metadata providers that interact with Alma, but are not anticipated to directly integrate with the repository (e.g. OCLC)

  • Other repositories/platforms that the Libraries disseminate metadata to, such as Internet Archive, HathiTrust, Georgia Knowledge Repository, Digital Library of Georgia, DPLA. Formal agreements regarding external dissemination will be guided by new policy proposed in 2018 by the Digital Collections Steering Committee.