Note: this documentation is still under development; additional sections are forthcoming.

Overview

The book import process includes the following steps, some of which will require assistance from the LTDS team:

  1. Preparation of pull-list spreadsheet with metadata and file paths per volume
  2. Export of Alma records for all books/serials in the collection
  3. Preparation of Collection-level metadata spreadsheet
  4. File transfer of all needed files, using the directory structure recorded in the pull-list filepaths
  5. Curate bulk import process

Metadata Preparation

Digitized books utilize metadata from two sources: the original pull-list spreadsheet used for digitization reviews as well as Alma catalog records.

The following Pull-list metadata fields/columns are required for ingest into the repository:

  • Holding Repository
  • System of Record ID (Alma MMSID)
  • Content Type
  • Emory Rights Statement (Rights - Public Note)
  • Rights Statement (Desc - RightsStatement.org Designation (URI))
  • Data Classifications
  • Visibility
  • Institution

     The following metadata fields are also required in books/serials’ Alma records for ingest into the repository:

    • Title
    • Date Issued/Date Created

    Reformatting Pull-List Spreadsheets for Curate Ingest

    The following spreadsheet template shows the required formatting for a Curate-ready pull-list. While the pull-lists prepared during the digitization and review process may vary, the following columns are required for Curate's bulk import method. For information about metadata requirements, see the Cor Metadata Field Usage documentation.

    Note: additional metadata will also be extracted from Alma/MARC catalog records; the following fields are recommended for the pull-list itself.

    * Required pull-list fields are indicated with an asterisk.

    Column Heading | Explanation
    Item ID | A numeric ID for each individual work in the spreadsheet (e.g. the original row number). Recommended for cross-referencing across pull-list versions later.
    deduplication_key* | A unique ID for each individual volume in the collection; typically an ARK or barcode number
    other_identifiers | Concatenated list of other local identifiers, e.g. barcodes, digwf IDs, OCLC, etc. Identifiers should contain a prefix indicating their type, and multiple values should be separated by pipes
    emory_ark | Emory ARK ID, if applicable
    system_of_record_ID* | Alma MMSID
    institution* | Name(s) of institutions providing the material, e.g. Emory University
    holding_repository* | Name of Library providing the material
    administrative_unit | Name of administrative unit within the Library, if applicable
    CSV Call Number | The call number will be supplied from Alma, but it is useful to have this on the pull-list for reference.
    Enumeration | Volume-level enumeration, if applicable (e.g. Volume 1, Copy 1, Edition etc.)
    CSV Title | Title will be supplied from Alma, but it is useful to have this on the pull-list for reference.
    content_type* | Supplied as URI. Recommended value: http://id.loc.gov/vocabulary/resourceTypes/txt
    emory_rights_statements* | The Emory Libraries supplied rights statement
    internal_rights_note | Additional internal rights notes or documentation
    rights_statement* | Supplied as URI from RightsStatements.org values, e.g. http://rightsstatements.org/vocab/NoC-US/1.0/
    visibility* | See available access controls (Public, Public Low View, Emory Low Download, Rose High View, Private)
    data_classifications* | Emory defined data classification type: Public, Confidential, Internal, Restricted
    sensitive_material | Indicate "Yes" if the volume contains sensitive material
    sensitive_material_note | Provide additional context for any sensitive material determination
    transfer_engineer | The name of the digitization technician
    date_digitized | The date of digitization for the volume (EDTF format)
    Barcode* | This is used to generate certain volume-level filenames
    Base_Path* | The base directory path where content files are stored on the server
    MBytes* | The overall file size for all content files in the work
    PDF_Path** | The base directory path for the volume-level PDF file for the work
    PDF_Cnt** | The count of PDF files to be imported
    OCR_Path** | The base directory path for the volume-level OCR file for the work
    OCR_Cnt** | The count of volume-level OCR files to be imported
    Disp_Path* | Directory containing the page-level image files (TIFFs) > Primary Content: Preservation Master File
    Disp_Cnt* | The count of page-level image files to be imported
    Txt_Path** | Directory containing the page-level plain text files > Primary Content: Transcript File
    Txt_Cnt** | The count of page-level text files to be imported
    POS_Path** | For Kirtas outputs: directory containing the page-level POS files > Primary Content: Extracted Text File
    POS_Cnt** | For Kirtas outputs: count of page-level POS files to be imported
    ALTO_Path** | For LIMB outputs: directory containing the page-level ALTO XML files > Primary Content: Extracted Text File
    ALTO_Cnt** | For LIMB outputs: count of page-level ALTO XML files to be imported
    METS_Path** | For LIMB outputs: directory for the volume-level METS file to be imported
    METS_Cnt** | For LIMB outputs: count of volume-level METS files to be imported
    Accession.workflow_rights_basis | Rights basis determination (e.g. Public Domain) for digitization
    Accession.workflow_rights_basis_date | Date of rights review (EDTF format)
    Accession.workflow_rights_basis_reviewer | Name of individual or office performing rights review
    Accession.workflow_rights_basis_note | Rights-related notes about digitization/preservation
    Accession.workflow_notes | General notes about digitization/preservation or acquisition
    Ingest.workflow_rights_basis | Rights basis determination (e.g. Public Domain) for digitization/preservation
    Ingest.workflow_rights_basis_date | Date of rights review (EDTF format)
    Ingest.workflow_rights_basis_reviewer | Name of individual or office performing rights review
    Ingest.workflow_rights_basis_note | Rights-related notes about ingest or migration
    Ingest.workflow_notes | General notes about ingest or migration, e.g. Migrated to Cor repository from LSDI Kirtas workflow during Phase 1 Migrations, 2019

    ** Required for import, depending on digitization output.
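Before handing a pull-list to LTDS, it may help to check that every column marked with a single asterisk is present and filled in. The following is a minimal sketch, not part of Curate itself. It assumes the pull-list has been saved as CSV and that the header row uses the column headings without the asterisk markers; the function name find_pull_list_problems is hypothetical.

```python
import csv
import io

# Column headings marked with a single asterisk in the pull-list template.
# Assumption: the CSV header row omits the asterisk markers themselves.
REQUIRED_COLUMNS = [
    "deduplication_key", "system_of_record_ID", "institution",
    "holding_repository", "content_type", "emory_rights_statements",
    "rights_statement", "visibility", "data_classifications",
    "Barcode", "Base_Path", "MBytes", "Disp_Path", "Disp_Cnt",
]


def find_pull_list_problems(csv_text):
    """Return a list of human-readable problems found in a pull-list CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    # Report required columns that are absent from the header row.
    problems = [f"missing column: {name}"
                for name in REQUIRED_COLUMNS if name not in headers]
    # Data rows start at line 2 because line 1 holds the headings.
    for line_number, row in enumerate(reader, start=2):
        for name in REQUIRED_COLUMNS:
            if name in headers and not (row.get(name) or "").strip():
                problems.append(f"line {line_number}: empty {name}")
    return problems
```

An empty list means no problems were found. The double-asterisk columns are left out on purpose, since which of them apply depends on the digitization output.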

    Filename Conventions for Bulk Import

    The Curate bulk-import process is optimized to work with the following filename conventions in use within digitized book collections. If your collection's files use a different convention, please contact LTDS for support.

    Volume-Level Files

    The Curate book import preprocessor makes the following assumptions:

    • Kirtas outputs use "Output" as the base filename for the volume-level PDF and OCR files:
      • Output.pdf
      • Output.xml
    • LIMB outputs use the barcode number for the volume as the filename for the volume-level PDF and METS files:
      • [Barcode#].pdf
      • [Barcode#].mets.xml
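The volume-level conventions above can be sketched as a small lookup. This is an illustration only, not the preprocessor's actual code. The output_type labels "kirtas" and "limb" and the function name are hypothetical and do not correspond to a pull-list column.

```python
def volume_level_filenames(output_type, barcode):
    """Return the expected volume-level filenames for one volume.

    output_type is a hypothetical label ("kirtas" or "limb");
    barcode is the value from the pull-list Barcode column.
    """
    kind = output_type.strip().lower()
    if kind == "kirtas":
        # Kirtas outputs always use the fixed base name "Output".
        return {"pdf": "Output.pdf", "ocr": "Output.xml"}
    if kind == "limb":
        # LIMB outputs are named after the volume's barcode.
        return {"pdf": f"{barcode}.pdf", "mets": f"{barcode}.mets.xml"}
    raise ValueError(f"unknown digitization output type: {output_type!r}")
```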

    Page-Level Files

    While file naming practices may vary, it is strongly recommended that all filenames contain or end with a numeric part sequence, such as "0001.tif". The Curate book import preprocessor makes the following assumptions about page-level files:

    • Kirtas filenames have 4 digits using 0 as padding (0001.tif, 0085.tif, etc.)
    • LIMB filenames have 8 digits using 0 as padding (00000001.tif, 00000085.tif, etc.)
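As an illustration of the padding rules above, the expected filename for a given page number could be built as follows. This is a sketch under stated assumptions: the "kirtas" and "limb" labels and the function name are hypothetical, and the default extension is taken from the examples in the list.

```python
# Number of zero-padded digits per digitization output, as described above.
PAGE_DIGITS = {"kirtas": 4, "limb": 8}


def page_filename(output_type, page_number, extension="tif"):
    """Build a zero-padded page-level filename for the given output type."""
    width = PAGE_DIGITS[output_type.strip().lower()]
    return f"{page_number:0{width}d}.{extension}"
```

For example, page_filename("kirtas", 85) gives "0085.tif" and page_filename("limb", 85) gives "00000085.tif".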

    Some file sequences start with zero, some with one. This should be identified as part of the collection preparation process.

    Works whose filename sequences include an additional prefix such as an OCLC number should also be identified as part of the collection preparation process.
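One way to spot these cases during collection preparation is to summarize the filenames found in a page-level directory. The sketch below is a hypothetical helper, not part of Curate. It assumes each filename ends with a run of digits followed by a single extension, and it can only detect a prefix that ends in a non-digit character (for example an underscore); a numeric prefix fused directly to the sequence number would be read as part of the number.

```python
import re

# Optional prefix, then a trailing run of digits, then a single extension.
PAGE_NAME = re.compile(r"^(?P<prefix>.*?)(?P<number>\d+)\.(?P<ext>[^.]+)$")


def summarize_page_sequence(filenames):
    """Report the first sequence number, any prefixes, and unparsed names."""
    prefixes = set()
    numbers = []
    unparsed = []
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match is None:
            # No trailing digits before the extension: flag for manual review.
            unparsed.append(name)
            continue
        prefixes.add(match.group("prefix"))
        numbers.append(int(match.group("number")))
    return {
        "starts_at": min(numbers) if numbers else None,
        "prefixes": sorted(p for p in prefixes if p),
        "unparsed": unparsed,
    }
```

A "starts_at" value of 0 or 1 shows where the sequence begins, and a non-empty "prefixes" list marks works that need to be flagged.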


