Preparing Digitized Book Collections for Ingest

Note: this documentation is still under development; additional sections are forthcoming.

Overview

The book import process includes the following steps, some of which will require assistance from the LTDS team:

  1. Preparation of pull-list spreadsheet with metadata and file paths per volume

  2. Export of Alma records for all books/serials in the collection

  3. Preparation of Collection-level metadata spreadsheet

  4. File transfer of all needed files, using the directory structure recorded in the pull-list filepaths

  5. Curate bulk import process

Metadata Preparation

Digitized books draw metadata from two sources: the original pull-list spreadsheet used for digitization reviews, and Alma catalog records.

 The table below lists Pull-list metadata fields/columns which are required for ingest into the repository.

 The following are also required in books and serials’ Alma records for ingest into the repository:

  • Title

  • Date Issued or Date Created

Reformatting Pull-List Spreadsheets for Curate Ingest

The following spreadsheet template shows the required formatting for a Curate-ready pull-list. While the pull-lists prepared during the digitization and review process may vary, the following spreadsheet columns are required for Curate's bulk import method. For information about metadata requirements, see the Cor Metadata Field Usage documentation.

Note: additional metadata is also extracted from Alma/MARC catalog records; the following fields are recommended for the pull-list itself.

* Required pull-list fields are indicated with an asterisk.

Some file-path-related column headings may vary in the original pull-list depending on how it was created.

Pull-list Heading(s)

Heading for Importer

Explanation

Item Number

Item ID

A numeric ID for each individual work in the spreadsheet (e.g. the original row number). Recommended for cross-referencing across pull-list versions later.

N/A

source_collection_id

This will be populated by the ingest team once the Collection has been provisioned in Curate.

N/A

Non-unique Title

Indicate "Yes" if the title is known to have multiple copies, editions, or child volumes: this helps the ingest team create parent-child works later.

N/A

deduplication_key*

A unique ID for each individual volume in the collection; typically an ARK or barcode number. This will be added by the ingest team.

OCLC Number, Barcode, DigWF ID

other_identifiers

Concatenated list of other local identifiers, e.g. barcode, DigWF ID, OCLC number. Each identifier should carry a prefix indicating its type, and multiple values should be separated by pipes.

PID

emory_ark

Emory ARK id, if applicable

MMS ID or Alma MMSID

ALMA MMSID*

Alma MMSID for the catalog record from which additional metadata will be extracted during import. This field is required by the importer. See also the system_of_record_ID notes below.

MMS ID or Alma MMSID

system_of_record_ID

Copy of the Alma ID, to be stored as metadata in Curate. The prefix "alma:" should be added to each ID.

Institution

institution*

Name(s) of institutions providing the material, e.g. Emory University

Holding Repository

holding_repository*

Name of Library providing the material

Administrative Unit

administrative_unit

Name of administrative unit within the Library, if applicable

Call Number

CSV Call Number

The call number will be supplied from Alma, but it is useful to have this on the pull-list for reference. 

Enumeration

Enumeration

Volume-level enumeration, if applicable (e.g. Volume 1, Copy 1, Edition etc.)

CSV Title

CSV Title

Title will be supplied from Alma, but it is useful to have this on the pull-list for reference. 

Content Type

content_type*

Supplied as URI. Recommended value: http://id.loc.gov/vocabulary/resourceTypes/txt

Rights - Public Note/MARC 590 Field

emory_rights_statements*

The Emory Libraries supplied rights statement

Rights - Internal Note

internal_rights_note

Additional internal rights notes or documentation

Desc - RightsStatement.org Designation (URI)

rights_statement*

Supplied as a URI from RightsStatements.org values, e.g. http://rightsstatements.org/vocab/NoC-US/1.0/

Visibility

visibility*

See available access controls (Public, Public Low View, Emory Low Download, Rose High View, Private)

Data Classification

data_classifications*

Emory defined data classification type: Public, Confidential, Internal, Restricted

Sensitive/Objectionable Material

sensitive_material

Indicate "Yes" if the volume contains sensitive material

Sensitive/Objectionable Material Note

sensitive_material_note

Provide additional context for any sensitive material determination

Transfer Engineer

transfer_engineer

The name of the digitization technician

Barcode

Barcode*

This is used to generate certain volume-level filenames. The barcode number should also be added to other_identifiers with the prefix "barcode:"

Base Path

Base_Path*

The base directory path where content files are stored on the server

Mbytes or MB Size

MBytes*

The overall file size for all content files in the work

pdf_path or PDF Path or PDF_Path

PDF_Path**

The base directory path for volume-level PDF file for the work

PDF Count or PDF_Cnt

PDF_Cnt**

The count of PDF files to be imported

XML Path, xml_path, OCR Path, ocr_path

OCR_Path**

The base directory path for volume-level OCR file for the work 

OCR Count, OCR_Cnt, XML Count

OCR_Cnt**

The count of volume-level OCR files to be imported

Images Path, TIFF Path, Disp_Path

Disp_Path*

Directory containing the page level image files (TIFFs) > Primary Content: Preservation Master File

TIF Count, Images Count, Disp_Cnt

Disp_Cnt*

The count of page-level image files to be imported

Txt Path, Text Path, Txt_Path

Txt_Path**

Directory containing the page level plain text files > Primary Content: Transcript File

Txt Count, Txt_Cnt, Text Count

Txt_Cnt**

The count of page-level text files to be imported

POS Path, POS_Path

POS_Path**

For Kirtas outputs: directory containing the page level POS files > Primary Content: Extracted Text File

POS Count, POS_Cnt

POS_Cnt**

For Kirtas outputs: count of page level POS files to be imported

ALTO Path, ALTO_Path

ALTO_Path**

For LIMB outputs: directory containing the page level Alto XML files > Primary Content: Extracted Text File 

ALTO Count, ALTO_Cnt

ALTO_Cnt**

For LIMB outputs: count of page-level ALTO xml files to be imported

METS Path, METS_Path

METS_Path**

For LIMB outputs: directory for volume-level METS file to be imported

METS Count, METS_Cnt

METS_Cnt**

For LIMB outputs: count of volume-level METS files to be imported

Rights - Digitization Basis

Accession.workflow_rights_basis

Rights basis determination (e.g. Public Domain) for digitization

Rights Access Basis - Review Date

Accession.workflow_rights_basis_date

Date of rights review (EDTF format)

N/A

Accession.workflow_rights_basis_reviewer

Name of individual or office performing rights review

Rights Access Basis - Note

Accession.workflow_rights_basis_note

Rights-related notes about digitization/preservation

Rights - Digitization Basis - Note

Accession.workflow_notes

General notes about digitization/preservation or acquisition

N/A

Ingest.workflow_rights_basis

Rights basis determination (e.g. Public Domain) for ingest and access level

N/A

Ingest.workflow_rights_basis_date

Date of rights review (EDTF format)

N/A

Ingest.workflow_rights_basis_reviewer

Name of individual or office performing rights review

N/A

Ingest.workflow_rights_basis_note

Rights-related notes about ingest or migration

Ingest/Migration Event Note

Ingest.workflow_notes

General notes about ingest or migration, e.g. Migrated to Cor repository from LSDI Kirtas workflow during Phase 1 Migrations, 2019

** Required for import, depending on digitization output.
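
The heading and identifier rules in the table above can be sketched in Python. This is a minimal illustration, not the Curate importer's own code: the function names are hypothetical, the alias map covers only a few of the path and count columns, and the "oclc:" and "digwf:" prefixes are assumptions (the table itself only specifies "barcode:" and "alma:").

```python
# Variant pull-list headings (taken from the table above) mapped to the
# heading the importer expects. Only a subset is shown; extend as needed.
HEADING_ALIASES = {
    "Mbytes": "MBytes",
    "MB Size": "MBytes",
    "pdf_path": "PDF_Path",
    "PDF Path": "PDF_Path",
    "PDF Count": "PDF_Cnt",
    "XML Path": "OCR_Path",
    "xml_path": "OCR_Path",
    "OCR Path": "OCR_Path",
    "Images Path": "Disp_Path",
    "TIFF Path": "Disp_Path",
    "TIF Count": "Disp_Cnt",
    "Images Count": "Disp_Cnt",
}


def normalize_headings(fieldnames):
    """Rename known variant headings; leave unrecognized headings unchanged."""
    return [HEADING_ALIASES.get(name.strip(), name.strip()) for name in fieldnames]


def build_other_identifiers(barcode="", oclc="", digwf=""):
    """Join typed identifiers with pipes, skipping empty values.

    The "oclc" and "digwf" prefixes are assumed for illustration.
    """
    parts = [("barcode", barcode), ("oclc", oclc), ("digwf", digwf)]
    return "|".join(
        f"{prefix}:{value.strip()}" for prefix, value in parts if value.strip()
    )


def system_of_record_id(mms_id):
    """Add the "alma:" prefix to an Alma MMS ID unless it is already present."""
    mms_id = mms_id.strip()
    return mms_id if mms_id.startswith("alma:") else f"alma:{mms_id}"
```

For example, normalize_headings(["TIFF Path", "Barcode"]) leaves "Barcode" as it is and renames the first column to "Disp_Path".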

Additional Preparation Steps

It is strongly recommended to sort the pull-list CSV by the title column prior to submitting it for ingest. This helps the repository ingest team to identify multiple editions of the same work as well as parent-child relationships.

If a Collection is being split into multiple pull-lists, please indicate "Yes" in the Non-unique Title column for any title known to have other copies, editions, or child volumes, so that the repository ingest team is aware of these related volumes in the future.
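
The two preparation steps above can be sketched as follows. This is an illustrative helper only: the function name is hypothetical, rows are assumed to be dictionaries such as those produced by csv.DictReader, and the title column is assumed to be "CSV Title". Exact string matching only catches identical titles, so flagged rows still need human review.

```python
from collections import Counter


def sort_and_flag(rows, title_column="CSV Title"):
    """Sort rows by title and mark titles that occur more than once."""

    def title_key(row):
        # Compare titles case-insensitively and ignore stray whitespace.
        return row.get(title_column, "").strip().casefold()

    counts = Counter(title_key(row) for row in rows)
    result = []
    for row in sorted(rows, key=title_key):
        flagged = dict(row)  # copy so the caller's rows are not modified
        if counts[title_key(row)] > 1:
            flagged["Non-unique Title"] = "Yes"
        result.append(flagged)
    return result
```

Any "Yes" value already entered by hand is preserved, since the helper only ever adds the flag and never clears it.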

Information about Metadata Extracted from Alma Records

As noted above, the pull-list provides certain metadata for the repository, but additional fields are extracted from MARC records exported from Alma:

  • conference_name

  • contributors

  • copyright_date

  • creator

  • date_created

  • date_digitized

  • date_issued

  • edition

  • extent

  • content_genres

  • local_call_number

  • place_of_production

  • primary_language

  • publisher

  • series_title

  • subject_geo

  • subject_names

  • subject_topics

  • table_of_contents

  • title

  • uniform_title

For more information about MARC mappings and field reformatting, see the MARC to Cor mapping worksheet.

Filename Conventions for Bulk Import

The Curate bulk-import process is optimized to work with the following filename conventions in use within digitized book collections. If your collection's files use a different convention, please contact LTDS for support.

Volume-Level Files

The Curate book import preprocessor makes the following assumptions:

  • For Kirtas outputs, the preprocessor expects a filename to be supplied in the CSV, with "Output" as the base filename for the volume-level PDF and OCR files:

    • Output.pdf

    • Output.xml

  • For LIMB outputs, no explicit filename is supplied in the CSV; instead, the preprocessor generates the filenames for the volume-level PDF and METS files from the volume's barcode number:

    • [Barcode#].pdf

    • [Barcode#].mets.xml
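
The volume-level conventions above can be expressed as a small lookup. This is a sketch under stated assumptions rather than the preprocessor's actual logic: the function name and the "kirtas"/"limb" labels are hypothetical, and the Kirtas names are hard-coded here even though the real workflow reads them from the CSV.

```python
def volume_level_filenames(output_type, barcode=""):
    """Return the expected volume-level filenames for one volume."""
    kind = output_type.strip().lower()
    if kind == "kirtas":
        # Kirtas volumes use the fixed base filename "Output".
        return {"pdf": "Output.pdf", "ocr": "Output.xml"}
    if kind == "limb":
        barcode = barcode.strip()
        if not barcode:
            raise ValueError("LIMB volumes need a barcode to build filenames")
        # LIMB filenames are derived from the volume's barcode number.
        return {"pdf": f"{barcode}.pdf", "mets": f"{barcode}.mets.xml"}
    raise ValueError(f"Unknown digitization output type: {output_type!r}")
```

A check like this can be run against each pull-list row before file transfer to confirm that the expected files exist in the PDF, OCR, or METS directories.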

Page-Level Files

While file naming practices may vary, it is strongly recommended that all filenames contain or end with a numeric part sequence, such as "0001.tif". The Curate book import preprocessor makes the following assumptions about page-level files:

  • Kirtas filenames have 4 digits using 0 as padding (0001.tif, 0085.tif, etc.)

  • LIMB filenames have 8 digits using 0 as padding (00000001.tif, 00000085.tif, etc.)

Some file sequences start with zero, some with one. This should be identified as part of the collection preparation process.

Works whose filename sequences include an additional prefix such as an OCLC number should also be identified as part of the collection preparation process.
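
The page-level checks described above (padding width, starting number, extra prefixes) could be automated along these lines. This is a hypothetical helper, not part of Curate: the function name, the report keys, and the "kirtas"/"limb" labels are assumptions made for illustration.

```python
import re

# Expected zero-padded width of the page number for each output type.
EXPECTED_DIGITS = {"kirtas": 4, "limb": 8}

# Optional prefix, then the trailing run of digits, then a file extension.
TRAILING_NUMBER = re.compile(r"^(?P<prefix>.*?)(?P<number>\d+)\.(?P<ext>[^.]+)$")


def inspect_page_files(filenames, output_type):
    """Summarize a directory listing of page-level files for review."""
    width = EXPECTED_DIGITS[output_type.strip().lower()]
    numbers, prefixes, unexpected = [], set(), []
    for name in filenames:
        match = TRAILING_NUMBER.match(name)
        if match is None or len(match.group("number")) != width:
            # No numeric sequence, or the padding width does not match.
            unexpected.append(name)
            continue
        numbers.append(int(match.group("number")))
        if match.group("prefix"):
            prefixes.add(match.group("prefix"))
    return {
        "starts_at": min(numbers) if numbers else None,
        "prefixes": sorted(prefixes),
        "unexpected": unexpected,
    }
```

The "starts_at" value shows whether a sequence begins at zero or one, "prefixes" lists any additional filename prefixes such as an OCLC number, and "unexpected" collects files that need manual attention before ingest.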