Setting up a Repository for Harvest
Instructions for new hubs of the Mountain West Digital Library
The Mountain West Digital Library welcomes new hosting hubs to the network of digital collections about our region. These guidelines will help get you started. Please contact MWDL director Kinza Masood at firstname.lastname@example.org for more information or to request to join MWDL.
Offering Your Repository for Harvest
For information about checking your repository's OAI stream, see Open Archives Initiative (OAI) Queries.
What we need to know
- The baseURL of the repository's OAI provider, including the port, if specified.
- The metadata format of the records, something like "oai_dc", "qdc", or "oai_qdc". We prefer a format that validates against the dublincore.org schema for Qualified Dublin Core. See http://dublincore.org/schemas/xmls/ for more information about the currently recommended Dublin Core XML schemas.
- The setSpec and full name of each collection you wish to have harvested, along with the full name of the collection partner that manages each set (if you are hosting collections for other collection partners). We suggest you start with a pilot group of 3-5 collections that are representative of the range of kinds of records you have in the repository. Once we harvest the pilot collections successfully, we can move on to include the rest of the collections you wish to share.
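Most of the details listed above can be read directly from the OAI provider itself. The following is a minimal sketch, not an MWDL tool, that builds the standard OAI-PMH request URLs for checking a repository. The baseURL shown is a hypothetical placeholder and the helper name `oai_request_url` is an illustration only; substitute your own values.

```python
from urllib.parse import urlencode

# Hypothetical baseURL (including a port) used for illustration only.
BASE_URL = "https://repository.example.org:8080/oai"


def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL from a baseURL, a verb, and optional arguments."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


# Identify confirms the provider responds at this baseURL.
print(oai_request_url(BASE_URL, "Identify"))
# ListMetadataFormats shows which prefixes (e.g. "oai_dc", "oai_qdc") are offered.
print(oai_request_url(BASE_URL, "ListMetadataFormats"))
# ListSets shows the setSpec and setName of every collection.
print(oai_request_url(BASE_URL, "ListSets"))
```

Opening each printed URL in a browser shows the XML response, which should contain the metadata prefixes and the setSpec values to send to us.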
How we use OAI to Harvest
- Our harvesting system in Ex Libris Primo sends a standard "Identify" request first to verify that the OAI repository is functioning.
- Then it sends a "ListRecords" request with "from" and "until" parameters to obtain the first batch of metadata records from the repository.
- Additional "ListRecords" requests with the appropriate "resumptionToken" parameters are sent as needed to get the full listing of records.
- We run a number of normalization routines on the harvested records to transform the Dublin Core metadata into Primo normalized XML.
- Normally, we do a full initial harvest only once and then do incremental harvests weekly, using the "from" and "until" parameters on the weekly "ListRecords" request.
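The request sequence described above can be sketched as follows. This is an illustration of the OAI-PMH flow only, not the Primo implementation; the function names `harvest` and `http_get` and the injectable `fetch` parameter are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Return the response body for a URL as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, from_date=None, until_date=None, fetch=http_get):
    """Yield <record> elements, following resumptionTokens until the list is complete."""
    # Step 1: an Identify request checks that the provider returns well-formed XML.
    ET.fromstring(fetch(base_url + "?" + urlencode({"verb": "Identify"})))

    # Step 2: the first ListRecords request carries the prefix and the date range.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date

    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        for record in root.iter(OAI_NS + "record"):
            yield record
        # Step 3: a non-empty resumptionToken means more batches remain.
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

An incremental harvest would call the same function with `from_date` set to the date of the previous harvest, so that only new, changed, or deleted records are returned.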
IP Addresses and System Administration
If your systems administrators keep whitelists for access to your OAI provider, the address they will need to add so that we can harvest your collections is 18.104.22.168.
Providing Metadata using the Open Archives Initiative Protocol
MWDL can harvest any OAI-compliant stream. The requirements for the OAI protocol for metadata harvesting are spelled out at http://openarchives.org.
A few Notes
- We recommend that you use a digital assets management system that includes built-in OAI metadata provision that is easy to configure.
- In CONTENTdm, on the server configuration tab, ensure that the "Enable OAI" setting is set to "Yes." Ensure that the "Enable compound object pages" setting is set to "No." We want to harvest your metadata at the object level only.
- Our default metadata format for harvest is Qualified Dublin Core (often, "oai_qdc"). We can accept simple Dublin Core ("oai_dc") if need be, although it sacrifices a lot of the metadata complexity. We can share only what your OAI stream shares for harvest. Please ensure your metadata format validates against a schema from the Dublin Core Metadata Initiative, not a proprietary vendor schema. See http://dublincore.org/schemas/xmls/ for more information about the currently recommended Dublin Core XML schemas.
- If you have a repository that does not have built-in OAI metadata provision, please implement one of the many open-source or low-cost OAI provider tools. We strongly advise against creating your own OAI provider module. The OAI protocol seems simple and straightforward, but it has multiple functions that must be implemented precisely, and it is more complicated and time-consuming to program and test than it initially appears. Please understand that we do not have time to assist in testing of "home-grown" OAI providers.
- Take advantage of the OAI sets implementation to separate the different collections. MWDL can harvest separate sets, using the setSpec assigned to each item to separate the collections. The setSpec should be reflected in the OAI identifier, so that we can retrieve the setSpec from there. Typically, we harvest all records and tag only certain sets for display, i.e., the sets that you submit to us. If you do not implement sets, we have no option other than to harvest your entire repository and represent it as one collection.
- We recommend that you implement OAI deleted record status. This is not required of OAI repositories but, without it, we have no way to remove from our harvester the records that you delete locally, except by a full delete-and-reload of your entire repository, which we prefer not to do often.
- While any system for assigning unique identifiers is acceptable with the OAI protocol, we recommend you generate a meaningful OAI identifier that is related to the setSpec and item number in your digital assets management repository. An example of such an identifier is "oai:images.archives.gov:ead/6", where "images.archives.gov" is the domain, "ead" is the setSpec, and "6" is the item number. This makes it easy for our harvester to identify the collection each item belongs to and to create links to the items in your repository without having to resort to the <dc:identifier> field.
- Please test your OAI provider to ensure it conforms to the OAI protocol before offering it for harvest, and fix any issues identified by these tools:
- Validator at the Open Archives Initiative site at https://www.openarchives.org/Register/ValidateSite
- Tests at the OAI Repository Explorer at http://re.cs.uct.ac.za
- There is an oai-implementers Google Group (formerly a listserv) that is a good place to ask questions. Sign up at https://groups.google.com/forum/#!forum/oai-pmh.
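To illustrate why the identifier layout recommended in the notes above is convenient for a harvester, here is a minimal sketch that splits such an identifier into its parts. The function name `parse_oai_identifier` is hypothetical, and the code assumes the "oai:domain:setSpec/item" layout from the example; identifiers built on another scheme would need different handling.

```python
def parse_oai_identifier(identifier):
    """Split 'oai:<domain>:<setSpec>/<item>' into its domain, setSpec, and item number.

    Assumes the layout recommended in these guidelines; raises ValueError otherwise.
    """
    # Split on the first two colons only, so a setSpec containing colons stays intact.
    scheme, domain, local = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError("not an OAI identifier: " + identifier)
    # The item number follows the last slash; everything before it is the setSpec.
    set_spec, _, item = local.rpartition("/")
    if not set_spec or not item:
        raise ValueError("expected '<setSpec>/<item>' in: " + identifier)
    return {"domain": domain, "set_spec": set_spec, "item": item}


print(parse_oai_identifier("oai:images.archives.gov:ead/6"))
# prints {'domain': 'images.archives.gov', 'set_spec': 'ead', 'item': '6'}
```

With the setSpec recoverable this way, the collection an item belongs to can be determined from the identifier alone.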