Technical overview of OAI-PMH

OAI-PMH is a web-based protocol established in 2001 to define a standard way to move metadata from point A to point B via the Internet. It provides rules and a framework for sharing descriptive metadata, both for making metadata available and for acquiring metadata records once they are made available.

Data can be made available in different metadata formats, but to comply with the protocol, at least unqualified Dublin Core must be supported. Requests for records sent via OAI-PMH specify the metadata schema that is required, and if that schema is supported by the responding repository, records will be returned in that schema.

Service providers issue OAI-PMH requests to data providers in order to collect metadata from various repositories and other collections of data. This metadata may then be made available in several ways:

  • by a user-driven web interface;

  • via the OpenSearch protocol;

  • via the Z39.50 protocol; and

  • via OAI-PMH used by other data aggregators.

 

How harvesting works:

  1. The service provider's harvester issues an OAI-PMH request via HTTP to an OAI-PMH-compliant repository.

  2. The repository responds with metadata encoded in XML in the requested metadata format.
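The two steps above can be sketched in Python as follows. This is a minimal illustration rather than a complete harvester: the function names fetch_oai_response and response_verb are hypothetical, and the sketch assumes the repository replies with a well-formed OAI-PMH 2.0 document.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_oai_response(url, timeout=30):
    """Step 1: send the OAI request over HTTP and return the raw XML bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def response_verb(xml_bytes):
    """Step 2: parse the XML reply and report which request type it answers."""
    root = ET.fromstring(xml_bytes)
    request = root.find(OAI_NS + "request")
    return request.get("verb") if request is not None else None
```

A harvester would call fetch_oai_response with a complete request URL and pass the result to a parser such as response_verb before processing the records themselves.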

Records

In OAI-PMH, a record is metadata expressed as structured XML that relates to a single item/object held by the data provider. Each record is uniquely identified by an identifier carried in the record's header, which the repository typically builds from an identifier it already holds for the item; this is separate from descriptive fields such as Dublin Core’s dc:identifier. For further details see the OAI-PMH schema for validating Dublin Core.
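To make the structure of a record concrete, the sketch below parses a made-up record and pulls out a few fields. The sample XML, the identifier values and the function name summarise_record are illustrative assumptions; only the element names and namespaces follow OAI-PMH 2.0 and unqualified Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:the-repository.com:1234</identifier>
    <datestamp>2005-03-17</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example thesis</dc:title>
      <dc:identifier>http://the-repository.com/items/1234</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarise_record(record_xml):
    """Return the header identifier, datestamp and selected Dublin Core fields."""
    record = ET.fromstring(record_xml)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "titles": [e.text for e in record.findall(".//dc:title", NS)],
        "dc_identifiers": [e.text for e in record.findall(".//dc:identifier", NS)],
    }


print(summarise_record(SAMPLE_RECORD))
```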

Status

An OAI-PMH record has a number of fields, some of which are optional, such as status. The status field indicates whether the record is in a deleted state. Although optional, this field plays a very important part in the harvesting process: it allows harvesting services to keep records up to date and avoid showing duplicates. The repository software must support this field for this to work. For example, if an item is deliberately deleted and re-added under another unique identifier/PID, the repository software must be capable of updating the OAI-PMH records without further human intervention. This would not be an issue if the harvesting service harvested the entire repository each time; however, after the initial harvest, incremental harvests are often performed.
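The role of the status field in an incremental harvest can be sketched as follows. The local index (a plain dictionary keyed by identifier) and the function name apply_record are assumptions made for illustration; a real harvesting service would store far more than a datestamp.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def apply_record(index, record_xml):
    """Update a local index from one harvested record.

    Records whose header carries status="deleted" are removed from the
    index; all other records are added or refreshed.
    """
    record = ET.fromstring(record_xml)
    header = record.find(OAI_NS + "header")
    identifier = header.findtext(OAI_NS + "identifier")
    if header.get("status") == "deleted":
        index.pop(identifier, None)  # drop it so users no longer see it
    else:
        index[identifier] = header.findtext(OAI_NS + "datestamp")
    return index
```

Without the deleted status, the harvester in this sketch would have no way of knowing that the old identifier should be dropped, and the re-added item would appear twice.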

Selective Harvesting

Sets allow the data provider to organise records into groups that can be harvested selectively. For example, a repository might offer only records with a certain subject via OAI-PMH, leaving the remaining items unavailable. If more than one set is configured, a record may appear in several sets, depending on the criteria the repository applies for set membership. Records may also be harvested by date range using the datestamp field. To support this, a repository must update the record’s datestamp whenever the object is modified.
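As an illustration of a record belonging to more than one set, the sketch below groups record identifiers by the set names found in their headers. The sample response fragment, the set names theses and physics, and the function name group_by_set are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# A made-up fragment of a ListIdentifiers response.
SAMPLE_HEADERS = """<ListIdentifiers xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:the-repository.com:1</identifier>
    <datestamp>2005-03-17</datestamp>
    <setSpec>theses</setSpec>
    <setSpec>physics</setSpec>
  </header>
  <header>
    <identifier>oai:the-repository.com:2</identifier>
    <datestamp>2005-03-18</datestamp>
    <setSpec>physics</setSpec>
  </header>
</ListIdentifiers>"""


def group_by_set(list_identifiers_xml):
    """Map each set name to the identifiers of the records it contains."""
    groups = defaultdict(list)
    root = ET.fromstring(list_identifiers_xml)
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        # A header may list several sets, so one record can land in many groups.
        for set_spec in header.findall(OAI_NS + "setSpec"):
            groups[set_spec.text].append(identifier)
    return dict(groups)


print(group_by_set(SAMPLE_HEADERS))
```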

OAI Requests

To create a request, harvesting services use HTTP GET or POST parameters together with one of the six request types listed below. The completed request consists of the base URL of the data provider followed by a query string containing the request type, for example:

http://the-repository.com/oai?verb=ListRecords&metadataPrefix=oai_dc

Where

http://the-repository.com/oai

is the base URL and

?verb=ListRecords&metadataPrefix=oai_dc

is the query string, made up of the request type ListRecords (list all records) and the metadata prefix oai_dc.

It is important to remember that any special characters in the URL will need to be escaped using the correct URL encoding (not to be confused with HTML encoding).
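One way to avoid encoding mistakes is to let a library build the query string. The sketch below uses Python's standard urllib.parse.urlencode; the function name build_request and the record identifier are hypothetical.

```python
from urllib.parse import urlencode


def build_request(base_url, **arguments):
    """Join the base URL and URL-encoded arguments into a complete request."""
    return base_url + "?" + urlencode(arguments)


url = build_request(
    "http://the-repository.com/oai",
    verb="GetRecord",
    identifier="oai:the-repository.com:item/1234",  # contains ':' and '/'
    metadataPrefix="oai_dc",
)
print(url)
# http://the-repository.com/oai?verb=GetRecord&identifier=oai%3Athe-repository.com%3Aitem%2F1234&metadataPrefix=oai_dc
```

The colons and slash in the identifier are escaped as %3A and %2F, while the separators of the query string itself are left untouched.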

 

OAI request            OAI response
Identify               Provides basic information about the repository such as repository name, base URL, protocol version, earliest date stamp, granularity, support for deleted records, repository contact e-mail.
ListSets               Provides a list of sets that are established in the repository.
ListMetadataFormats    Provides a list of metadata formats that are supported, for example unqualified Dublin Core.
GetRecord              Provides an individual record in the repository.
ListRecords            Provides the metadata for each record that meets the specified criteria (such as a specific date range).
ListIdentifiers        Provides basic information about each record in the repository that meets the specified criteria, including the Unique Identifier.
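A harvester can check a request against the six request types before sending it. The sketch below is a simplification based on a reading of the OAI-PMH 2.0 specification: it covers only the arguments that are always required, and the names REQUIRED_ARGUMENTS and missing_arguments are hypothetical.

```python
# Required arguments per request type (assumed from the OAI-PMH 2.0
# specification; optional arguments are deliberately left out).
REQUIRED_ARGUMENTS = {
    "Identify": set(),
    "ListSets": set(),
    "ListMetadataFormats": set(),
    "GetRecord": {"identifier", "metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "ListIdentifiers": {"metadataPrefix"},
}


def missing_arguments(verb, arguments):
    """Return the required arguments that a request has not supplied."""
    if verb not in REQUIRED_ARGUMENTS:
        raise ValueError(f"unknown OAI-PMH request type: {verb}")
    return sorted(REQUIRED_ARGUMENTS[verb] - set(arguments))


print(missing_arguments("GetRecord", {"metadataPrefix": "oai_dc"}))
# ['identifier']
```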

 

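The final three requests in the table above take additional arguments. All three require a metadata prefix; ListRecords and ListIdentifiers can also be filtered by date range and by set. These arguments are described under the headings below.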

 

For further information on OAI requests and responses see http://www.openarchives.org/OAI/openarchivesprotocol.html

Date ranges
A request can specify a date range for the returned records (e.g. only records added, deleted or changed after 9am on March 17, 2005). Most harvesters request only records that have changed since the date of the last harvest. When doing so, they usually allow a small overlap in time so that no changes are missed.
Metadata prefix
All GetRecord, ListIdentifiers and ListRecords requests must have a metadata prefix specified.
Set
A request can specify that it only wants records from a specific set. A set is an optional construct for repositories to group items for the purpose of selective harvesting. A set is probably the best way to enable service providers to harvest only what you want to expose. The procedures for establishing sets vary according to the software application that is being used. The technical support person for your repository should be able to clarify how to establish sets. For assistance contact CAIRSS at cairss-technical@caul.edu.au.
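The three arguments described above can be combined in a single incremental request, sketched below. The function name incremental_request, the five-minute overlap and the set name theses are assumptions chosen for illustration. The sketch also assumes the repository accepts timestamps to the second; a repository reports its granularity in its Identify response, and some accept dates only.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def incremental_request(base_url, last_harvest, overlap=timedelta(minutes=5),
                        set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request for records changed since the last harvest.

    The start of the date range is moved back by `overlap` so that changes
    made while the previous harvest was running are not missed.
    """
    start = (last_harvest - overlap).astimezone(timezone.utc)
    arguments = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if set_spec is not None:
        arguments["set"] = set_spec  # restrict the harvest to one set
    return base_url + "?" + urlencode(arguments)


last = datetime(2005, 3, 17, 9, 0, tzinfo=timezone.utc)
print(incremental_request("http://the-repository.com/oai", last, set_spec="theses"))
# http://the-repository.com/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2005-03-17T08%3A55%3A00Z&set=theses
```

Because of the overlap, some records will be returned a second time; the harvester is expected to overwrite its existing copy rather than treat them as new.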