By Peter Sefton & Vicki Picasso
[Update 2010-08-06 - Fixed URI for Kate Watson] [Update 2010-08-11 by Peter - Fixed some typos. Thanks to Andrew Treloar for taking the time to point them out (we had a document corruption incident during the authoring process and not all our edits made it through). Apparently we have used 'which' when Andrew says we should have used 'that', which I didn't fix.]
In this post we want to explore some of the new trends in repository metadata that the CAIRSS community will need to be thinking about over the next few years, the main one being the emergence of linked-data semantic web approaches to resource description, using sets of statements about resources rather than metadata schemas like Dublin Core or MARC. We’ll do this by looking at a collaborative project we’re doing for ANDS setting up a repository for research data, which is something we’re all going to be doing in the future.
The Australian National Data Service (ANDS) is funding a number of projects to ‘seed the data commons’, and there are a number of clusters of activity developing around institutional stores of metadata for research data (that’s a clumsy phrase, but there’s no easy way to name this class of application). One of the activity clusters centres on the University of Newcastle, where there is a strategic institutional approach in this space involving the Library, the Research Office and IT. Newcastle has developed a model for a research data metadata registry, which Peter’s team at ADFI is helping to bring to life via a software development project that will build a metadata store for research data that can be installed alongside the Institutional Repository. Teula Morgan from Swinburne University is also working with the project with a view to potential implementation of the metadata store registry at Swinburne. This would also sit alongside their IR and fall under their ANDS ‘Seeding the Commons’ activity.
One of the starting points of this project is a commitment from ANDS, and our chief stakeholder on project EIF-040, Dr Andrew Treloar, to take a Linked Data approach to the metadata we will collect and manage for research data and related entities. There are a couple of points to this:
- Terminology – The most important thing is to (for once) agree on terminology so when metadata is aggregated into services like Research Data Australia we’re all calling the same ARC project by the same name, and using subject codes that match up. This is a real issue as it’s not obvious to a machine that “010000 – MATHEMATICAL SCIENCES” is the same as “MATHEMATICAL SCIENCES: 1000”. The point is that even where there are controlled vocabularies we don’t all use them in exactly the same way.
- If we’re able to agree on terminology, and on an information model for our metadata, then we can use URIs as identifiers for terms and concepts – meaning that people and machines can both be identified using names that self-document. Resolve the URI (i.e. follow the ‘link’) and you can get a human AND machine readable description of the resource being named.
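To make the terminology point concrete, here is a minimal Python sketch of mapping the two variant subject strings above onto one shared identifier. The URI under example.org is a made-up placeholder, and the `subject_uri` helper and its letters-only matching rule are our own assumptions for illustration, not how any ANDS service actually works.

```python
import re

# Hypothetical URI, for illustration only; a real vocabulary service
# would publish its own identifiers.
SUBJECT_URIS = {
    "MATHEMATICAL SCIENCES": "http://example.org/subject/mathematical-sciences",
}


def subject_uri(raw):
    """Map a free-text subject string to a URI, ignoring codes and punctuation."""
    # Replace every run of characters that is not a letter or a space.
    letters_only = re.sub(r"[^A-Za-z ]+", " ", raw)
    # Collapse whitespace and upper-case so the variants compare equal.
    label = " ".join(letters_only.split()).upper()
    return SUBJECT_URIS.get(label)


variants = ["010000 – MATHEMATICAL SCIENCES", "MATHEMATICAL SCIENCES: 1000"]
uris = [subject_uri(v) for v in variants]
```

Both variants end up with the same URI, which is the thing an aggregator can compare safely. A string that is not in the table returns `None` rather than a guess.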
Another cluster of activity working in this space is the ANDS-VITRO group, who have had a bit of a head start on us. The ANDS-VITRO group (Melbourne, Griffith and QUT) are collaborating on what appears to be an even more aggressively linked-data approach than the one we’re taking, based on the VIVO semantic web application. They will be using a pure RDF application for their metadata hub, meaning that everything is expressed in terms of statements where all the subjects, objects and predicates have URIs.
While these two clusters of activity are using different software, there is a great opportunity at the outset to model a consistent approach (where possible) by agreeing on, and sharing, some core vocabularies and/or ontologies. The importance of this is evident from lessons learnt from Australian institutional repository implementations. Just as repositories around Australia used different software (Eprints, DSpace, VITAL, Fez, et al.), they also used variant vocabularies for local set-ups. An example is the different forms of resource type names, which can create a problem upstream for holistic discovery, i.e. the challenge of normalising the data. Ideally we should strive for common vocabularies that are also flexible enough to allow for local mappings and variants.
But Teula Morgan has reminded us this week about the precedent in the institutional repository world, where the early promise of standardisation turned out to be not so promising when people discovered that proposed standards didn’t match their local requirements. Still, there is an opportunity for local customisation in the semantic-web approach – we might, for example, all use the same URI for a “thesis” but you might call it a thesis and I might call it a dissertation, and we could extend the local description to match our own set of degrees.
So, this post looks at what the “new” interoperable approach to metadata in repositories would look like. We both have a few years’ experience with IR software and metadata practice and we hope we’ve both learned a few lessons.
The basic rule for this project, from which pretty much all others follow, is:
No strings. Use URIs.
That is, when we refer to something in metadata we don’t say it’s of type “Article”, which risks confusion with someone else’s “Journal Article”; we say it’s an Article according to the bibliographic ontology, Bibo.
In the original open-access publications archive software, Eprints, the metadata is mostly a set of name-value pairs, so metadata is along the lines of:
Item type: Article
Title: My article
and so on, with slightly more complex structure for capturing names. DSpace takes a similar approach, and in both cases the metadata maps neatly onto Dublin Core for interchange over OAI-PMH. In the ARROW/VITAL world with which Vicki is familiar, the application can store anything you want in the underlying Fedora repository, but all the ARROW sites in Australia chose MARC XML (or MODS which is equivalent but easier for humans), a bibliographic standard, with a more complex structure than name:value pairs, but still with all the metadata in one schema in a single file.
Now, the approach being taken by the ANDS-VITRO people is completely different. The metadata they are using is a set of assertions, statements about resources. We will spare our readers (and ourselves) the gory details of the underlying RDF, but paraphrased in English, a record looks something like this:
<This item> <is a> <collection of research data>.
<This item> <has principal investigator> <http://nla.gov.au/nla.party-583221>.
<http://nla.gov.au/nla.party-583221> <has name string> “Katy Watson”.
And so on. The things in <> are all URIs, and ideally they would all resolve to both a human-readable and a machine-readable description of the thing, relationship or concept they name. In this example we have an explicit statement that, for this resource, the person identified by the URI <http://nla.gov.au/nla.party-583221> is going by the name “Katy Watson” for the purposes of this data collection, but in other contexts she might be Watson, K or other variants. (Actually, at time of writing there are at least two people represented by that URI; Kate will have to follow that up with the NLA.) Katy is going to change her name soon, so she’ll make a great case-study for those of us working on name authority services as well as working through the processes required for a modern researcher in name-authority land.
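For readers who think better in code, the three statements above can be sketched in Python as plain (subject, predicate, object) tuples. This is only an illustration of the shape of the data, not the ANDS-VITRO implementation: the `rdf:type` URI is the standard RDF one and the NLA party identifier comes from the example above, but every URI under example.org and the `objects` helper are placeholders we have invented.

```python
# Standard RDF predicate for "is a".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical placeholder URIs, invented for this sketch.
ITEM = "http://example.org/collection/1"
RESEARCH_DATA_COLLECTION = "http://example.org/terms/ResearchDataCollection"
HAS_PI = "http://example.org/terms/hasPrincipalInvestigator"
HAS_NAME = "http://example.org/terms/hasNameString"

# Party identifier taken from the example in this post.
PERSON = "http://nla.gov.au/nla.party-583221"

# Each statement is one (subject, predicate, object) tuple.
triples = [
    (ITEM, RDF_TYPE, RESEARCH_DATA_COLLECTION),
    (ITEM, HAS_PI, PERSON),
    (PERSON, HAS_NAME, "Katy Watson"),
]


def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the links: item -> principal investigator -> name string.
investigator = objects(ITEM, HAS_PI)[0]
name = objects(investigator, HAS_NAME)[0]
```

The name string is attached to the person's URI rather than to the item, so another collection could record a different form of the name against the same identifier.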
The advantage of this approach is that all sorts of different vocabularies can be used together – if something is already defined in Dublin Core or MARC then you can use that in a description. This contrasts with the approach you would have to take in Eprints- or DSpace-type name-value systems, where you would just define a new locally-relevant name for something, say “data-retention-time”; in the linked-data approach you can use the URI. The main point is that the use of an ontology lets us agree on a set of defined things and STILL use locally-relevant strings.
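As a rough sketch of that last point, the Python below keys one description by URI and lets two sites display it with their own labels. The Dublin Core title URI is the real one; the retention-time URI, the sample values, the two label tables and the `render` helper are all hypothetical and only there to show the idea.

```python
# Real Dublin Core term.
DC_TITLE = "http://purl.org/dc/terms/title"
# Hypothetical URI standing in for an agreed "data retention time" term.
RETENTION = "http://example.org/terms/dataRetentionTime"

# One description, keyed by URI, mixing terms from two vocabularies.
description = {
    DC_TITLE: "Survey responses 2010",
    RETENTION: "5 years",
}

# Each site keeps its own display strings for the same URIs.
site_a_labels = {DC_TITLE: "Title", RETENTION: "data-retention-time"}
site_b_labels = {DC_TITLE: "Name of collection", RETENTION: "Keep until"}


def render(description, labels):
    """Swap each URI for the local label, falling back to the URI itself."""
    return {labels.get(uri, uri): value for uri, value in description.items()}


view_a = render(description, site_a_labels)
view_b = render(description, site_b_labels)
```

The stored description never changes; only the labels shown to people do, so two sites can aggregate their records without first agreeing on display strings.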
This is the approach that the ANDS-VITRO consortium have been taking, and they tell us that they are reasonably confident that they have covered the data that ANDS will require for Research Data Australia, i.e. for RIF-CS. Simon Porter from Melbourne University is working to make sure that they have the power to describe not only the attributes of research data that ANDS care about, but also other things, such as retention times, which the organisation and Research Office will need for compliance with codes of practice, funder requirements and legislation.
The next step for our collaboration is for Vicki to work with the project team at Newcastle, and our network of collaborators including Swinburne and USQ, to look at the work that’s been done by ANDS-VITRO and to work out (a) where we can suggest extensions that will be broadly useful and (b), if absolutely necessary, which Newcastle-specific things need to be added. The desire of our projects, however, is to develop models that have wider applicability for the community.
Note that while we are talking about all this complex metadata schema stuff behind the scenes, the users won’t have to deal with more complexity than they do now; they should see less. They’ll simply be picking subject codes from lists, and when they type in a name the system will help them pick which person they’re talking about.
Copyright Peter Sefton and Vicki Picasso, 2010.