CAIRSS


Events

Activity

  • Another quick reminder… all CAUL institutions need to have their theses content out of ADT VT-ETD software and into their research repositories by end of September. Tim has been working through a comprehensive strategy to assist individual institutions with the harvesting and data transformation of this VT-ETD content. Please email Tim at cairss-technical@caul.edu.au if you have any questions or require any further assistance.
  • Dr Peter Sefton has completed and posted the CAIRSS blog document Getting into Google, Google Scholar and other search engines. The post has been well received; please do not hesitate to ask questions or share information in the comments section on the post.


[This is a draft of a document for the CAIRSS website. We are putting it here on the blog so people can give feedback before it goes live. Please comment below if you have a question that is not answered, a suggestion, or if you have spotted a typo.]

Quick start

If you want to get started straight away, we offer this summary of three things to do:

  1. Get into Google Scholar. The single best starting point for getting good exposure is to follow Google’s instructions for how to be indexed in Google Scholar and then contact them using this form. For more discussion on this see below in the section on Getting into Google Scholar.
  2. Get harvested by as many aggregation sites as possible. To do this, follow the CAIRSS harvesting guide.
  3. Build a good site. Don’t spend time working on search engine optimization; spend it on repository usability and on promoting the repository to your target audience so that people cite your research outputs and link to them.
    • A site with incoming links will be indexed by Google et al. without the repository manager doing anything, as long as you don’t ban web crawlers from indexing your site.
    • Google Scholar is more particular about what it indexes, hence the advice in point one. Follow their guide.
    • Make sure your repository has clear browse-paths, e.g. by Author, Subject or Title, so users can discover content without searching. This also helps the search engines’ robot harvesters to find and index your content. Frequent updates will generally increase the frequency with which search engines index your site; frequent additions of fresh content may also influence search engine rankings, but this is not clear.

Getting into Google et al.

The basics

Since the early days of the web, there have been services which provide full-text search indexes of the web. These indexes are constructed by web-crawling software which essentially browses the web a page at a time and indexes its textual content. Broadly speaking, the crawlers follow links, and cannot discover pages that are not linked from somewhere.

Google became famous about ten years ago for improving search results with the PageRank algorithm, which took into account how many links pointed to the pages it was indexing to help order search results. The indexing processes used by search engines today are trade secrets and differ from search engine to search engine, so trying to optimise your exposure by worrying about them would be very resource intensive, not to mention risky, as attempts to ‘game’ the search engines typically result in your site being dropped from the index.

Why be indexed?

Most repositories are designed to promote the research output of an institution by making it discoverable. The main source for discovery, by far, is search engines. Even though the majority of traffic to your repository may not be coming from a search engine, chances are a lot of the traffic is from links and citations where the original discovery was via a search engine.

A recent very informal survey of CAIRSS sites showed that typically around 40% of traffic is coming from search engines and almost all of that search engine traffic is from Google.

Most CAIRSS respondents to the informal survey reported that aggregating sites such as Australian Research Online (ARO) and Trove account for only a few percent of overall traffic, but it is still important to be represented in such sites for a couple of reasons:

  • Their incoming links almost certainly raise the status of your repository with the search engines.
  • A 2008 survey of visitors to the ARO site showed that users were mainly researchers and students seeking information on a specific topic. These are the kinds of searches which will likely result in increased downloads and citations.

It is very difficult to make direct comparisons between sites as there is a huge amount of variation in the way statistics are collected. CAIRSS is working on a guide to repository stats.

Strategies for being indexed

  • Make sure the repository is harvested by discovery services. See this CAIRSS page listing discovery services. As noted above, these are typically not currently major sources of traffic but they do contribute to the rank of your site. And Trove in particular is experiencing very rapid growth, so we may see increases in traffic from there as users searching for Australian content of all kinds discover things from institutional repositories alongside the many other kinds of content indexed in Trove.
  • Follow the advice set out in the Google Scholar Inclusion Guidelines for Webmasters. All of that advice is relevant to making your site visible in Google et al. as well as Google Scholar, with the exception of the Indexing Guidelines section.

Getting into Google Scholar

Google Scholar is very important as it is the largest open index of scholarship accessible to most readers. It does a good job of finding multiple versions of scholarly articles and theses, including those on various publisher and database sites as well as open access versions, and it provides interfaces that make it easy for users to download articles into reference management software like Zotero, EndNote and Mendeley (to name a few). Getting Open Access versions of articles into Google Scholar is a key way to further the OA agenda.

CAIRSS and the Trove team from the National Library of Australia have been in discussion with the Google Scholar team. Google Scholar is interested in:

  • Full text, by which they mean HTML and PDF versions of scholarly material.

    If your repository does not have a lot of full text it will not be well-indexed by Google Scholar.

  • Material that they won’t find elsewhere, such as full-text thesis content.

Google Scholar is not set up to index discovery sites like the NLA’s Trove; it is designed to work on repositories which contain the full text of articles. So even if the NLA had enough metadata available to them, as supplied over OAI-PMH in Dublin Core, they could not make an effective Google Scholar web site. That said, being indexed by Trove is important as it will make your repository more visible to search engines, and increase traffic (if only by a little bit).

The Google Scholar documentation is very clear that the full-text download for an item must be ‘in a subdirectory’; that is, the URL must be ‘under’ the metadata page. To use a real example, this is the metadata page for an item in the USQ repository:

http://eprints.usq.edu.au/3838/

And the full text is in a ‘subdirectory’ (actually there are not really directories involved):

http://eprints.usq.edu.au/3838/1/BuildingInstitutionalInfrastructureInRegionalAustralia.pdf

This should not, in general, be a problem for your repository, as most are set up so that data streams are referenced in this way. But if, for example, you have made very zealous use of handles and try to reference a datastream via a handle or DOI, then Google Scholar may not index it. This could also be a problem for some non-mainstream proprietary repository software, some of which appears to make use of indirect links. Contact CAIRSS if you have a problem.
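To make the ‘subdirectory’ rule concrete, here is a minimal Python sketch that tests whether a full-text URL sits ‘under’ its metadata page. The function name is_under is ours, the handle-style URL in the usage note is made up, and the check is only our reading of the rule, not Google's own test.

```python
from urllib.parse import urlsplit


def is_under(metadata_url: str, fulltext_url: str) -> bool:
    """Return True if fulltext_url is 'under' metadata_url on the same host."""
    meta = urlsplit(metadata_url)
    full = urlsplit(fulltext_url)
    # Different scheme or host (e.g. a handle or DOI resolver) fails straight away.
    if (meta.scheme, meta.netloc.lower()) != (full.scheme, full.netloc.lower()):
        return False
    # Treat the metadata path as a directory prefix, so /3838/ matches
    # /3838/1/file.pdf but /383/ does not.
    prefix = meta.path if meta.path.endswith("/") else meta.path + "/"
    return full.path.startswith(prefix) and len(full.path) > len(prefix)
```

With the USQ example above, is_under("http://eprints.usq.edu.au/3838/", "http://eprints.usq.edu.au/3838/1/BuildingInstitutionalInfrastructureInRegionalAustralia.pdf") is true, while a made-up handle-style link such as "http://hdl.handle.net/0000/0000" would fail the check.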

The best way to be indexed in Google Scholar is to add metadata to your summary pages. This is covered in Indexing Guidelines in their guide. There are a variety of formats that can work but one in particular is recommended. This is supported ‘out of the box’ by EPrints; for other software, CAIRSS can assist by putting you in touch with sites using your software.
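As a rough illustration of what summary-page metadata can look like, here is a minimal Python sketch that builds a set of meta tags. It assumes the recommended format is the Highwire Press style of tag names (citation_title, citation_author, citation_publication_date, citation_pdf_url), which is our understanding of Google's guide; please check the names and the date format against the current Indexing Guidelines. The function name scholar_meta_tags and the sample title, author and date in the usage note are invented for the example.

```python
from html import escape
from typing import List


def scholar_meta_tags(title: str, authors: List[str],
                      publication_date: str, pdf_url: str) -> str:
    """Build <meta> tags for a summary page (tag names are an assumption)."""
    tags = [("citation_title", title)]
    # One citation_author tag per author, in order.
    tags += [("citation_author", author) for author in authors]
    tags.append(("citation_publication_date", publication_date))
    tags.append(("citation_pdf_url", pdf_url))
    # Escape values so quotes or ampersands in a title cannot break the HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in tags
    )
```

Calling scholar_meta_tags("An example title", ["Author, A"], "2010/08/06", "http://eprints.usq.edu.au/3838/1/BuildingInstitutionalInfrastructureInRegionalAustralia.pdf") gives four lines of HTML to place in the head of the summary page.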

CAIRSS sites: if you have configured your repository or want help then please contact CAIRSS and we will compile a technical guide using information from the community on how to set up the metadata for various software.

We are not going to reproduce Google’s very clear advice here, and recommend that CAIRSS sites follow their guide and then contact them using this form.

Troubleshooting

Not being indexed?

The number one reason you might not be able to find content in Google et al. is that your robots.txt file prevents access to their crawlers. This is worth checking. For example, CAIRSS discovered that one of the commercial software vendors has recently shipped updates to their software which by default block access to web crawlers.
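One quick way to check is to run your robots.txt through a parser and ask whether a given crawler may fetch a given page. The sketch below uses Python's standard urllib.robotparser module; the helper name crawler_allowed and the two sample robots.txt texts are ours, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample file that blocks every crawler from the whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# Sample file that blocks nothing (an empty Disallow means 'allow everything').
ALLOW_ALL = "User-agent: *\nDisallow:\n"
```

Paste the contents of your own robots.txt in place of the samples and test a few metadata and full-text URLs with the user agent "Googlebot".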

Other common reasons for a lack of indexed pages would be:

  • Your site has had extended periods of down-time. If so, search engines will eventually start dropping your content from the index.
  • Your site has changed its URLs without putting redirects in place so that the old URLs map to the new ones. Managing this is a basic governance process which needs to be in place for all of an institution’s websites, but is particularly important for repositories. If you change domain names, or software, or upgrade some software packages then you may need to add redirect rules which permanently redirect browser-clients to the new home of each URL.

    Note that using Handles does not automatically or magically fix this problem for two main reasons:

    • The Handles database needs to be updated with new URLs, something which we have found that even commercial vendors do not do when upgrading a repository.
    • Many incoming links to the repository will use the ‘ordinary’ URL for full-text items and metadata pages. If these links break, it will undermine the integrity of your repository and its usefulness in exposing research via increased readership and citation rates.
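A redirect rule of the kind described above can be sketched in a few lines of Python. The old URL layout (/archive/ followed by a zero-padded item number) and the function name permanent_redirect are hypothetical; substitute whatever your previous software actually used. In practice you would normally express the same mapping in your web server's configuration rather than in application code.

```python
import re
from typing import Optional, Tuple

# Hypothetical old layout: /archive/00003838/... ; new layout: /3838/...
OLD_ITEM = re.compile(r"^/archive/0*(\d+)/(.*)$")


def permanent_redirect(path: str) -> Optional[Tuple[int, str]]:
    """Map an old item path to (301, new_path), or None if it is not an old path."""
    match = OLD_ITEM.match(path)
    if match is None:
        return None
    item_id, rest = match.groups()
    # 301 tells browsers and crawlers that the move is permanent.
    return 301, f"/{item_id}/{rest}"
```

Whatever mechanism you use, the point is that every old metadata and full-text URL answers with a permanent redirect to its new home, so existing links and search index entries carry over.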

Not seeing traffic referred from search engines?

This is not necessarily a question to do with search engines, but as noted above, we have found that some repository software uses a lot of redirects. The URL a user sees and clicks on may be to one university site, which then redirects to the repository. The stats module in the repository may then report that all the traffic is coming from university referrers, even though the original referring site has about a 40% chance of being a search engine.

The same thing is being indexed multiple times?

If the same content appears in your site on more than one URL then that can dilute the impact of the content in a search index. This can happen when the repository adds session information to its URLs or when multiple views of the same content can be set up. The solution is to add some metadata to all the pages showing the same content to say which version is the canonical one. See this guide, which works for multiple search engines.
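The metadata in question is usually a link element with rel="canonical" in the page head. As a minimal sketch, the Python below strips query parameters that do not change the content and emits that element. The parameter names in NOISE_PARAMS are hypothetical, so replace them with whatever your repository software actually appends.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change the URL but not the content shown.
NOISE_PARAMS = {"sessionid", "view"}


def canonical_url(url: str) -> str:
    """Drop session/view parameters and any fragment from a page URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the head of the page."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'
```

Every variant of a page (with or without session information) would then carry the same tag, pointing at the one URL you want indexed.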

Other resources

We will not attempt to provide a complete guide here, and do not have the resources to keep it up to date. For general indexing/search issues try Google’s webmaster help forum.

This document was compiled by Peter Sefton with input from the CAIRSS team and the CAIRSS community.

Copyright USQ, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>



For a blunt exposition on why Search Engine Optimization is not helpful see Spammers, Evildoers and Opportunists, by Derek Powazek.
There are lots of complexities here. For example, many ARROW/VITAL sites banned crawlers to make previous versions of the software more stable, meaning they have small footprints in Google, and some sites have URLs which are redirected a couple of times, meaning the repository statistics module sees only referring traffic from the same institution.


Events

  • Registration is now open for the 2010 CAIRSS Community Day in November.
    See CAIRSS website for further details: http://cairss.caul.edu.au/www/events/cairss_community_day_2010.htm.
    2 delegates from each CAUL institution are invited to attend. Register by emailing your details to cairss@caul.edu.au. Registrations close 25th October 2010.
  • 2010 CAIRSS Community Day program is now available on the CAIRSS website. See: http://cairss.caul.edu.au/www/events/cairss_community_day_program_2010.pdf. Thank you to the community members who nominated to give presentations.
  • The Brisbane CAIRSS Copyright Workshop was held this week with good attendance. Across the Sydney, Perth and Brisbane workshops we had 87 participants in total (with representation of 90% of CAUL institutions). Additional groups who participated in the workshops included CSIRO, ANDS and NLA.

Activity

  • Please note CAIRSS Project Manager Katy Watson will be on annual leave from 20th August to 7th September 2010. Community members can continue to contact CAIRSS staff member Tim McCallum during this time on cairss-technical@caul.edu.au and 07 4631 2129.
  • Reminder: all CAUL institutions need to have their theses content out of ADT VT-ETD software and into their research repositories by end of September (only 5 weeks remaining for this work to be done). A new section outlining these ADT changes and new requirements has been added to the CAIRSS website at: http://cairss.caul.edu.au/www/theses/new_nla_theses_view.htm. Contact Tim at CAIRSS (cairss-technical@caul.edu.au) for assistance migrating your theses content from VT-ETD to your IR. If an institution’s theses content is not migrated from the institutional VT-ETD system to the institutional research repository by this time, that theses content will not be included in the new NLA theses display (and ADT replacement).
  • Dr Peter Sefton and the CAIRSS team have almost finalised the new webpage Getting into Google, Google Scholar and other search engines. Coming soon.

For your interest

  • Tim McCallum has given the CAIRSS website a bit of a usability makeover to comply better with web usability standards. If you notice any problems with the new CAIRSS website please let us know at cairss-technical@caul.edu.au so we can fix it.


Events

  • Registration is now open for the 2010 CAIRSS Community Day in November.
    See CAIRSS website for further details: http://cairss.caul.edu.au/www/events/cairss_community_day_2010.htm.
    2 delegates from each CAUL institution are invited to attend. Register by emailing your details to cairss@caul.edu.au. Registrations close 25th October 2010.
    Program currently under development (to be released late August).
  • The Perth CAIRSS Copyright Workshop went well last week with 23 delegates attending.
    The final CAIRSS Copyright Workshop for 2010 will be taking place in Brisbane on the 17 August (http://cairss.caul.edu.au/www/copyright/cairss_copyright_workshop.htm). With 26 delegates registered to attend this should be a useful meeting.

Activity

For your interest

  • The Australian National Data Service (ANDS) has developed an annual survey to enable them to better understand perceptions of ANDS. This is the first time the survey has been undertaken and offers an opportunity to provide input that will help ANDS direct its efforts more effectively. For those interested in taking the survey see: http://surveys.insyncsurveys.com.au/surveys/ANDSAttitude2010/.
  • August issue of the SPARC Open Access Newsletter is now available at: http://www.earlham.edu/~peters/fos/newsletter/08-02-10.htm.
  • In the run-up to Open Access Week, 18th-24th October 2010, the team at the ‘Repositories Support Project’ (http://www.rsp.ac.uk/) have suggested raising awareness of the event in the wider community by proposing the week as a subject for a Google Doodle on Google’s main home page or the home page of Google Scholar. Since Google will receive large numbers of suggestions for Doodles, they ask if the OA community could join them in lobbying Google to consider OA week. To make a suggestion for the Doodle simply email Google at proposals@google.com.
    Suggested email text: The week of October 18th is international Open Access week, with events and activities taking place world-wide on university and research campuses. We believe that Open Access, researchers making their research articles available for free to all, is an exciting and important shift in the availability of research for scholars and the public. We would like to support the idea of a Google Doodle on Open Access for this week.


By Peter Sefton & Vicki Picasso

[Update 2010-08-06 - Fixed URI for Kate Watson] [Update 2010-08-11 by Peter - Fixed some typos. Thanks to Andrew Treloar for taking the time to point them out (we had a document corruption incident during the authoring process and not all our edits made it through). Apparently we have used 'which' when Andrew says we should have used 'that', which I didn't fix.]

In this post we want to explore some of the new trends in repository metadata that the CAIRSS community will need to be thinking about over the next few years, the main one being the emergence of linked-data semantic web approaches to resource description, using sets of statements about resources rather than metadata schemas like Dublin Core or MARC. We’ll do this by looking at a collaborative project we’re doing for ANDS setting up a repository for research data, which is something we’re all going to be doing in the future.

The Australian National Data Service (ANDS) is funding a number of projects to ‘seed the data commons’, and there are a number of clusters of activity developing around institutional stores of metadata for research data (that’s a clumsy phrase, but there’s no easy way to name this class of application). One of the activity clusters centres around the University of Newcastle, where there is a strategic institutional approach in this space between the Library, the Research Office and IT. Newcastle has developed a model for a research data metadata registry, which Peter’s team at ADFI is helping to bring to life via a software development project which will build a metadata store for research data that can be installed alongside the Institutional Repository. Teula Morgan from Swinburne University is also working with the project with a view to potential implementation of the metadata store registry at Swinburne. This would also be alongside their IR and under their ANDS Seeding the Commons activity.

One of the starting points of this project is a commitment from ANDS, and our chief stakeholder on project EIF-040, Dr Andrew Treloar, to take a Linked Data approach to the metadata we will collect and manage for research data and related entities. There are a couple of points to this:

  1. Terminology – The most important thing is to (for once) agree on terminology so when metadata is aggregated into services like Research Data Australia we’re all calling the same ARC project by the same name, and using subject codes that match up. This is a real issue as it’s not obvious to a machine that 010000 – MATHEMATICAL SCIENCES is the same as MATHEMATICAL SCIENCES: 1000. The point is that even where there are controlled vocabularies we don’t all use them in exactly the same way.
  2. If we’re able to agree on terminology, and on an information model for our metadata, then we can use URIs as identifiers for terms and concepts, meaning that people and machines can both be identified using names that self-document. Resolve the URI (i.e. follow the ‘link’) and you can get human AND machine readable descriptions of the resource being named.
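To see why the two subject strings in point 1 are a problem for a machine, consider this small Python sketch. It compares the strings directly and then after a crude clean-up that keeps only the words. The function name subject_label is ours, and the clean-up is deliberately naive: it is exactly the kind of brittle string matching that agreed URIs are meant to replace.

```python
import re


def subject_label(raw: str) -> str:
    """Keep only the words of a subject string, upper-cased and single-spaced."""
    words = re.findall(r"[A-Za-z]+", raw)
    return " ".join(word.upper() for word in words)


A = "010000 – MATHEMATICAL SCIENCES"
B = "MATHEMATICAL SCIENCES: 1000"

# A direct comparison says these are different subjects.
strings_match = (A == B)
# After stripping codes and punctuation the labels line up.
labels_match = (subject_label(A) == subject_label(B))
```

Here strings_match is false and labels_match is true, but note that the numeric codes still disagree, so a label-only match like this can hide real differences in how a vocabulary is being used.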

Another cluster of activity working in this space is the ANDS-VITRO group who have had a bit of a head-start on us. The ANDS-VITRO (Melbourne, Griffith and QUT) group are collaborating on what appears to be an even more aggressively linked-data approach than the one we’re taking, based on the VIVO semantic web application. They will be using a pure RDF application for their metadata hub meaning that everything is expressed in terms of statements where all the subjects, objects and predicates have URIs.

While these two clusters of activity are using different software, there is a great opportunity at the outset to model a consistent approach (where possible) through common use of, and agreement on, some core vocabularies and/or ontologies. The importance of this is evident from lessons learnt from Australian institutional repository implementations. Just as repositories around Australia used different software (EPrints, DSpace, VITAL, Fez, et al), they also used variant vocabularies for local set-ups. An example is the different forms of resource type names, which can create a problem upstream for holistic discovery, i.e. the challenge of normalising the data. Ideally we should strive for common vocabularies which are also flexible enough to allow for local mappings and variants.

But Teula Morgan has reminded us this week about the precedent in the institutional repository world, where the early promise of standardisation turned out to be not so promising when people discovered that proposed standards didn’t match their local requirements. Still, there is an opportunity for local customisation in the semantic web approach: we might, for example, all use the same URI for a thesis, but you might call it a thesis and I might call it a dissertation, and we could extend the local description to match our own set of degrees.

So, this post looks at what the new interoperable approach to metadata in repositories would look like. We both have a few years’ experience with IR software and metadata practice and we hope we’ve both learned a few lessons.

The basic rule for this project, from which pretty much all others follow, is:

No strings. Use URIs.

That is, when we refer to something in metadata we don’t say it’s of type Article, which risks confusion with someone else’s Journal Article; we say it’s an Article according to the bibliographic ontology, Bibo.
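In code terms, the rule means comparing identifiers rather than display labels. The Python sketch below maps local type labels to a shared URI and compares on the URI. The Bibo Article URI is the one named in this post; the decision to map both ‘Article’ and ‘Journal Article’ to it is an assumption made purely for illustration, and the names LOCAL_TYPE_LABELS and same_type are ours.

```python
from typing import Optional

BIBO_ARTICLE = "http://purl.org/ontology/bibo/Article"

# Each site keeps its own display labels but points them at a shared URI.
# Mapping both labels to the same URI is an assumption for this example.
LOCAL_TYPE_LABELS = {
    "Article": BIBO_ARTICLE,
    "Journal Article": BIBO_ARTICLE,
}


def type_uri(label: str) -> Optional[str]:
    """Look up the shared URI for a local label, or None if it is unmapped."""
    return LOCAL_TYPE_LABELS.get(label)


def same_type(label_a: str, label_b: str) -> bool:
    """Two labels mean the same type only if both map to the same URI."""
    uri_a = type_uri(label_a)
    return uri_a is not None and uri_a == type_uri(label_b)
```

An aggregator that receives the URI never has to guess whether two differently worded labels mean the same thing, and an unmapped label is reported as unknown rather than silently matched.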

In the original open-access publications archive software, Eprints, the metadata is mostly a set of name-value pairs, so metadata is along the lines of:

Item type: Article
Title: My article

and so on, with slightly more complex structure for capturing names. DSpace takes a similar approach, and in both cases the metadata maps neatly onto Dublin Core for interchange over OAI-PMH. In the ARROW/VITAL world with which Vicki is familiar, the application can store anything you want in the underlying Fedora repository, but all the ARROW sites in Australia chose MARC XML (or MODS which is equivalent but easier for humans), a bibliographic standard, with a more complex structure than name:value pairs, but still with all the metadata in one schema in a single file.

Now, the approach being taken by the ANDS-VITRO people is completely different. The metadata they are using is a set of assertions, statements about resources. We will spare our readers (and ourselves) the gory details of the underlying RDF, but paraphrasing, in English a record looks something like this:

<This item> <is a> <collection of research data>.

<This item> <has principal investigator> <http://nla.gov.au/nla.party-583221>

<http://nla.gov.au/nla.party-583221> <has name string> Katy Watson

And so on. The things in <> are all URIs, and ideally they would all resolve to both a human-readable and a machine-readable description of the thing, relationship or concept they name. In this example we have an explicit statement that, for this resource, the person identified by the URI <http://nla.gov.au/nla.party-583221> is going by the name Katy Watson for the purposes of this data collection, but in other contexts she might be Watson, K or other variants. (Actually, at time of writing there are at least two people represented by that URI; Kate will have to follow that up with the NLA.) Katy is going to change her name soon, so she’ll make a great case-study for those of us working on name authority services as well as working through the processes required for a modern researcher in name-authority land.
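For readers who think in code, the three statements above can be sketched as plain (subject, predicate, object) tuples in Python, with no RDF library involved. The NLA party URI is the one from the example and the rdf:type URI is the standard one; every example.org URI is a made-up placeholder, and none of this is the actual ANDS-VITRO data model.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PERSON = "http://nla.gov.au/nla.party-583221"
# Placeholder URIs, invented for this sketch.
ITEM = "http://example.org/data/collection-1"
DATA_COLLECTION = "http://example.org/vocab/ResearchDataCollection"
HAS_PI = "http://example.org/vocab/hasPrincipalInvestigator"
HAS_NAME = "http://example.org/vocab/hasNameString"

RECORD: List[Triple] = [
    (ITEM, RDF_TYPE, DATA_COLLECTION),   # <This item> <is a> <collection of research data>
    (ITEM, HAS_PI, PERSON),              # <This item> <has principal investigator> <person>
    (PERSON, HAS_NAME, "Katy Watson"),   # <person> <has name string> Katy Watson
]


def objects(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def investigator_names(triples: List[Triple], item: str) -> List[str]:
    """Follow item -> investigator -> name string across the statements."""
    names: List[str] = []
    for person in objects(triples, item, HAS_PI):
        names.extend(objects(triples, person, HAS_NAME))
    return names
```

Because the name is a separate statement about the person, adding a second name string later is just another tuple; nothing about the item's own record has to change.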

The advantage of this approach is that all sorts of different vocabularies can be used together: if something is already defined in Dublin Core or MARC, then you can use that in a description. This contrasts with the approach you would have to take in Eprints or DSpace type name-value systems, where you would just define a new locally-relevant name for something, say data-retention-time; in the linked-data approach you can use the URI. The main point is that the use of an ontology lets us agree on a set of defined things and STILL use locally-relevant strings.

This is the approach that the ANDS-VITRO consortium have been taking, and they tell us that they are reasonably confident that they have covered the data that ANDS will require for Research Data Australia, i.e. for RIF-CS. Simon Porter from Melbourne University is working to make sure that they have the power to describe not only the attributes of research data that ANDS care about, but other things such as retention times which the organisation and Research Office will need for compliance with codes of practice, funder requirements and legislation.

The next step for our collaboration is for Vicki to work with the project team at Newcastle, and our network of collaborators including Swinburne and USQ, to look at the work that’s been done by ANDS-VITRO and to work out where we can (a) suggest extensions that will be broadly useful and (b), if absolutely necessary, identify Newcastle-specific things that need to be added. However, the desire of our projects is to develop models that have wider applicability for the community.

Note that while we are talking about all this complex metadata schema stuff behind the scenes, the users won’t have to deal with more complexity than they do now; they should see less. They’ll simply be picking subject codes from lists, and when they type in a name the system will help them pick which person they’re talking about.

Copyright Peter Sefton and Vicki Picasso, 2010.


The URI for that is http://purl.org/ontology/bibo/Article but don’t click that or you will be confronted by an angry OWL.