Thursday, April 8, 2010

Syndicating content from institutional repositories

Institutional repositories powered by software like EPrints and DSpace hold lots of good research content. How do academics potentially interested in this content find it? Obviously, if you're looking for literature on a research topic, say the nature of dispositional properties, you're not going to visit every single IR to do a local search on "dispositional properties". You'll likely do one of three things:
  1. Google "dispositional properties".
  2. Search Google Scholar, Web of Science, or some other generic research index for "dispositional properties".
  3. Search a relevant subject repository for "dispositional properties". In this case PhilPapers would serve you well. Try searching for "dispositional properties" on PP. Not only do you get tons of highly relevant content, but you get a link to a bibliography on dispositional properties maintained by an expert on the subject.
I know from my logs that many people go for option (3) much of the time. PP will generally give you more relevant results than options (1) and (2), and it will often turn up papers that aren't even indexed by Scholar, Web of Science & co. That's what it's designed to do.

But my aim here is not to brag about PP's virtues. It's to point out that the content held by IRs can only reach its intended audience if it's properly indexed, and it can only have maximum impact if it's properly indexed by subject repositories (SRs) like PP. But indexing IRs along subject-specific lines is not trivial.

The challenge

How can SRs extract only the content that's relevant to their subject from IRs? Some suggestions:
  1. Ask content producers (academics) to submit their metadata to SRs.
  2. Harvest all content from all IRs, and filter out irrelevant content based on keywords or more advanced document classification techniques.
  3. Crowd-source a list of IRs and OAI sets relevant to your subject.
Of course, all three solutions can be pursued in parallel, but it's worth asking how much content one can expect to get through each of them.

I wouldn't expect much content from (1). Most academics barely have enough energy to submit their papers to IRs; don't ask them to submit them to relevant SRs on top of that. What they could do is submit only to SRs, but this wouldn't be a solution to the problem of syndicating IRs. (Incidentally, we've seen a drastic reduction in philosophy content coming from the few other archives we've been harvesting since PP's launch, so I think the transition to submitting directly to SRs is happening, but there's still a lot of content in IRs we'd like to dig up.)

(2) can't be relied on in isolation. It's a major document classification challenge. However, a lot of papers in IRs have relevant dc:subject attributes. We've tried filtering content from IRs based on the occurrence of the word 'philosophy' in dc:subject attributes. We think this classification method has high precision and decent (but far from perfect) recall. I'll post some data shortly.
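To make the dc:subject filter concrete, here's a minimal sketch of the idea. The sample record, namespace handling, and function names are my own illustration, not PP's actual harvester code; it just keeps a record when a subject keyword occurs in any of its dc:subject values.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# A hypothetical oai_dc record, as one might appear in an OAI-PMH response.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Dispositions and Their Bases</dc:title>
  <dc:subject>Philosophy of Science</dc:subject>
  <dc:subject>Metaphysics</dc:subject>
</oai_dc:dc>"""

def subjects(record_xml):
    """Return the dc:subject values of an oai_dc record."""
    root = ET.fromstring(record_xml)
    return [el.text or "" for el in root.findall(f"{{{DC_NS}}}subject")]

def is_relevant(record_xml, keyword="philosophy"):
    """Keep a record iff the keyword occurs in any dc:subject, case-insensitively."""
    return any(keyword in s.lower() for s in subjects(record_xml))

print(is_relevant(SAMPLE_RECORD))  # True
```

The case-insensitive substring match is what gives the method its high precision: records that mention "philosophy" in a subject heading are almost always relevant, while the many papers tagged only with narrower or differently-worded subjects are what limit recall.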

(3) is also far from perfect, if only because many IRs don't have relevant OAI sets. But a number of them do (for example, the philosophy department has its own set). When that's the case, that's definitely the technique to use.

Our solution

In the end, the solution we've settled on for PP is a combination of all three approaches. We've created an "archive registry" which we've initially populated with all archives listed in OpenDOAR. The registry is not public at the time of writing, but it soon will be. We're currently in the process of filtering out clearly irrelevant archives (most are).

When we harvest an archive, we first check if it has sets matching certain keywords (our "subject keywords"). If it does, we harvest all and only the contents of these sets. If it doesn't, we harvest everything and filter by matching dc:subject attributes against our subject keywords.
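The set-versus-keyword decision above can be sketched as follows. This is an illustrative reconstruction, not PP's actual code: the set names, keyword list, and return format are invented for the example.

```python
# Hypothetical subject keywords; PP's actual list may differ.
SUBJECT_KEYWORDS = {"philosophy", "ethics", "logic"}

def matching_sets(archive_sets, keywords=SUBJECT_KEYWORDS):
    """Return the OAI sets whose human-readable name contains a subject keyword."""
    return [s for s in archive_sets
            if any(k in s["name"].lower() for k in keywords)]

def harvest_plan(archive_sets):
    """Decide between set-harvesting and harvest-everything-then-filter.

    Returns ("sets", [setSpecs]) when matching sets exist, so we harvest
    all and only those sets; otherwise ("filter", None), meaning harvest
    everything and filter records by dc:subject.
    """
    hits = matching_sets(archive_sets)
    if hits:
        return ("sets", [s["spec"] for s in hits])
    return ("filter", None)

example = [{"spec": "phil", "name": "Department of Philosophy"},
           {"spec": "bio", "name": "Department of Biology"}]
print(harvest_plan(example))  # ('sets', ['phil'])
```

When an archive exposes a matching set, set-harvesting is clearly preferable: the archive's own curators have already done the subject classification, so precision and recall are both better than any keyword filter.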

The public registry comes in handy when a user notices that their IR-hosted papers aren't making it into PP as they would expect. In this case they can add or edit their IR as appropriate (or they can get their IR manager to do that). Users have the ability to force either set-harvesting or keyword-filtering behaviour. They can also set up a more complex hybrid process combining set- and keyword-filtering. We will bring this system online as soon as possible and adjust as we see fit based on the feedback we get.

Residual issues

There are other challenges aside from subject-specific harvesting.

One is broken archives. A lot of archives are broken, or have broken content. The most common problem is invalid HTML embedded in the metadata. This can be a serious problem, because a single item can block the harvesting process. If we see that too many archives are stuck like this we'll have to start sending out notices to the admin addresses.
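One defensive measure against the single-bad-item problem is to parse each record's metadata individually and skip the ones that fail, rather than parsing a whole response at once. A minimal sketch of that idea (the function and the sample batch are hypothetical, not PP's harvester):

```python
import xml.etree.ElementTree as ET

def parse_records(raw_records):
    """Parse each record separately so one invalid item can't block the batch.

    raw_records: an iterable of per-record XML strings, a stand-in for the
    record payloads of an OAI-PMH ListRecords response.
    Returns (parsed_elements, rejected_raw_strings).
    """
    good, bad = [], []
    for raw in raw_records:
        try:
            good.append(ET.fromstring(raw))
        except ET.ParseError:
            bad.append(raw)  # keep for logging / notifying the archive admin
    return good, bad

batch = ["<dc><title>Fine</title></dc>",
         "<dc><title>Unescaped & ampersand</title></dc>"]  # second is not well-formed
good, bad = parse_records(batch)
print(len(good), len(bad))  # 1 1
```

The rejected records are exactly what you'd want to attach to those notices to archive admins: the raw payload plus the parse error pinpoints the invalid HTML.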

Another issue is that archive metadata are generally incomplete as far as their machine-readable content goes. Typically, either we can't extract publication details or they are missing. That's a major limitation for a site like PP, which makes extensive use of such information. I can't think of an easy fix for that. The real solution would be good citation-extraction technology, but I'm not aware of any sufficiently reliable, publicly available algorithm.