Thursday, April 8, 2010

Syndicating content from institutional repositories

Institutional repositories powered by software like EPrints and DSpace hold lots of good research content. How do academics potentially interested in this content find it? Obviously, if you're looking for literature on a research topic, say the nature of dispositional properties, you're not going to visit every single IR to run a local search on "dispositional properties". You'll likely do one of three things:
  1. Google "dispositional properties".
  2. Search Google Scholar, Web of Science, or some other generic research index for "dispositional properties".
  3. Search a relevant subject repository for "dispositional properties". In this case PhilPapers would serve you well. Try searching for "dispositional properties" on PP. Not only do you get tons of highly relevant content, but you get a link to a bibliography on dispositional properties maintained by an expert on the subject.
My logs tell me that a lot of people go for option (3) much of the time. PP will generally give you more relevant results than options (1) and (2), and it will often turn up papers that aren't even indexed by Scholar or Web of Science & co. That's what it's designed to do.

But my aim here is not to brag about PP's virtues. It's to point out that the content held by IRs can only reach its intended audience if it's properly indexed, and it can only have maximum impact if it's properly indexed by subject repositories (SRs) like PP. But indexing IRs along subject-specific lines is not trivial.

The challenge

How can SRs extract only the content that's relevant to their subject from IRs? Some suggestions:
  1. Ask content producers (academics) to submit their metadata to SRs.
  2. Harvest all content from all IRs, and filter out irrelevant content based on keywords or more advanced document classification techniques.
  3. Crowd-source a list of IRs and OAI sets relevant to your subject.
Of course, all three solutions can be pursued in parallel, but it's worth asking how much content one can expect to get through each of them.

I wouldn't expect much content from (1). Most academics barely have enough energy to submit their papers to IRs; don't ask them to submit them to relevant SRs on top of that. What they could do is submit only to SRs, but this wouldn't be a solution to the problem of syndicating IRs. (Incidentally, we've seen a drastic reduction in philosophy content coming from the few other archives we've been harvesting since PP's launch, so I think the transition to submitting directly to SRs is happening, but there's still a lot of content in IRs we'd like to dig up.)

(2) can't be relied on in isolation. It's a major document classification challenge. However, a lot of papers in IRs have relevant dc:subject attributes. We've tried filtering content from IRs based on the occurrence of the word 'philosophy' in dc:subject attributes. We think this classification method has high precision and decent (though far from perfect) recall. I'll post some data shortly.
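
To illustrate, the filter itself can be as simple as scanning each record's dc:subject values for the keyword. Here is a minimal sketch in Perl; the record structure is invented for the example, and our production code of course works on real OAI-PMH records:

    use strict;
    use warnings;

    # Decide whether a harvested record looks like philosophy,
    # judging only by its dc:subject values.
    sub is_philosophy {
        my ($record) = @_;    # $record->{subjects}: arrayref of dc:subject strings
        for my $subject ( @{ $record->{subjects} || [] } ) {
            return 1 if $subject =~ /\bphilosophy\b/i;
        }
        return 0;
    }

    # Keep only the records that pass the filter.
    my @records = (
        { title => 'Dispositions and Causal Powers', subjects => ['Philosophy of science'] },
        { title => 'A Survey of Soil Acidity',       subjects => ['Agronomy'] },
    );
    my @philosophy = grep { is_philosophy($_) } @records;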

(3) is also far from perfect, if only because many IRs don't have relevant OAI sets. But a number of them do (for example, when the philosophy department has its own set in the institution's archive). When that's the case, that's definitely the technique to use.

Our solution

In the end, the solution we've settled on for PP is a combination of all three approaches. We've created an "archive registry" which we initially populated with all the archives listed in OpenDOAR. The registry is not public at the time of writing, but it soon will be. We're currently in the process of filtering out the clearly irrelevant archives (most are).

When we harvest an archive, we first check if it has sets matching certain keywords (our "subject keywords"). If it does, we harvest all and only the contents of these sets. If it doesn't, we harvest everything and filter by matching dc:subject attributes against our subject keywords.
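
In outline, the per-archive decision looks something like this. The sketch below is simplified and its data structures are invented for illustration; the real harvester works against OAI-PMH ListSets and ListRecords responses:

    use strict;
    use warnings;

    my @subject_keywords = ('philosophy', 'ethics', 'logic');    # illustrative keywords
    my $keyword_re       = join '|', map { quotemeta } @subject_keywords;

    # Return the subset of an archive's OAI sets whose names match a subject keyword.
    sub matching_sets {
        my ($archive) = @_;
        return grep { $_->{name} =~ /\b(?:$keyword_re)\b/i } @{ $archive->{sets} || [] };
    }

    # Decide how to harvest an archive: by relevant sets if it has any,
    # otherwise harvest everything and filter on dc:subject.
    sub harvest_plan {
        my ($archive) = @_;
        my @sets = matching_sets($archive);
        return @sets
            ? { mode => 'sets',    sets   => \@sets }
            : { mode => 'keyword', filter => qr/\b(?:$keyword_re)\b/i };
    }

    # Example: an archive with a department-level set gets set-based harvesting.
    my $plan = harvest_plan({
        name => 'Example IR',
        sets => [ { spec => 'dept_phil', name => 'Department of Philosophy' } ],
    });
    print "$plan->{mode}\n";    # prints "sets"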

The public registry comes in handy when a user notices that their IR-hosted papers aren't making it into PP as they would expect. In that case they can add or edit their IR as appropriate (or get their IR manager to do it). Users can force either set-harvesting or keyword-filtering behaviour, or set up a more complex hybrid process combining set- and keyword-based filtering. We will bring this system online as soon as possible and adjust it based on the feedback we get.

Residual issues

There are other challenges aside from subject-specific harvesting.

One is broken archives. A lot of archives are broken or have broken content. The most common problem is invalid HTML embedded in the metadata. This can be a serious problem, because a single bad item can block the harvesting process. If we see that too many archives are stuck like this, we'll have to start sending out notices to the admin addresses.
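
One way to keep a single malformed record from aborting a harvest is to parse each record's metadata defensively and log the failures instead of dying. A rough sketch, assuming the records arrive as individual XML strings and that XML::LibXML is available (which may not match how a given OAI client hands over its data):

    use strict;
    use warnings;
    use XML::LibXML;

    # Parse each record's XML individually so one bad record (e.g. unescaped
    # HTML in the metadata) doesn't abort the whole harvest.
    sub parse_records {
        my (@xml_chunks) = @_;
        my (@ok, @failed);
        for my $xml (@xml_chunks) {
            my $doc = eval { XML::LibXML->load_xml( string => $xml ) };
            if ($@) {
                push @failed, { xml => $xml, error => $@ };
                next;
            }
            push @ok, $doc;
        }
        warn scalar(@failed) . " record(s) could not be parsed\n" if @failed;
        return \@ok;
    }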

Another issue is that archive metadata are generally incomplete as far as machine-readable content goes. Typically, publication details are either missing altogether or not exposed in a form we can extract. That's a major limitation for a site like PP, which makes extensive use of such information. I can't think of an easy fix for that. The solution would be good citation extraction technology, and I'm not aware of any sufficiently reliable, publicly available algorithm.


A definition of "subject repository"

A discussion on the JISC conference site prompted me to suggest a definition of "subject repository" which I hope might be useful in discussing the place of services like PhilPapers:
A subject repository is a repository of research outputs (and possibly metadata about such outputs) whose primary mission is to give end users access to all and only the research content available in a given subject.
For the purposes of this definition, a subject can be as vast as the mathematical sciences or as narrow as any dissertation topic.

Of course, some services we're inclined to label as subject repositories impose subject-independent constraints on their content (e.g. open access availability). That's why I inserted the "primary mission" qualification: it allows that some SRs have auxiliary requirements which impose some limitations on what content they aggregate.

Thursday, April 1, 2010

Automatic categorization of citations using Perl

I'm glad to announce that we've made tremendous progress on the automatic categorization of PhilPapers entries. We have developed a categorizer which can assign area-level categories to 40% of entries with 98% accuracy. That is, 40% of entries are categorized and 60% are left uncategorized; of all the area-level categories assigned, 98% are correct. We doubt it would be possible to get better performance given all the noise in our training set (it's not like the humans who did the initial categorization are infallible and always aiming for accurate categorization).

We plan to put the categorizer into production shortly after Easter. Half the currently uncategorized entries (about 80,000) should be assigned areas, and half the new items coming into the index should automatically be assigned areas in the future. We also hope that we will be able to increase recall (the number of items categorized) while keeping precision above 95% as our training set improves. Our training set currently has about 120,000 entries and 30 categories.

The algorithm

We've combined two classification techniques: Naive Bayes and Support Vector Machines. An entry gets assigned to a category just in case our separate Bayesian and SVM categorizers assign it to that category. This tends to bias classification towards false negatives, which is exactly what we want in our case. In all our tests, Naive Bayes performs at least slightly better than SVM, but not well enough precision-wise for our purposes.
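
In plain Perl, the combination rule is just an intersection of the two classifiers' outputs; something like this (the classifiers themselves are abstracted away, and the category names are only illustrative):

    use strict;
    use warnings;

    # Keep only the categories that BOTH the Naive Bayes and the SVM classifier
    # assigned. This biases the combined classifier toward false negatives:
    # fewer assignments, but more reliable ones.
    sub agreed_categories {
        my ($nb_categories, $svm_categories) = @_;    # two arrayrefs of category names
        my %svm = map { $_ => 1 } @$svm_categories;
        return grep { $svm{$_} } @$nb_categories;
    }

    my @assigned = agreed_categories(
        [ 'Metaphysics', 'Philosophy of Mind' ],      # Naive Bayes says...
        [ 'Metaphysics' ],                            # SVM says...
    );
    # @assigned is ('Metaphysics'); the disputed category is dropped.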

Our combined classifier works on author names, titles, descriptors, and abstracts, as well as editor names and collection titles for collections and journal names for journal articles.

We've found that feature selection based on a test of independence rather than mere frequency significantly improves performance. We currently use the χ² test for this purpose. We retain only words whose χ² value exceeds a threshold we've determined through trial and error (6000 at the moment).
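
For a single (term, category) pair, the χ² score can be computed from the usual 2x2 contingency counts. A minimal sketch, ignoring smoothing and any other refinements the framework may apply; the example counts are arbitrary:

    use strict;
    use warnings;

    # Chi-square score for a (term, category) pair from a 2x2 contingency table.
    sub chi_square {
        my ($in_with, $out_with, $in_without, $out_without) = @_;
        # $in_with      = docs in the category that contain the term
        # $out_with     = docs outside the category that contain the term
        # $in_without   = docs in the category without the term
        # $out_without  = docs outside the category without the term
        my $n     = $in_with + $out_with + $in_without + $out_without;
        my $denom = ($in_with + $in_without) * ($out_with + $out_without)
                  * ($in_with + $out_with)   * ($in_without + $out_without);
        return 0 unless $denom;    # degenerate table
        return $n * ($in_with * $out_without - $out_with * $in_without)**2 / $denom;
    }

    # A term is kept as a feature only if its score clears the threshold.
    my $THRESHOLD = 6000;
    my $keep = chi_square(900, 100, 1_100, 117_900) >= $THRESHOLD;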

We've also found that certain feature transformations are essential to attain optimal performance. We transform author names so that "John Smith" becomes a single word: "xxJohnxxSmithxx". This distinguishes names from other orthographically identical words and ensures that classification is based on full-name matches rather than mere first-name or last-name matches. We also transform journal names in the same way.
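
The transformation itself is a one-liner; roughly:

    use strict;
    use warnings;

    # Turn "John Smith" into the single token "xxJohnxxSmithxx" so that
    # classification keys on whole names rather than on first or last names alone.
    sub name_token {
        my ($name) = @_;
        return 'xx' . join('xx', split /\s+/, $name) . 'xx';
    }

    print name_token('John Smith'), "\n";    # xxJohnxxSmithxx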

Implementation

We use the AI::Categorizer framework available on CPAN. This framework allowed us to test a range of classification algorithms, feature selection methods, and normalization techniques. While we're glad we decided to use the framework, we've had to fix a few bugs in it, and we've often been frustrated by the lack of documentation. It's not very polished, and some things don't work as one would expect. Hopefully that's going to improve in future releases (it's only at version 0.09, after all; we're going to submit our patches to the maintainer). We're going to release our customized classes soon; in the meantime, email me if you're interested.
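
For anyone who wants to experiment, the basic shape of training and categorizing with AI::Categorizer is roughly as follows. This is a sketch based on my reading of the 0.09 documentation, so treat the method and parameter names as approximate rather than authoritative:

    use strict;
    use warnings;
    use AI::Categorizer::KnowledgeSet;
    use AI::Categorizer::Learner::NaiveBayes;
    use AI::Categorizer::Document;

    # Build a knowledge set from pre-categorized entries.
    my $ks = AI::Categorizer::KnowledgeSet->new( name => 'training set' );
    $ks->make_document(
        name       => 'entry-1',
        content    => 'Dispositions, causal powers and categorical bases ...',
        categories => ['Metaphysics'],
    );
    # ... add the rest of the training set here ...

    # Train a Naive Bayes learner (the SVM learner is used the same way).
    my $nb = AI::Categorizer::Learner::NaiveBayes->new();
    $nb->train( knowledge_set => $ks );

    # Categorize a new, unseen document.
    my $doc = AI::Categorizer::Document->new(
        name    => 'entry-2',
        content => 'A new defence of dispositional essentialism ...',
    );
    my $hypothesis = $nb->categorize($doc);
    print 'Best category: ', $hypothesis->best_category, "\n";
    print 'All assigned: ',  join(', ', $hypothesis->categories), "\n";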

One definite virtue of AI::Categorizer is that the SVM and Naive Bayes categorizers that come with it have excellent default settings. We've played with many different setting combinations (described below), but the defaults turned out best, aside from the changes described above and some custom feature weighting we've introduced. Our feature weighting is as follows:

title: 1.5
abstract: 1
journal: 0.5
authors: 0.5
collection title: 0.5
editors: 0.5

While these settings seemed to improve performance, the difference was not always clearly significant.
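
In AI::Categorizer terms, this comes down to building each entry as a multi-field document and weighting the fields. Roughly (this relies on the Document class's content and content_weights parameters as I understand them; the field names and values here are of my own choosing):

    use strict;
    use warnings;
    use AI::Categorizer::Document;

    my $doc = AI::Categorizer::Document->new(
        name    => 'entry-42',
        content => {
            title            => 'Dispositions and Their Categorical Bases',
            abstract         => 'We argue that ...',
            journal          => 'xxSynthesexx',        # journal names get the same
            authors          => 'xxJohnxxSmithxx',     # "xx" treatment as author names
            collection_title => '',
            editors          => '',
        },
        content_weights => {
            title            => 1.5,
            abstract         => 1,
            journal          => 0.5,
            authors          => 0.5,
            collection_title => 0.5,
            editors          => 0.5,
        },
    );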

For stop words, we use the list provided by the Lingua::StopWords package.
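
Getting the list is a one-liner; the only assumption below is the 'en' language code:

    use strict;
    use warnings;
    use Lingua::StopWords qw( getStopWords );

    # getStopWords('en') returns a hashref mapping each English stopword to a true value.
    my $stopwords     = getStopWords('en');
    my @stopword_list = sort keys %$stopwords;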

We currently use a probability threshold of 0.9 for Naive Bayes.

χ² feature selection is done with a patched version of AI::Categorizer::FeatureSelector::ChiSquare.

What else we've tried

We've tried the b, t, n, f, p, and c feature weighting flags provided by AI::Categorizer, and none of them helped, either individually or in combination. Some of the flags had bugs resulting in divisions by zero; we've patched that.

We've tried the polynomial and radial kernels for SVM, but the default (linear) works best.

We've tried the KNN and DecisionTree classification algorithms, but neither managed to complete training with more than 10% of our training set (we ran out of memory on a 2GB VM). Either the algorithms themselves or the AI::Categorizer implementations are not sufficiently efficient. Their precision was also worse than that of Naive Bayes and SVM on small training sets.

We've tried purifying our training set by removing all items which could not be successfully categorized even when they were included in the training set (normally, we test on entries outside the training set). Surprisingly, this didn't help precision.

We've tried to use the Rainbow classifier, but we couldn't get it to compile. Development seems to have been abandoned in 1998.

In a previous project we had tried the Naive Bayes algorithm with every conceivable heuristic and feature selection / normalization trick. We could never achieve the performance we're getting now by combining SVM and Naive Bayes.