Thursday, April 1, 2010

Automatic categorization of citations using Perl

I'm glad to announce that we've made tremendous progress on the automatic categorization of PhilPapers entries. We have developed a categorizer which can assign area-level categories to 40% of entries with 98% accuracy. That is, 40% of entries are categorized and 60% are left uncategorized; of all the area-level categories assigned, 98% are correct. We doubt it would be possible to get better performance given all the noise in our training set (it's not like the humans who did the initial categorization are infallible and always aiming for accurate categorization).

We plan to put the categorizer into production shortly after Easter. Half the currently uncategorized entries (about 80,000) should be assigned areas, and half the new items coming into the index should automatically be assigned areas in the future. We also hope that we will be able to increase recall (the number of items categorized) while keeping precision above 95% as our training set improves. Our training set currently has about 120,000 entries and 30 categories.

The algorithm

We've combined two classification techniques: Naive Bayes and Support Vector Machines. An entry gets assigned to a category just in case our separate Bayesian and SVM categorizers assign it to that category. This tends to bias classification towards false negatives, which is exactly what we want in our case. In all our tests, Naive Bayes performs at least slightly better than SVM, but not well enough precision-wise for our purposes.

Our combined classifier works on author names, titles, descriptors, and abstracts, as well as editor names and collection titles for collections and journal names for journal articles.

We've found that feature selection based on a test of independence instead of mere frequency significantly improves performance. We currently use the Χ2 test for this purpose. We retain only words which have a minimum Χ2 value we've determined through trial and error (6000 at the moment).

We've also found that certain feature transformations are essential to attain optimal performance. We transform author names so that "John Smith" becomes a single word: "xxJohnxxSmithxx". This distinguishes names from other orthographically identical words and insures that classification is based on full name matches rather than mere firstname or lastname matches. We also transform journal names in the same way.


We use the AI::Categorizer framework available on CPAN. This framework allowed us to test a range of classification algorithms, feature selection methods, and normalization techniques. While we're glad we've decided to use the framework, we've had to fix a few bugs in it and we've often been frustrated by the lack of documentation. It's not very polished, and some things don't work as one would expect. Hopefully that's going to improve in future releases (it's only at version 0.09 after all; we're going to submit our patches to the maintainer). We're going to release our customized classes soon, but if anyone is interested, email me.

One definite virtue of AI::Categorizer is that the SVM and Naive Bayes categorizer that come with it have excellent default settings. We've played with many different setting combinations (described below), but the defaults turned out best aside from the changes described above and some custom feature weighting we've introduced. Our feature weighting is as follows:

title: 1.5
abstract: 1
journal: 0.5
authors: 0.5
collection title: 0.5
editors: 0.5

While these settings seemed to improve performance, the difference was not always clearly significant.

For stop words, we use the list provided by the Lingua::StopWords package.

We currently use a probability threshold of 0.9 for Naive Bayes.

Χ2 feature selection is done with a patched version of AI::Categorizer::FeatureSelector::ChiSquare.

What else we've tried

We've tried the b,t,n,f,p and c feature weighting flags provided by AI::Categorizer and none helped, either individually or in combination. Some bugs with some of the flags resulted in divisions by zero. We've patched that.

We've tried the polynomial, and radial kernels for SVM, but the default (linear) works best.

We've tried the KNN and DecisionTree classification algorithms, but neither managed to complete the training with more than 10% of our training set (we ran out of memory on a 2GB VM). Either the algorithms or the AI::Categorizer implementations are not sufficiently efficient. Their precision was also worse than Naive Bayes and SVM with small training sets.

We've tried purifying our training set by removing from it all items which could not be successfully categorized even when they were in the training set (normally, we test with different entries than those in the training set). Surprisingly, this didn't help precision.

We've tried to use the Rainbow classifier, but we couldn't get it to compile. Development seems to have been abandoned in 1998.

In a previous project we had tried the Naive Bayes algorithm with every possible heuristic and feature selection / normalization trick conceivable. We could never achieve the performance we're getting now by combination SVM and Naive Bayes.