Thursday, April 1, 2010

Automatic categorization of citations using Perl

I'm glad to announce that we've made tremendous progress on the automatic categorization of PhilPapers entries. We have developed a categorizer which can assign area-level categories to 40% of entries with 98% precision. That is, 40% of entries are categorized and 60% are left uncategorized; of all the area-level categories assigned, 98% are correct. We doubt it would be possible to do much better given all the noise in our training set (the humans who did the initial categorization are not infallible, and were not always aiming for accurate categorization).

We plan to put the categorizer into production shortly after Easter. Half the currently uncategorized entries (about 80,000) should be assigned areas, and half the new items coming into the index should automatically be assigned areas in the future. We also hope that we will be able to increase recall (the number of items categorized) while keeping precision above 95% as our training set improves. Our training set currently has about 120,000 entries and 30 categories.

The algorithm

We've combined two classification techniques: Naive Bayes and Support Vector Machines. An entry gets assigned to a category just in case our separate Bayesian and SVM categorizers assign it to that category. This tends to bias classification towards false negatives, which is exactly what we want in our case. In all our tests, Naive Bayes performs at least slightly better than SVM, but not well enough precision-wise for our purposes.
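The conjunction rule itself is trivial; a minimal sketch (the classifier outputs here are hypothetical stand-ins for what our AI::Categorizer wrappers return):

```perl
use strict;
use warnings;

# Assign a category only when both classifiers agree on it.
# $nb_cats and $svm_cats are array refs of category names proposed
# by the Naive Bayes and SVM categorizers respectively.
sub combined_categories {
    my ($nb_cats, $svm_cats) = @_;
    my %svm = map { $_ => 1 } @$svm_cats;
    return [ grep { $svm{$_} } @$nb_cats ];
}

my $cats = combined_categories(
    [ 'metaphysics', 'epistemology' ],   # Naive Bayes says
    [ 'metaphysics', 'logic' ],          # SVM says
);
# Only 'metaphysics' survives the intersection.
```

Requiring agreement shrinks the set of assigned categories, which is why recall drops to 40% while precision climbs to 98%.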

Our combined classifier works on author names, titles, descriptors, and abstracts, as well as editor names and collection titles for collections and journal names for journal articles.

We've found that feature selection based on a test of independence rather than mere frequency significantly improves performance. We currently use the χ² test for this purpose, retaining only words whose χ² value exceeds a threshold we've determined through trial and error (6000 at the moment).
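For reference, the χ² statistic for a word/category pair is computed from the 2×2 contingency table of word presence against category membership. A self-contained sketch of the computation (not the AI::Categorizer internals, and the counts below are made up):

```perl
use strict;
use warnings;

# Chi-square for a 2x2 contingency table:
#   $a = docs in the category containing the word
#   $b = docs outside the category containing the word
#   $c = docs in the category without the word
#   $d = docs outside the category without the word
sub chi_square {
    my ($a, $b, $c, $d) = @_;
    my $n     = $a + $b + $c + $d;
    my $denom = ($a + $b) * ($c + $d) * ($a + $c) * ($b + $d);
    return 0 unless $denom;    # guard against division by zero
    return $n * ($a * $d - $b * $c) ** 2 / $denom;
}

# A word is kept only if its chi-square exceeds our threshold.
my $keep = chi_square(900, 100, 2_000, 97_000) >= 6000;
```

A word that is distributed independently of the category scores near zero and is discarded, however frequent it is; that is the advantage over frequency-based selection.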

We've also found that certain feature transformations are essential for optimal performance. We transform author names so that "John Smith" becomes a single word: "xxJohnxxSmithxx". This distinguishes names from other orthographically identical words and ensures that classification is based on full-name matches rather than mere first-name or last-name matches. We transform journal names in the same way.
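The transformation is a simple string rewrite; a sketch (following the delimiter scheme in the example above):

```perl
use strict;
use warnings;

# Turn "John Smith" into the single token "xxJohnxxSmithxx" so that
# the full name is matched as a unit rather than as separate words.
sub name_token {
    my ($name) = @_;
    my @parts = split /\s+/, $name;
    return 'xx' . join('xx', @parts) . 'xx';
}

print name_token('John Smith'), "\n";   # xxJohnxxSmithxx
```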


We use the AI::Categorizer framework available on CPAN. The framework allowed us to test a range of classification algorithms, feature selection methods, and normalization techniques. While we're glad we decided to use it, we've had to fix a few bugs in it and have often been frustrated by the lack of documentation. It's not very polished, and some things don't work as one would expect. Hopefully that will improve in future releases (it's only at version 0.09, after all; we're going to submit our patches to the maintainer). We plan to release our customized classes soon; in the meantime, anyone interested can email me.

One definite virtue of AI::Categorizer is that the SVM and Naive Bayes categorizers that come with it have excellent default settings. We've played with many different setting combinations (described below), but the defaults turned out best, aside from the changes described above and some custom feature weighting we've introduced. Our feature weighting is as follows:

title: 1.5
abstract: 1
journal: 0.5
authors: 0.5
collection title: 0.5
editors: 0.5

While these settings seemed to improve performance, the difference was not always clearly significant.

For stop words, we use the list provided by the Lingua::StopWords package.

We currently use a probability threshold of 0.9 for Naive Bayes.

χ² feature selection is done with a patched version of AI::Categorizer::FeatureSelector::ChiSquare.
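Putting the pieces together, our configuration looks roughly like the following sketch. This is illustrative only: the document fields, IDs, and titles are made up, parameter spellings should be checked against the module's documentation, and we gloss over our patched classes:

```perl
use strict;
use warnings;
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::Document;
use Lingua::StopWords qw(getStopWords);

# Knowledge set built from our training entries, with the stopword
# list from Lingua::StopWords and our (patched) chi-square selector.
my $ks = AI::Categorizer::KnowledgeSet->new(
    name      => 'PhilPapers',
    stopwords => getStopWords('en'),
);

# Each entry becomes a document whose fields carry our custom weights.
my $doc = AI::Categorizer::Document->new(
    name    => 'entry-12345',                   # illustrative ID
    content => {
        title    => 'On the Plurality of Worlds',
        abstract => '...',
        authors  => 'xxDavidxxLewisxx',         # transformed name
    },
    content_weights => { title => 1.5, abstract => 1, authors => 0.5 },
);

# Both learners are trained on the same knowledge set, e.g.:
#   my $nb = AI::Categorizer::Learner::NaiveBayes->new;
#   $nb->train(knowledge_set => $ks);
# and a category is assigned only when the SVM agrees and the Naive
# Bayes score clears our 0.9 probability threshold.
```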

What else we've tried

We've tried the b, t, n, f, p, and c feature weighting flags provided by AI::Categorizer, and none helped, either individually or in combination. Bugs in some of the flags caused divisions by zero, which we've patched.

We've tried the polynomial and radial kernels for SVM, but the default (linear) works best.

We've tried the KNN and DecisionTree classification algorithms, but neither managed to complete training on more than 10% of our training set (we ran out of memory on a 2 GB VM). Either the algorithms or the AI::Categorizer implementations are not sufficiently efficient. Their precision was also worse than Naive Bayes and SVM with small training sets.

We've tried purifying our training set by removing from it all items which could not be successfully categorized even when they were in the training set (normally, we test with different entries than those in the training set). Surprisingly, this didn't help precision.

We've tried to use the Rainbow classifier, but we couldn't get it to compile. Development seems to have been abandoned in 1998.

In a previous project we had tried the Naive Bayes algorithm with every conceivable heuristic and feature selection / normalization trick. We could never achieve the performance we're now getting by combining SVM and Naive Bayes.

