I'm glad to announce that we've made tremendous progress on the automatic categorization of PhilPapers entries. We have developed a categorizer which can assign area-level categories to 40% of entries with 98% accuracy. That is, 40% of entries are categorized and 60% are left uncategorized; of all the area-level categories assigned, 98% are correct. We doubt it would be possible to get better performance given all the noise in our training set (it's not like the humans who did the initial categorization are infallible and always aiming for accurate categorization).
We plan to put the categorizer into production shortly after Easter. Half the currently uncategorized entries (about 80,000) should be assigned areas, and half the new items coming into the index should automatically be assigned areas in the future. We also hope that, as our training set improves, we will be able to increase recall (the proportion of items categorized) while keeping precision above 95%. Our training set currently has about 120,000 entries and 30 categories.
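To make the two figures concrete, here is a minimal sketch of how the proportion of categorized entries and the precision of the assigned categories could be computed on a held-out test set. The function name, the data layout, and the category names are illustrative assumptions, not our actual evaluation code.

```python
def coverage_and_precision(predictions, gold):
    """Return (share of entries that got any category, share of assigned categories that are correct).

    predictions: dict entry_id -> set of assigned categories (empty set = left uncategorized)
    gold:        dict entry_id -> set of correct categories
    Both layouts are assumptions made for this sketch.
    """
    categorized = [e for e, cats in predictions.items() if cats]
    coverage = len(categorized) / len(predictions)
    assigned = sum(len(predictions[e]) for e in categorized)
    correct = sum(len(predictions[e] & gold[e]) for e in categorized)
    precision = correct / assigned if assigned else 0.0
    return coverage, precision


# Toy data: five entries, two of which receive categories.
predictions = {1: {"Ethics"}, 2: set(), 3: {"Logic", "Metaphysics"}, 4: set(), 5: set()}
gold = {1: {"Ethics"}, 2: {"Logic"}, 3: {"Logic"}, 4: {"Ethics"}, 5: {"Aesthetics"}}
coverage, precision = coverage_and_precision(predictions, gold)
```

In this toy data two of five entries are categorized, and two of the three assigned categories are correct.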
The algorithm
We've combined two classification techniques: Naive Bayes and Support Vector Machines. An entry gets assigned to a category if and only if both our Bayesian and our SVM categorizers, run separately, assign it to that category. Requiring agreement biases classification towards false negatives, which is exactly what we want in our case. In all our tests, Naive Bayes performs at least slightly better than SVM, but not well enough precision-wise for our purposes.
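The agreement rule amounts to intersecting the two classifiers' outputs. A minimal sketch, assuming each classifier returns the set of categories it would assign (the function name and the example categories are hypothetical):

```python
def combine(nb_categories, svm_categories):
    """Keep a category only when both classifiers independently assigned it."""
    return set(nb_categories) & set(svm_categories)


# Naive Bayes proposes two areas, SVM agrees on only one of them.
agreed = combine({"Ethics", "Metaphysics"}, {"Ethics", "Logic"})

# When the classifiers disagree entirely, the entry stays uncategorized.
nothing = combine({"Ethics"}, {"Logic"})
```

Because a category survives only when both sides propose it, the combination can never assign more than either classifier alone, which is why errors shift towards false negatives.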
Our combined classifier works on author names, titles, descriptors, and abstracts, as well as editor names and collection titles for collections and journal names for journal articles.
We've found that feature selection based on a test of independence instead of mere frequency significantly improves performance. We currently use the χ² test for this purpose. We retain only words which have a minimum χ² value we've determined through trial and error (6000 at the moment).
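For readers unfamiliar with the statistic, here is a minimal sketch of χ² feature selection using the standard formula for a 2x2 word/category contingency table. It is not the AI::Categorizer code; the function names, the toy counts, and the toy threshold of 100 are assumptions for illustration (our real threshold is tuned to a much larger training set).

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: entries in the category that contain the word
    b: entries outside the category that contain the word
    c: entries in the category that lack the word
    d: entries outside the category that lack the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: the word or the category never varies
    return n * (a * d - b * c) ** 2 / denom


def select_features(tables, threshold):
    """Keep the words whose chi-square value reaches the threshold."""
    return {word for word, cells in tables.items() if chi_square(*cells) >= threshold}


# Toy counts over 1,000 entries, 100 of which are in the category.
tables = {
    "utilitarianism": (90, 10, 10, 890),  # strongly associated with the category
    "the": (95, 855, 5, 45),              # equally common inside and outside it
}
selected = select_features(tables, threshold=100)
```

A frequent word such as "the" scores zero here because it occurs at the same rate inside and outside the category, which is what a purely frequency-based selection would miss.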
We've also found that certain feature transformations are essential to attain optimal performance. We transform author names so that "John Smith" becomes a single word: "xxJohnxxSmithxx". This distinguishes names from other orthographically identical words and ensures that classification is based on full name matches rather than mere first-name or last-name matches. We also transform journal names in the same way.
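The name transformation can be sketched in a few lines. This version splits on whitespace only; how punctuation, initials, or casing are handled is not covered here, and the function name is hypothetical.

```python
def name_token(name):
    """Fuse a multi-word name into one token delimited by 'xx'."""
    parts = name.split()
    if not parts:
        return ""
    return "xx" + "xx".join(parts) + "xx"


author = name_token("John Smith")
journal = name_token("Philosophical Review")
```

The fused token can only match another occurrence of the full name, so an author called Smith no longer shares a feature with every other Smith, or with the common noun.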
Implementation
We use the AI::Categorizer framework available on CPAN. This framework allowed us to test a range of classification algorithms, feature selection methods, and normalization techniques. While we're glad we've decided to use the framework, we've had to fix a few bugs in it and we've often been frustrated by the lack of documentation. It's not very polished, and some things don't work as one would expect. Hopefully that's going to improve in future releases (it's only at version 0.09 after all; we're going to submit our patches to the maintainer). We're going to release our customized classes soon, but if anyone is interested, email me.
One definite virtue of AI::Categorizer is that the SVM and Naive Bayes categorizers that come with it have excellent default settings. We've played with many different setting combinations (described below), but the defaults turned out best aside from the changes described above and some custom feature weighting we've introduced. Our feature weighting is as follows:
title: 1.5
abstract: 1
journal: 0.5
authors: 0.5
collection title: 0.5
editors: 0.5
While these settings seemed to improve performance, the difference was not always clearly significant.
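As an illustration of what field weighting means in practice, the sketch below builds a weighted bag of words in which each occurrence of a word counts for the weight of the field it appears in. The weights are the ones listed above; the tokenization, the function name, and the default weight of 1.0 for unlisted fields are assumptions of this sketch.

```python
from collections import Counter

# Weights from the list above.
FIELD_WEIGHTS = {
    "title": 1.5,
    "abstract": 1.0,
    "journal": 0.5,
    "authors": 0.5,
    "collection title": 0.5,
    "editors": 0.5,
}


def weighted_features(entry):
    """Sum per-field weights for every lowercased, whitespace-separated token."""
    features = Counter()
    for field, text in entry.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # assumed default for unlisted fields
        for token in text.lower().split():
            features[token] += weight
    return features


entry = {
    "title": "Free Will",
    "abstract": "An essay on free will and determinism",
}
features = weighted_features(entry)
```

Here "free" ends up with a weight of 2.5 (1.5 from the title plus 1 from the abstract), while "determinism", which appears only in the abstract, gets 1.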
For stop words, we use the list provided by the Lingua::StopWords package.
We currently use a probability threshold of 0.9 for Naive Bayes.
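The threshold simply discards any category whose estimated probability is too low. A minimal sketch, assuming the classifier exposes one probability per category and that the comparison is inclusive (both are assumptions, as are the names and the toy numbers):

```python
def above_threshold(probabilities, threshold=0.9):
    """Return the categories whose probability reaches the threshold."""
    return {category for category, p in probabilities.items() if p >= threshold}


probabilities = {"Ethics": 0.97, "Logic": 0.62, "Metaphysics": 0.05}
confident = above_threshold(probabilities)
```

Raising the threshold trades coverage for precision: fewer entries get a category, but the ones that do are more likely to be right.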
χ² feature selection is done with a patched version of AI::Categorizer::FeatureSelector::ChiSquare.
What else we've tried
We've tried the b, t, n, f, p, and c feature weighting flags provided by AI::Categorizer and none helped, either individually or in combination. Bugs in some of the flags resulted in divisions by zero. We've patched that.
We've tried the polynomial and radial kernels for SVM, but the default (linear) works best.
We've tried the KNN and DecisionTree classification algorithms, but neither managed to complete the training with more than 10% of our training set (we ran out of memory on a 2GB VM). Either the algorithms or the AI::Categorizer implementations are not sufficiently efficient. Their precision was also worse than Naive Bayes and SVM with small training sets.
We've tried purifying our training set by removing from it all items which could not be successfully categorized even when they were in the training set (normally, we test with different entries than those in the training set). Surprisingly, this didn't help precision.
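The purification step can be sketched as follows: train on the full set, re-classify every training item, and keep only the items the model labels as the training data does. The word-voting classifier below is a deliberately crude stand-in so the example runs on its own; it is not Naive Bayes or SVM, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy model: count, for each word, how often it co-occurs with each label."""
    counts = defaultdict(Counter)
    for words, label in examples:
        for word in words:
            counts[word][label] += 1
    return counts


def predict(model, words):
    """Each word votes for the labels it was seen with; the top label wins."""
    votes = Counter()
    for word in words:
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def purify(examples):
    """Drop training items the model misclassifies even though it was trained on them."""
    model = train(examples)
    return [(words, label) for words, label in examples if predict(model, words) == label]


examples = [
    (("virtue", "duty"), "Ethics"),
    (("duty", "norm"), "Ethics"),
    (("proof", "axiom"), "Logic"),
    (("proof", "axiom"), "Logic"),
    (("proof", "axiom"), "Ethics"),  # noisy label
]
purified = purify(examples)
```

In this toy set the mislabeled fifth item is outvoted by the two consistent Logic items and gets removed, leaving four training items.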
We've tried to use the Rainbow classifier, but we couldn't get it to compile. Development seems to have been abandoned in 1998.
In a previous project we had tried the Naive Bayes algorithm with every heuristic and feature selection / normalization trick we could think of. We could never achieve the performance we're getting now by combining SVM and Naive Bayes.