Tuesday, August 17, 2010

Selecting only English-language material when harvesting OAI metadata

We're now harvesting thousands of archives for PhilPapers, as described in my earlier post. But we've stumbled on a new problem which I thought I should report on here.

We only want English-language material on PhilPapers, but a lot of archives don't return language data, or say that an item is in English when it isn't (presumably because English is the default and users don't bother to change it). This is a serious obstacle to the automatic aggregation of metadata from OAI archives if you don't want your aggregation to be swamped by material your average user will consider pure noise.

Our solution to this problem has three components. First, we weed out archives which don't declare that they have English-language content on OpenDOAR. So we attempt to monitor an archive only if it says that it has material in English among other languages.

Second, we've found that language attributes tend to be truthful at least when they say that an item is not in English, so we weed out anything that is declared as not being in English.
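To illustrate this second step, here is a minimal Python sketch of a declared-language filter. It is not our production code: the function names and the set of accepted spellings of "English" are assumptions made for the example, and real archives may use other values in their language fields.

```python
# Hypothetical spellings of "English" that an archive might declare.
ENGLISH_CODES = {"en", "eng", "english"}


def normalize_language(value):
    """Lower-case a declared language and drop any region suffix (en-US -> en)."""
    value = value.strip().lower()
    for sep in ("-", "_"):
        value = value.split(sep, 1)[0]
    return value


def declared_non_english(language_values):
    """Return True only if languages are declared and none of them is English."""
    declared = [normalize_language(v) for v in language_values if v and v.strip()]
    if not declared:
        # No language data at all: keep the record for the detection step.
        return False
    return not any(code in ENGLISH_CODES for code in declared)
```

A record is dropped only when `declared_non_english` returns True. Records with no language data, or with a possibly false claim of English, are passed on to the detection test.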

Finally, we apply an automatic language detection test to the rest of the material. This is where it gets tricky.

We originally tried the Language::Guess module on CPAN, but it was not reliable enough.

We then tried simply checking what percentage of the words in an item's title and description are in the standard English dictionary that comes with aspell (the Unix program), but there are so many neologisms in philosophy that this excluded many English-language papers.
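The dictionary test can be sketched as follows. This is a simplified Python stand-in, not what we run: it assumes the dictionary is available as a plain set of lower-case words, whereas aspell also handles inflections. The function name and the tokenizing pattern are invented for the example.

```python
import re

# Runs of letters, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def unknown_word_ratio(text, dictionary):
    """Return the fraction of words in text that are not in dictionary."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        # Nothing to judge; the caller decides how to treat empty text.
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)
```

With a toy dictionary containing "the", "problem" and "of", the title "The problem of qualia" scores 0.25, because "qualia" is the only unknown word out of four. This is exactly the kind of case that makes a standard dictionary too strict for philosophy.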

The final solution is to use aspell in this way, but with an enriched dictionary that we compute from our existing content. Currently we add a word to our dictionary of 'neologisms' when it occurs in 10 or more PhilPapers entries which pass a strict English-only test. The strict test is to have fewer than 7% of words not in the standard English dictionary. We need this test because a number of non-English papers have made it into PhilPapers already.
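Here is a rough Python sketch of how such a neologism list could be computed. The 7% and 10-entry thresholds are the ones given above; everything else (the function names, the tokenizer, treating each entry as one string, and using a plain set in place of aspell) is an assumption for the sake of the example.

```python
import re
from collections import Counter

STRICT_MAX_UNKNOWN = 0.07  # strict test: fewer than 7% of words unknown
MIN_ENTRIES = 10           # a word must occur in at least this many entries


def tokenize(text):
    """Split text into lower-case runs of letters."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def passes_strict_test(words, standard_dict):
    """True if fewer than 7% of the words are missing from standard_dict."""
    if not words:
        return False
    unknown = sum(1 for w in words if w not in standard_dict)
    return unknown / len(words) < STRICT_MAX_UNKNOWN


def build_neologisms(entries, standard_dict, min_entries=MIN_ENTRIES):
    """Collect unknown words found in enough strictly-English entries."""
    counts = Counter()
    for text in entries:
        words = tokenize(text)
        if not passes_strict_test(words, standard_dict):
            continue  # probably not English, so don't learn from it
        # Use a set so that each word counts once per entry.
        counts.update({w for w in words if w not in standard_dict})
    return {w for w, n in counts.items() if n >= min_entries}
```

Counting entries rather than total occurrences means a single paper that repeats an odd word many times cannot get that word into the dictionary on its own.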

We use aspell because it's supposed to be good at recognizing inflections and the like, and it also works well for providing spelling suggestions (more on this in a later post). However, a note of caution about aspell: all characters in a custom dictionary have to be in the same Unicode block, which means the dictionary can't contain, say, both French and Polish words with special characters specific to these languages. (This seems like a bug, because the doc only talks about a same-script limitation.) Our solution is to remove diacritics from everything we put in the dictionary. That works for our purposes but could obviously be a major limitation.
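One common way to remove diacritics, sketched here in Python, is to decompose each character with Unicode normalization and drop the combining marks. This illustrates the general technique and is not necessarily how our own code does it; the function name is made up for the example.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. turn 'problème' into 'probleme'."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that letters such as the Polish "ł" have no Unicode decomposition, so this approach leaves them untouched. They would need an explicit mapping to be folded to plain ASCII.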

