Tuesday, August 17, 2010

Selecting only English-language material when harvesting OAI metadata

We're now harvesting thousands of archives for PhilPapers, as described in my earlier post. But we've stumbled on a new problem which I thought I should report on here.

We only want English-language material on PhilPapers, but a lot of archives won't return language data, or will say that an item is in English when it's not (presumably because English is the default and users don't bother to change it). This is a serious obstacle to the automatic aggregation of metadata from OAI archives if you don't want your aggregation to be swamped by material your average user will consider pure noise.

Our solution to this problem has three components. First, we weed out archives which don't declare that they have English-language content on OpenDOAR. So we attempt to monitor an archive only if it says that it has material in English among other languages.

Second, we've found that language attributes tend to be truthful at least when they say that an item is not in English, so we weed out anything that is declared as not being in English.
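To illustrate the second step, here is a minimal sketch in Python (used purely for illustration; it is not our production code). The function name, the set of accepted codes, and the handling of regional suffixes such as "en-US" are all assumptions. An item with no language data is not rejected here, since it still has to go through automatic detection.

```python
import re

# Assumed spellings of "English" in a record's language field (hypothetical list).
ENGLISH_CODES = {"en", "eng", "english"}

def declared_non_english(languages):
    """Return True only if the record declares languages and none of them is English."""
    normalized = []
    for value in languages:
        if value and value.strip():
            # Drop regional suffixes: "en-US" and "en_GB" both become "en".
            normalized.append(re.split(r"[-_]", value.strip().lower())[0])
    if not normalized:
        return False  # no language data: leave the decision to language detection
    return not any(code in ENGLISH_CODES for code in normalized)
```

With this sketch, a record tagged only "fr" is dropped, while a record tagged "de" and "en", or one with no tag at all, is kept for the detection step.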

Finally, we apply an automatic language detection test to the rest of the material. This is where it gets tricky.

We originally tried the Language::Guess class on CPAN, but it's not reliable enough.

We then tried simply checking what percentage of the words in an item's title and description are in the standard English dictionary that comes with aspell (the Unix program), but there are so many neologisms in philosophy that this excluded many English-language papers.
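The percentage check can be sketched roughly as follows. This is a simplified Python illustration in which a plain set of lowercase words stands in for the aspell dictionary; the tokenizer and the function name are assumptions, and unlike aspell the set lookup knows nothing about inflections.

```python
import re

# Runs of letters, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def unknown_fraction(text, dictionary):
    """Fraction of the words in `text` that are missing from `dictionary`."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 1.0  # nothing to judge: treat as entirely unknown
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)

# A tiny stand-in dictionary; "qualia" is the kind of word a standard list may lack.
standard = {"the", "problem", "of", "consciousness"}
fraction = unknown_fraction("The problem of qualia", standard)  # 1 of 4 words unknown
```

On short titles a single neologism moves the fraction a lot, which is why a plain dictionary rejects so many English-language philosophy papers.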

The final solution is to use aspell in this way, but with an enriched dictionary we compute based on our existing content. We currently add a word to our dictionary of 'neologisms' if it occurs in 10 or more PhilPapers entries that pass a strict English-only test. The strict test is to have fewer than 7% of words not in the standard English dictionary. We need this test because a number of non-English papers have already made it into PhilPapers.
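Here is a rough Python sketch of how such a neologism list could be computed. The thresholds (10 entries, 7%) come from the description above; everything else (the function names, the tokenizer, and the use of a plain set in place of aspell's standard dictionary) is an assumption made for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def tokenize(text):
    """Lowercased runs of letters; a stand-in for real tokenization."""
    return [w.lower() for w in WORD_RE.findall(text)]

def build_neologisms(entries, standard_dict, min_entries=10, max_unknown=0.07):
    """Collect unknown words that occur in at least `min_entries` strictly-English entries."""
    entry_counts = Counter()
    for text in entries:
        words = tokenize(text)
        if not words:
            continue
        unknown = [w for w in words if w not in standard_dict]
        # Strict test: fewer than 7% of the entry's words may be unknown.
        if len(unknown) / len(words) >= max_unknown:
            continue
        # Count each word once per entry, not once per occurrence.
        entry_counts.update(set(unknown))
    return {w for w, n in entry_counts.items() if n >= min_entries}
```

Counting entries rather than occurrences keeps a single paper that repeats an odd word many times from getting that word into the dictionary on its own.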

We use aspell because it's supposed to be good at recognizing inflections and the like, and it also works well for providing spelling suggestions (more on this in a later post). However, a note of caution about aspell: all characters in a custom dictionary have to be in the same Unicode block, which means the dictionary can't contain, say, both French and Polish words with special characters specific to these languages. (This seems like a bug, because the documentation only talks about a same-script limitation.) Our solution is to remove diacritics from everything we put in the dictionary. That works for our purposes but could obviously be a major limitation.
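One common way to remove diacritics, shown here as a Python sketch rather than as the code we actually run, is to decompose each character with Unicode normalization and drop the combining marks. The function name is hypothetical.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, "café" becomes "cafe" and "Gödel" becomes "Godel". Letters that have no canonical decomposition, such as the Polish "ł", pass through unchanged, so an approach like this would need an extra mapping table for them.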

