Wednesday, August 18, 2010

Bulk ingestion protocols: which is best?

Since PhilPapers' launch, a number of publishers have contacted us asking for some means to submit their content in bulk. We didn't have this facility until recently, partly because it took us a long time to decide what kind of system to implement. Here I report on how we've arrived at the current system.

We started off by trying to find out what kind of system other big consumers of article-level metadata use, the idea being that many publishers would already support it. It turned out that the biggest consumer of metadata of all, PubMed, uses an XML schema defined by the NLM, and that Atypon, a company which hosts many journals on behalf of publishers, supports this format too (one of the first publishers to contact us was with Atypon). So we decided to implement NLM-style feeds: publishers periodically upload zip archives of NLM XML files to our FTP server.
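To give a concrete idea of what the receiving end involves, here is a minimal sketch of the sort of script that could process such a drop, using Archive::Zip and XML::LibXML from CPAN. The drop directory and the XPath expressions are illustrative rather than a description of our actual importer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::LibXML;
use File::Temp qw(tempdir);

# Illustrative drop directory; the real location is a configuration detail.
my $drop_dir = '/srv/ftp/incoming';

for my $zip_path (glob "$drop_dir/*.zip") {
    my $zip = Archive::Zip->new();
    next unless $zip->read($zip_path) == AZ_OK;

    # Unpack into a throwaway directory.
    my $tmp = tempdir(CLEANUP => 1);
    $zip->extractTree('', "$tmp/");

    for my $xml_path (glob "$tmp/*.xml") {
        my $doc = XML::LibXML->load_xml(location => $xml_path);

        # Typical NLM article metadata; element paths vary a bit across DTD versions.
        my $journal = $doc->findvalue('//journal-meta//journal-title');
        my $title   = $doc->findvalue('//article-meta/title-group/article-title');
        my $doi     = $doc->findvalue('//article-meta/article-id[@pub-id-type="doi"]');

        print "$journal | $title | $doi\n";
    }
}
```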

This system works well enough, but it turns out not to be as widely supported as we had expected. At the same time, the use of PRISM tags in RSS feeds has become fairly common since we began this project; most publishers now have RSS feeds with detailed PRISM tags. So that's the system we now recommend. It's the easiest option both for us and for publishers. See the details here.
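To give a sense of why this is the path of least resistance, here is a minimal sketch of what consuming a PRISM-tagged feed looks like, again with XML::LibXML. The feed URL is a placeholder and the fields shown are just the ones we care most about.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Placeholder URL; any RSS 2.0 feed carrying PRISM and Dublin Core tags will do.
my $doc = XML::LibXML->load_xml(location => 'http://www.example.com/journal/rss.xml');

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(prism => 'http://prismstandard.org/namespaces/basic/2.0/');
$xpc->registerNs(dc    => 'http://purl.org/dc/elements/1.1/');

for my $item ($xpc->findnodes('//item')) {
    my $title   = $xpc->findvalue('title',                 $item);
    my @authors = map { $_->textContent } $xpc->findnodes('dc:creator', $item);
    my $journal = $xpc->findvalue('prism:publicationName', $item);
    my $volume  = $xpc->findvalue('prism:volume',          $item);
    my $issue   = $xpc->findvalue('prism:number',          $item);
    my $doi     = $xpc->findvalue('prism:doi',             $item);

    printf "%s %s(%s): %s [%s]\n", $journal, $volume, $issue, $title, join(', ', @authors);
}
```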

Of course, there's the problem that RSS feeds generally don't include historical data. But there is nothing to stop one from creating a historical RSS channel with all back issues for a journal.

Tuesday, August 17, 2010

Selecting only English-language material when harvesting OAI metadata

We're now harvesting thousands of archives for PhilPapers, as described in my earlier post. But we've stumbled on a new problem which I thought I should report on here.

We only want English-language material on PhilPapers, but a lot of archives won't return language data, or will say that an item is in English when it's not (presumably because English is the default and users don't bother to change it). This is a serious obstacle to the automatic aggregation of metadata from OAI archives if you don't want your aggregation to be swamped by material your average user will consider pure noise.

Our solution to this problem has three components. First, we weed out archives which don't declare on OpenDOAR that they have English-language content: we attempt to monitor an archive only if English is among its declared languages.

Second, we've found that language attributes tend to be truthful at least when they say that an item is not in English, so we weed out anything that is declared as not being in English.
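In code, this second filter is little more than a check on the record's dc:language value; something along these lines (the values treated as English are just the ones we expect to see in practice):

```perl
# Returns true if a record's dc:language value positively declares
# a language other than English. Missing or empty values are passed
# on to the automatic detection stage described below.
sub declared_non_english {
    my ($language) = @_;
    return 0 unless defined $language && length $language;
    return 0 if $language =~ /^en([-_]|$)/i;   # en, en-US, en_GB, ...
    return 0 if lc($language) eq 'eng';        # ISO 639-2 code
    return 1;                                  # declared as something else: weed it out
}
```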

Finally, we apply an automatic language detection test to the rest of the material. This is where it gets tricky.

We originally tried the Language::Guess class on CPAN, but it's not reliable enough.

We then tried simply checking what percentage of the words in an item's title and description are in the standard English dictionary that comes with aspell (the Unix program), but there are so many neologisms in philosophy that this excluded many English-language papers.

The final solution is to use aspell in this way, but with an enriched dictionary we compute from our existing content. Currently we add a word to our dictionary of 'neologisms' just in case it occurs in 10 or more PhilPapers entries which pass a strict English-only test. The strict test is to have fewer than 7% of words outside the standard English dictionary. We need this test because a number of non-English papers have made it into PhilPapers already.
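A stripped-down sketch of this kind of dictionary-based test is below, using the Text::Aspell binding from CPAN. The file of neologisms, the tokenization, and the cut-off used for the final test are illustrative; the 7% figure is the one from the strict test above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Aspell;

my $speller = Text::Aspell->new;
$speller->set_option('lang', 'en_US');

# Words found in 10+ entries that passed the strict test, one per line
# (file name is illustrative).
my %neologisms;
open my $fh, '<', 'philosophy_neologisms.txt' or die $!;
while (my $word = <$fh>) { chomp $word; $neologisms{lc $word} = 1 }
close $fh;

# Fraction of words recognized neither by aspell nor, optionally, by
# our home-grown dictionary of philosophical neologisms.
sub unknown_ratio {
    my ($text, $use_neologisms) = @_;
    my @words = grep { length } split /[^A-Za-z']+/, $text;
    return 0 unless @words;
    my $unknown = grep {
        !$speller->check($_) && !($use_neologisms && $neologisms{lc $_})
    } @words;
    return $unknown / @words;
}

my $text = 'Title and abstract of the entry go here';

# Strict test (standard dictionary only, 7% cut-off): decides which
# entries may contribute words to the neologism dictionary.
my $strict_pass = unknown_ratio($text, 0) < 0.07;

# Final language test: same idea, but with the enriched dictionary
# (cut-off shown is illustrative).
my $looks_english = unknown_ratio($text, 1) < 0.07;
```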

We use aspell because it's supposed to be good at recognizing inflections and the like, and it also works well for providing spelling suggestions (more on this in a later post). However, a note of caution about aspell: all characters in a custom dictionary have to be in the same Unicode block, which means a dictionary can't contain, say, both French and Polish words with the special characters specific to those languages. (This seems like a bug, because the documentation only mentions a same-script limitation.) Our solution is to remove diacritics from everything we put in the dictionary. That works for our purposes but could obviously be a major limitation.
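For what it's worth, stripping diacritics amounts to decomposing each word and dropping the combining marks, which the core Unicode::Normalize module makes easy. A sketch (not necessarily our exact transformation):

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Decompose accented characters and drop the combining marks.
# Letters with no canonical decomposition (e.g. the Polish 'ł')
# are left as-is and would need separate handling.
sub strip_diacritics {
    my ($word) = @_;
    my $decomposed = NFD($word);
    $decomposed =~ s/\p{Mn}//g;    # remove non-spacing (combining) marks
    return $decomposed;
}

print strip_diacritics('Gödel'), "\n";    # prints "Godel"
```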