Wednesday, August 18, 2010

Bulk ingestion protocols: which is best?

Since PhilPapers' launch, a number of publishers have contacted us asking for some means to submit their content in bulk. We didn't have this facility until recently, partly because it took us a long time to decide what kind of system to implement. Here I report on how we arrived at the current system.

We started by finding out what systems other big consumers of article-level metadata use, on the assumption that many publishers would already support them. The biggest consumer of metadata of all, PubMed, uses an XML schema defined by NLM. Atypon, a company that hosts many journals on behalf of publishers, supports this schema too (one of the first publishers to contact us was with Atypon). So we decided to implement NLM-style feeds: we run an FTP server that periodically receives zip archives of NLM XML files.
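To make the zip-based workflow concrete, here is a minimal sketch of how a received batch might be unpacked and checked. It is not our actual ingestion code: the function names are hypothetical, and the sample XML uses placeholder elements rather than real NLM ones. It only shows the general shape of the step, which is to open the archive, parse each XML member, and set aside files that are not well-formed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_xml_batch(zip_bytes):
    """Parse every .xml member of a zip archive.

    Returns (parsed, failed): parsed maps member name to its root element,
    failed lists the members that are not well-formed XML.
    """
    parsed, failed = {}, []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip READMEs, checksums, and other non-XML files
            try:
                parsed[name] = ET.fromstring(archive.read(name))
            except ET.ParseError:
                failed.append(name)
    return parsed, failed


def make_sample_zip():
    """Build an in-memory archive standing in for a publisher upload.

    The <records>/<record> elements are placeholders, not NLM element names.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("issue1.xml", "<records><record/></records>")
        archive.writestr("broken.xml", "<records>")  # deliberately malformed
        archive.writestr("README.txt", "not metadata")
    return buf.getvalue()


parsed, failed = read_xml_batch(make_sample_zip())
```

A real pipeline would go on to validate each parsed file against the NLM schema and map its fields to records. That part is left out here because it depends on schema details not covered in this post.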

This system works OK, but it turns out not to be as widely supported as we had expected. Meanwhile, PRISM tags in RSS feeds have become fairly common since we began this project, and most publishers now have RSS feeds with detailed PRISM tags. So that's the system we now recommend: it's the easiest for both us and publishers. See the details here.
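For readers who haven't seen PRISM tags in the wild, here is a rough sketch of reading them out of a feed with Python's standard library. Several things in it are assumptions: the sample feed is invented, it is RSS 2.0 (many publisher feeds are RSS 1.0/RDF instead, which needs different item lookups), and the PRISM namespace URI is the 2.0 one, whereas feeds may declare another version. The function name is hypothetical too.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; a real feed may declare a different PRISM version.
NS = {
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Hypothetical Journal of Examples</title>
    <item>
      <title>An Invented Article</title>
      <link>http://example.org/articles/1</link>
      <dc:creator>A. Author</dc:creator>
      <prism:publicationName>Hypothetical Journal of Examples</prism:publicationName>
      <prism:volume>12</prism:volume>
      <prism:number>3</prism:number>
      <prism:startingPage>101</prism:startingPage>
      <prism:endingPage>120</prism:endingPage>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict of article metadata per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.iter("item"):
        records.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "authors": [c.text for c in item.findall("dc:creator", NS)],
            "journal": item.findtext("prism:publicationName", namespaces=NS),
            "volume": item.findtext("prism:volume", namespaces=NS),
            "issue": item.findtext("prism:number", namespaces=NS),
            "first_page": item.findtext("prism:startingPage", namespaces=NS),
            "last_page": item.findtext("prism:endingPage", namespaces=NS),
        })
    return records


records = parse_items(SAMPLE_FEED)
```

Any tag missing from an item simply comes back as None, so a consumer can decide for itself which fields are required.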

Of course, RSS feeds generally don't include historical data. But nothing stops a publisher from creating a historical RSS channel containing all back issues of a journal.
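As a sketch of what such a historical channel could look like, the snippet below builds an RSS 2.0 feed with one item per back-issue article. It is only an illustration: the function name, the input dictionary keys, and the example.org URLs are made up, and the PRISM namespace URI is assumed to be the 2.0 one.

```python
import xml.etree.ElementTree as ET

# Assumed PRISM namespace URI (version 2.0).
PRISM = "http://prismstandard.org/namespaces/basic/2.0/"
ET.register_namespace("prism", PRISM)


def build_backfile_feed(journal, homepage, articles):
    """Serialize a list of article dicts as one RSS 2.0 channel with PRISM tags."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"{journal} (back issues)"
    ET.SubElement(channel, "link").text = homepage
    ET.SubElement(channel, "description").text = f"All back issues of {journal}"
    for art in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = art["title"]
        ET.SubElement(item, "link").text = art["link"]
        ET.SubElement(item, f"{{{PRISM}}}publicationName").text = journal
        ET.SubElement(item, f"{{{PRISM}}}volume").text = str(art["volume"])
        ET.SubElement(item, f"{{{PRISM}}}number").text = str(art["issue"])
    return ET.tostring(rss, encoding="unicode")


# Invented back-issue data, for illustration only.
feed_xml = build_backfile_feed(
    "Hypothetical Journal of Examples",
    "http://example.org/journal",
    [
        {"title": "An Old Article", "link": "http://example.org/a/1", "volume": 1, "issue": 1},
        {"title": "A Newer Article", "link": "http://example.org/a/2", "volume": 2, "issue": 4},
    ],
)
```

Unlike a current-issue feed, a channel like this grows with the whole archive, so in practice a publisher might split it by volume or by year.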