xPapers

Monday, February 21, 2011

Draft manual now available

Monday, February 7, 2011

New Perl modules available

I've just uploaded to CPAN two of the core Perl modules which make xPapers' multi-source ingestion process possible:

Text::Names - contains normalization, parsing and comparison utilities for proper names.
Github: https://github.com/dbourget/Text-Names

Biblio::Citation::Compare - fuzzy comparison of citations
Github: https://github.com/dbourget/Biblio-Citation-Compare

The modules should show up in CPAN search momentarily.
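For readers who want a feel for what "fuzzy comparison of citations" involves, here is a small Python sketch. It is not the API of Text::Names or Biblio::Citation::Compare, and it is far cruder than either module: the function names, the record layout (`title`, `authors`, `year`) and the 0.9 threshold are all assumptions made up for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).strip()


def surname(name: str) -> str:
    """Guess the surname from 'Last, First' or 'First Last' (a rough heuristic)."""
    if "," in name:
        return normalize(name.split(",", 1)[0])
    parts = normalize(name).split()
    return parts[-1] if parts else ""


def same_work(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two records as the same work if years and surnames agree
    and the normalized titles are nearly identical."""
    if a.get("year") and b.get("year") and a["year"] != b["year"]:
        return False
    if {surname(n) for n in a["authors"]} != {surname(n) for n in b["authors"]}:
        return False
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold
```

Two records that differ only in capitalization, punctuation and name order ("Doe, Jane R." versus "Jane Doe") come out as the same work under this sketch. Real-world matching has to cope with much more: initials, missing years, truncated titles, and so on.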

Wednesday, December 15, 2010

Documentation on its way

We are making progress on the documentation of the API. A 'stub' which is already very useful can now be found here.

Saturday, November 20, 2010

Paper harvester and metadata extractor now available on github

I'm glad to announce that Wolfgang Schwarz has made his 'OPP-tools' paper harvester and metadata extractor available on GitHub. This is the software that PhilPapers/xPapers depends on to collect article-level metadata for papers found on personal pages (see this page). Wolfgang has been improving his software and making it re-usable by third parties as part of our JISC-funded project. It's now GPL'd and available here.

Monday, November 15, 2010

Experimental code distribution now available

I have just uploaded a functioning but still experimental version of xPapers to GitHub: https://github.com/xPapers/xPapers

[Update: I should say that this is really not meant for production use yet, and the doc is basically missing.]

Tuesday, November 9, 2010

How to make a web site re-usable by third parties

One of the big challenges of our project was to turn what was essentially designed as a regular database-driven web site into a re-usable application. By 're-usable' here we mean that your typical university sysadmin could take our package and turn it into a PhilPapers-like site that doesn't look too much like PhilPapers. No programming needed, but modifying config files and templates is OK.

The big challenge here is preserving the upgrade path. PhilPapers is built with HTML::Mason, a template system, and we have hundreds of template files of all sizes. If someone copied our source tree and started modifying the templates to suit their needs, they would soon end up with a system that is all but impossible to update with our latest code. A similar problem arises with the many data files and image files that support the site.

Our solution to this problem is to extend to templates the idea of programming by difference found in OO programming. Think of our template files as methods of a big Template class: our header file is a method, our footer file is another, and everything that goes in between is a method too. A natural way to refine an existing class is to subclass it and override just the methods you need to change, keeping the originals intact in the superclass. That's what we've done with our templates.

We achieved that by defining several component roots in Mason (the component roots are the paths relative to which Mason looks for template files). For PhilPapers we have two component roots; let's pretend they are /var/philpapers and /var/xpapers. The latter contains the default templates that ship with xPapers, while the former contains only the overrides required to give PhilPapers its unique look and structure. Suppose a request comes in for the /news.html component on PhilPapers. Mason first looks in our /var/philpapers/ tree of templates. If there's a /var/philpapers/news.html file, it will use it and ignore /var/xpapers/news.html. If not, it will fall back to /var/xpapers/news.html.
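The lookup order is easy to picture in code. Here is a minimal Python sketch of the idea (it is not Mason, and `resolve_component` and `demo` are names made up for this illustration). It builds two throwaway directories standing in for /var/philpapers and /var/xpapers and resolves a few components against them in order.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Sequence


def resolve_component(roots: Sequence[Path], component: str) -> Optional[Path]:
    """Return the first existing file for `component`, trying each root in order."""
    relative = component.lstrip("/")
    for root in roots:
        candidate = root / relative
        if candidate.is_file():
            return candidate
    return None


def demo() -> Dict[str, Optional[str]]:
    """Build a site-specific tree and a default tree, then resolve three components."""
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "philpapers"  # holds only the overrides
        base = Path(tmp) / "xpapers"     # holds the defaults that ship with the package
        site.mkdir()
        base.mkdir()
        (base / "news.html").write_text("default news")
        (base / "footer.html").write_text("default footer")
        (site / "news.html").write_text("PhilPapers news")

        roots = [site, base]  # order matters: overrides are searched first
        result: Dict[str, Optional[str]] = {}
        for name in ("/news.html", "/footer.html", "/missing.html"):
            found = resolve_component(roots, name)
            result[name] = found.read_text() if found else None
        return result
```

In this sketch /news.html resolves to the override, /footer.html falls through to the default, and /missing.html resolves to nothing. An upgrade only ever touches the default tree, so the overrides survive it.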

Another challenge is maintaining the upgrade path of the database schema. Here we plan to use the same system that we use internally to maintain the schema through git: each change to the schema is saved in a file named after the table, and we have a script that keeps track of which lines in which files have been executed. That works pretty well, except when we want to roll back some changes. When that happens we simply add more lines which have the effect of undoing the changes. There's also a theoretical issue about order of execution and foreign key constraints, but in hundreds of updates we have never run into it because we hardly use foreign key constraints with MySQL. If needed we could get around this by putting the constraints in a file whose lines are always executed last. This system isn't as robust as Ruby on Rails migrations, but it is lightweight, efficient, and (almost) fun to use.
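Here is a rough Python sketch of that bookkeeping, under several assumptions: it is not our actual script, it uses an in-memory SQLite database rather than MySQL, the `schema_log` table and the `apply_pending` function are invented for the example, and it assumes schema files are append-only, so remembering a line count per file is enough.

```python
import sqlite3
from typing import Dict, List


def apply_pending(conn: sqlite3.Connection, schema_files: Dict[str, List[str]]) -> int:
    """Execute every statement not yet applied; return how many were run.

    `schema_files` maps a table name to the ordered list of SQL lines in
    that table's file (a stand-in for reading the files from the git tree).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_log ("
        "file TEXT PRIMARY KEY, lines_done INTEGER NOT NULL)"
    )
    executed = 0
    for name in sorted(schema_files):
        row = conn.execute(
            "SELECT lines_done FROM schema_log WHERE file = ?", (name,)
        ).fetchone()
        done = row[0] if row else 0
        # Only the lines appended since the last run are executed.
        for statement in schema_files[name][done:]:
            conn.execute(statement)
            executed += 1
        conn.execute(
            "INSERT OR REPLACE INTO schema_log (file, lines_done) VALUES (?, ?)",
            (name, len(schema_files[name])),
        )
    conn.commit()
    return executed
```

Running it twice in a row executes nothing the second time. Appending a line to a file (including one that undoes an earlier change) causes just that line to run on the next pass.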

Wednesday, August 18, 2010

Bulk ingestion protocols, which is best?

Since PhilPapers' launch, a number of publishers have contacted us asking for some means to submit their content in bulk. We didn't have this facility until recently, partly because it took us a long time to decide what kind of system to implement. Here I report on how we've arrived at the current system.

We started off by trying to find out what kind of system is used by other big consumers of article-level metadata, on the assumption that many publishers would already support it. We found out that the biggest consumer of metadata of all, PubMed, uses an XML schema defined by NLM. We also found out that Atypon, a company which hosts many journals on behalf of publishers, supports this system too (one of the first publishers to contact us was with Atypon). So we decided to implement NLM-style feeds. Under this system, we run an FTP server that periodically receives zip archives of NLM XML files.

This system works OK, but it turns out not to be as widely supported as we had expected. At the same time, the use of PRISM tags in RSS feeds has become fairly common since we began this project. Most publishers now have RSS feeds with detailed PRISM tags, so that's the system we now recommend. It's the easiest option both for us and for publishers. See the details here.
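To give an idea of what consuming such a feed involves, here is a short Python sketch that pulls article-level metadata out of an RSS item carrying PRISM and Dublin Core tags. It is not our ingestion code. The sample feed, journal and DOI are invented, and the namespace URI assumes the PRISM 2.0 basic vocabulary; a real feed may declare a different PRISM version or use RSS 1.0 (RDF) instead.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}

# Invented sample feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
  <channel>
    <title>Journal of Hypothetical Studies</title>
    <item>
      <title>An Invented Article</title>
      <link>http://example.org/articles/1</link>
      <dc:creator>Doe, Jane</dc:creator>
      <prism:publicationName>Journal of Hypothetical Studies</prism:publicationName>
      <prism:volume>12</prism:volume>
      <prism:number>3</prism:number>
      <prism:startingPage>101</prism:startingPage>
      <prism:endingPage>120</prism:endingPage>
      <prism:doi>10.0000/example.1</prism:doi>
    </item>
  </channel>
</rss>
"""

PRISM_FIELDS = ("publicationName", "volume", "number", "startingPage", "endingPage", "doi")


def parse_items(feed_xml: str) -> List[Dict[str, object]]:
    """Return one metadata record per <item>, keeping only the tags that are present."""
    root = ET.fromstring(feed_xml)
    records: List[Dict[str, object]] = []
    for item in root.iterfind("./channel/item"):
        record: Dict[str, object] = {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "authors": [el.text for el in item.findall("dc:creator", NS) if el.text],
        }
        for field in PRISM_FIELDS:
            value = item.findtext(f"prism:{field}", default=None, namespaces=NS)
            if value is not None:
                record[field] = value
        records.append(record)
    return records
```

Calling `parse_items(SAMPLE_FEED)` yields a single record with the title, authors, journal name, volume, issue, page range and DOI, which is enough to build a citation.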

Of course, there's the problem that RSS feeds generally don't include historical data. But there is nothing to stop one from creating a historical RSS channel with all back issues for a journal.
