Saturday, November 20, 2010

Paper harvester and metadata extractor now available on GitHub

I'm glad to announce that Wolfgang Schwarz has made his 'OPP-tools' paper harvester and metadata extractor available on GitHub. This is the software that PhilPapers/xPapers depends on to collect article-level metadata for papers found on personal pages (see this page). Wolfgang has been improving his software and making it re-usable by third parties as part of our JISC-funded project. It's now GPL'd and available here.

Monday, November 15, 2010

Experimental code distribution now available

I have just uploaded a functioning but still experimental version of xPapers to GitHub: https://github.com/xPapers/xPapers

[Update: I should say that this is really not meant for production use yet, and the documentation is basically missing.]

Tuesday, November 9, 2010

How to make a web site re-usable by third parties

One of the big challenges of our project was to turn what was essentially designed as a regular database-driven web site into a re-usable application. By 're-usable' here we mean that your typical university sysadmin could take our package and turn it into a PhilPapers-like site that doesn't look too much like PhilPapers. No programming needed, but modifying config files and templates is OK.

The big challenge here is preserving the upgrade path. PhilPapers is built with HTML::Mason, a template system. We have hundreds of template files of all sizes. If someone copied our source tree and started modifying the templates to adapt them to their needs, they would soon end up with a system that is all but impossible to update with our latest code. A similar problem arises with the many data files and image files that support the site.

Our solution to this problem is to extend the concept of differential programming found in OO programming to templates. Think of our template files as methods of a big Template class. So our header file is a method, our footer file is another, and everything that goes in between is a method too. A natural way to refine an existing class is to subclass it and override just the methods you need to change, keeping the originals intact in the superclass. That's what we've done with our templates.

We achieved that by defining several component roots in Mason (the component roots are paths relative to which Mason looks for template files). Suppose we have an incoming request for the /news.html component on PhilPapers. For PhilPapers we have two component roots; let's pretend they are /var/philpapers and /var/xpapers. The latter contains the default templates that ship with xPapers, while the former contains only the overrides required to give PhilPapers its unique look and structure. If template /news.html is requested, Mason first looks in our /var/philpapers/ tree of templates. If there's a /var/philpapers/news.html file, it will use it and ignore /var/xpapers/news.html. If not, it will fall back to /var/xpapers/news.html.
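To make the lookup order concrete, here is a minimal sketch in Python. It is not Mason's API and not our actual code. The function name resolve_component and the temporary directories standing in for /var/philpapers and /var/xpapers are made up for illustration; the only thing it is meant to show is "first root that has the file wins".

```python
import os
import tempfile


def resolve_component(component_path, roots):
    """Return the first existing file for component_path, trying roots in order.

    Hypothetical helper: it mimics the lookup order described above,
    it is not how Mason itself is implemented.
    """
    relative = component_path.lstrip("/")
    for root in roots:
        candidate = os.path.join(root, relative)
        if os.path.isfile(candidate):
            return candidate
    return None  # no root provides this component


def demo_lookup():
    """Build two throwaway roots and resolve three components against them."""
    with tempfile.TemporaryDirectory() as base:
        site_root = os.path.join(base, "philpapers")    # overrides only
        default_root = os.path.join(base, "xpapers")    # shipped defaults
        os.makedirs(site_root)
        os.makedirs(default_root)

        # The defaults provide both templates; the site overrides only news.html.
        for name in ("news.html", "about.html"):
            with open(os.path.join(default_root, name), "w") as fh:
                fh.write("default template")
        with open(os.path.join(site_root, "news.html"), "w") as fh:
            fh.write("site-specific override")

        roots = [site_root, default_root]  # order matters: overrides first
        news = resolve_component("/news.html", roots)
        about = resolve_component("/about.html", roots)
        missing = resolve_component("/missing.html", roots)
        return (
            os.path.relpath(news, base),
            os.path.relpath(about, base),
            missing,
        )


if __name__ == "__main__":
    print(demo_lookup())
```

The point of the design is that the overrides tree stays small: upgrading means replacing the defaults tree wholesale, and anything the site has not overridden picks up the new version automatically.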

Another challenge is maintaining the upgrade path of the database schema. Here we plan to use the same system that we use internally to maintain the schema through git: each change to the schema is saved in a file with the table's name, and we have a script that keeps track of which lines in which files have been executed. That works pretty well, except when we want to roll back some changes. When that happens we simply add more lines which have the effect of rolling back the changes. There's also a theoretical issue about order of execution and foreign key constraints, but in hundreds of updates we have never run into that because we hardly use foreign key constraints with MySQL. If needed we could get around this by adding the constraints to a file whose lines are always executed last. This system isn't as robust as Ruby on Rails migrations, but it is lightweight, efficient, and (almost) fun to use.
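For the curious, the line-tracking idea can be sketched in a few lines of Python. This is an illustration under assumptions, not our script: it uses an in-memory SQLite database instead of MySQL, represents each schema file as a list of one-statement lines, and the names apply_pending and schema_progress are invented for the example.

```python
import sqlite3


def apply_pending(conn, schema_files):
    """Execute only the lines of each schema file that have not been run yet.

    schema_files maps a file name (one per table) to a list of SQL
    statements, one statement per line. Progress is stored in a
    bookkeeping table (hypothetical name: schema_progress).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_progress ("
        "file_name TEXT PRIMARY KEY, lines_done INTEGER NOT NULL)"
    )
    executed = []
    for file_name in sorted(schema_files):
        lines = schema_files[file_name]
        row = conn.execute(
            "SELECT lines_done FROM schema_progress WHERE file_name = ?",
            (file_name,),
        ).fetchone()
        done = row[0] if row else 0
        # Run only the lines appended since the last run.
        for statement in lines[done:]:
            conn.execute(statement)
            executed.append(statement)
        conn.execute(
            "INSERT OR REPLACE INTO schema_progress (file_name, lines_done) "
            "VALUES (?, ?)",
            (file_name, len(lines)),
        )
    conn.commit()
    return executed


def demo_schema():
    """Apply a schema, append a change, then re-run with nothing new."""
    conn = sqlite3.connect(":memory:")
    files = {
        "papers.sql": ["CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT)"]
    }
    first = apply_pending(conn, files)
    # A later change (or a rollback) is just another appended line.
    files["papers.sql"].append("ALTER TABLE papers ADD COLUMN year INTEGER")
    second = apply_pending(conn, files)
    third = apply_pending(conn, files)  # nothing new to execute
    conn.close()
    return first, second, third


if __name__ == "__main__":
    for batch in demo_schema():
        print(batch)
```

Because lines are only ever appended, a rollback is expressed as one more line that undoes the earlier one, which is exactly the limitation mentioned above.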