xPapers is a web portal framework for disciplinary virtual research environments. It comes with tools for:

  • indexing research content from journals, personal pages, OAI archives, and libraries
  • searching and monitoring the indexed content
  • locally archiving articles (it's a full OAI archive)
  • implementing event announcements, discussion spaces, and many other VRE services

xPapers is distributed through GitHub.

The best way to discover xPapers is to try PhilPapers, our prototype implementation.

You may also be interested in opp-tools, a standalone component of the xPapers platform used to monitor and extract metadata and citations from papers found on personal web sites. opp-tools is distributed and documented independently.

Friday, December 16, 2011

Work in progress on MySQL table lock problem

The table called 'main' contains the core bibliographic information for xPapers entries. Right now this is a MyISAM table because we need fulltext indexes on it and InnoDB isn't fast enough for other queries we run on this table. In the past we've tried using InnoDB and putting the fulltext-searchable fields in a separate table but the result wasn't good; the joins were too costly.

PhilPapers has been running up against the limitations of MyISAM for a while now. The problem is that updates lock the entire table on MyISAM. With 15,000 users a day at peak, many of whom update bibliographic data, PhilPapers is getting too busy for this architecture.

The solution I currently prefer is to use two copies of the main table: one for authenticated users (who can update it), and a read-only copy for anonymous users. The read-only copy would be updated periodically, maybe every hour or so. The lag wouldn't normally be apparent because selects made by authenticated users would be routed to the master copy they are updating.

I hope that splitting the load this way is going to help a lot because only 16% of search queries on PhilPapers are made by authenticated users. If clashes are spread evenly across queries, this strategy should eliminate about 84% of them. The strategy could also be extended by keeping track of which users have recently updated the main table and routing only those users to the master. In this way I think we'd eliminate virtually all clashes. We could also route all queries of a certain type to the read-only copy.
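The extended routing rule can be sketched as follows. This is a minimal illustration in Python, not xPapers code: the class name, the method names, and the one-hour window are assumptions made for the example. The window should be at least as long as the interval between rebuilds of the read-only copy, so that a user never reads a copy that predates their own update.

```python
import time

MASTER = "main"        # copy that authenticated users update
READ_ONLY = "main_ro"  # periodically rebuilt copy


class TableRouter:
    """Decide which copy of the table a select should read from (hypothetical helper)."""

    def __init__(self, sticky_seconds=3600.0, clock=time.time):
        # sticky_seconds: how long a user keeps reading from the master
        # after their last update (assumed value, tune to the rebuild interval).
        self.sticky_seconds = sticky_seconds
        self.clock = clock
        self.last_update = {}  # user id -> time of that user's last update

    def record_update(self, user_id):
        """Call this whenever a user writes to the master copy."""
        self.last_update[user_id] = self.clock()

    def table_for_select(self, user_id=None):
        """Anonymous users and users with no recent update read the read-only copy."""
        if user_id is None:
            return READ_ONLY
        updated_at = self.last_update.get(user_id)
        if updated_at is not None and self.clock() - updated_at < self.sticky_seconds:
            return MASTER
        return READ_ONLY
```

Routing every authenticated user to the master, as in the simpler version of the strategy, would amount to returning MASTER whenever user_id is not None.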

One remaining question is how to do the copying. I've considered using MySQL replication, but it's not clear that it's possible to replicate tables/databases within a single server, and I'm reluctant to introduce a dependency on another machine/VM. Anyway, since replication is almost synchronous this wouldn't solve the problem: the replication updates would cause as many locks on the 'read-only' copy as the user updates cause on the master copy. The approach I favour at this point is to rebuild the read-only table like this on a periodic basis:

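-- 1. empty copy with the same columns and indexes as main
create table main_ro_tmp like main;
-- 2. copy the live rows (the slow step)
insert into main_ro_tmp select * from main where not deleted;
-- 3. drop the old read-only copy
drop table main_ro;
-- 4. put the new copy in its place
rename table main_ro_tmp to main_ro;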

I use a temporary table to build the new version because the insert statement currently takes about 4-5 minutes due to all the indexes on this table. It would be a little faster to create the indexes after having inserted the data, but I don't think the speed-up is worth the added code complexity (I couldn't use "like main" anymore to automatically track whatever the index config on main is).

Ideally there'd be a way of blocking selects on main_ro between statements 3 and 4, but "lock tables main_ro write, main_ro_tmp write" doesn't work: MySQL complains that there's a transaction in progress. It looks like renaming is incompatible with locks. Any suggestions appreciated, though I think statement 4 will execute fast enough that it's not a major concern.
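One possible way to close the gap without explicit locks: MySQL's RENAME TABLE accepts several renames in a single statement and performs them as one atomic operation, so the old copy can be moved aside and the new one moved into place in the same step, then the old copy dropped afterwards. The sketch below only generates the statements. The function name and the main_ro_old table name are made up for the example, and it assumes main_ro already exists from a previous run.

```python
def rebuild_statements(source="main", target="main_ro"):
    """Return the SQL statements for one rebuild of the read-only copy.

    Hypothetical helper: the "_tmp" and "_old" suffixes are naming
    assumptions, and the target table must already exist.
    """
    tmp = f"{target}_tmp"
    old = f"{target}_old"
    return [
        # Clean up leftovers from a rebuild that was interrupted.
        f"DROP TABLE IF EXISTS {tmp}, {old}",
        # Same columns and indexes as the source table.
        f"CREATE TABLE {tmp} LIKE {source}",
        # The slow step: selects keep hitting the existing target meanwhile.
        f"INSERT INTO {tmp} SELECT * FROM {source} WHERE NOT deleted",
        # Both renames happen in one atomic statement, so there is no moment
        # at which the target table is missing.
        f"RENAME TABLE {target} TO {old}, {tmp} TO {target}",
        # The old copy is no longer visible under the target name.
        f"DROP TABLE {old}",
    ]


if __name__ == "__main__":
    for statement in rebuild_statements():
        print(statement + ";")
```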

Monday, February 7, 2011

New Perl modules available

I've just uploaded to CPAN two of the core Perl modules which make xPapers' multi-source ingestion process possible:

Text::Names - contains normalization, parsing and comparison utilities for proper names.
GitHub: https://github.com/dbourget/Text-Names

Biblio::Citation::Compare - fuzzy comparison of citations
GitHub: https://github.com/dbourget/Biblio-Citation-Compare

The modules should show up in CPAN search momentarily.

Wednesday, December 15, 2010

Documentation on its way

We are making progress on the documentation of the API. A 'stub' which is already very useful can now be found here.

Saturday, November 20, 2010

Paper harvester and metadata extractor now available on GitHub

I'm glad to announce that Wolfgang Schwarz has made his 'OPP-tools' paper harvester and metadata extractor available on GitHub. This is the software that PhilPapers/xPapers depends on to collect article-level metadata for papers found on personal pages (see this page). Wolfgang has been improving his software and making it re-usable by third parties as part of our JISC-funded project. It's now GPL'd and available here.

Monday, November 15, 2010

Experimental code distribution now available

I have just uploaded a functioning but still experimental version of xPapers to GitHub: https://github.com/xPapers/xPapers

[Update: I should say that this is really not meant for production use yet, and the doc is basically missing.]

Tuesday, November 9, 2010

How to make a web site re-usable by third parties

One of the big challenges of our project was to turn what was essentially designed as a regular database-driven web site into a re-usable application. By 're-usable' here we mean that your typical university sysadmin could take our package and turn it into a PhilPapers-like site that doesn't look too much like PhilPapers. No programming needed, but modifying config files and templates is OK.

The big challenge here is preserving the upgrade path. PhilPapers is built with HTML::Mason, a template system. We have hundreds of template files of all sizes. If someone copied our source tree and started modifying the templates to adapt them to their needs, they would soon end up with a system that is all but impossible to update with our latest code. A similar problem arises with the many data files and image files that support the site.

Our solution to this problem is to extend the concept of differential programming found in OO programming to templates. Think of our template files as methods of a big Template class. So our header file is a method, our footer file is another, and everything that goes in-between is a method too. A natural way to refine an existing class is to override just the methods you need to change, keeping the originals intact in the superclass. That's what we've done with our templates.

We achieved that by defining several component roots in Mason (the component roots are paths relative to which Mason looks for template files). Suppose we have an incoming request for the /news.html component on PhilPapers. For PhilPapers we have two component roots; let's pretend they are /var/philpapers and /var/xpapers. The latter contains the default templates that ship with xPapers, while the former contains only the overrides required to give PhilPapers its unique look and structure. If template /news.html is requested, Mason first looks in our /var/philpapers/ tree of templates. If there's a /var/philpapers/news.html file, it will use it and ignore /var/xpapers/news.html. If not, it will fall back to /var/xpapers/news.html.
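The lookup order can be illustrated with a few lines of Python. This is only a sketch of the idea, not Mason's implementation: resolve_component and demo are hypothetical names, and the demo builds two throwaway directories standing in for /var/philpapers and /var/xpapers.

```python
import os
import tempfile


def resolve_component(component_path, component_roots):
    """Return the first existing file for component_path, searching roots in order."""
    relative = component_path.lstrip("/")
    for root in component_roots:
        candidate = os.path.join(root, relative)
        if os.path.isfile(candidate):
            return candidate  # an override in an earlier root wins
    return None  # no root provides this component


def demo():
    """Build a site root with one override and a default root, then resolve paths."""
    with tempfile.TemporaryDirectory() as base:
        site_root = os.path.join(base, "philpapers")   # overrides only
        default_root = os.path.join(base, "xpapers")   # shipped defaults
        layout = (
            (site_root, ["news.html"]),
            (default_root, ["news.html", "footer.html"]),
        )
        for root, names in layout:
            os.makedirs(root)
            for name in names:
                with open(os.path.join(root, name), "w") as handle:
                    handle.write(name)
        roots = [site_root, default_root]
        found = {}
        for path in ("/news.html", "/footer.html", "/missing.html"):
            hit = resolve_component(path, roots)
            found[path] = None if hit is None else os.path.relpath(hit, base)
        return found
```

In the demo, /news.html resolves to the copy under the site root because that root is searched first, while /footer.html comes from the defaults because the site root does not override it.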

Another challenge is maintaining the upgrade path of the database schema. Here we plan to use the same system that we use internally to maintain the schema through git: each change to the schema is saved in a file with the table's name, and we have a script that keeps track of which lines in which files have been executed. That works pretty well, except when we want to roll back some changes. When that happens we simply add more lines which have the effect of rolling back the changes. There's also a theoretical issue about order of execution and foreign key constraints, but in hundreds of updates we have never run into that because we hardly use foreign key constraints with MySQL. If needed we could get around this by adding the constraints to a file whose lines are always executed last. This system isn't as robust as Rails migrations, but it is lightweight, efficient, and (almost) fun to use.
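The bookkeeping behind this scheme can be sketched as follows. This is not the xPapers script, only a minimal illustration under stated assumptions: each per-table file holds one statement per line, files are append-only, and the tracking state is a count of executed lines per file. The function names and data shapes are made up for the example.

```python
def pending_statements(schema_files, executed_counts):
    """List the statements that have not been executed yet.

    schema_files: dict mapping a table name to the lines of its change file,
                  in file order (assumption: one statement per line).
    executed_counts: dict mapping a table name to the number of lines of its
                     file that have already been run.
    Returns a list of (table, line_number, statement) tuples.
    """
    pending = []
    for table in sorted(schema_files):
        lines = schema_files[table]
        done = executed_counts.get(table, 0)
        for number, statement in enumerate(lines[done:], start=done + 1):
            pending.append((table, number, statement))
    return pending


def apply_pending(schema_files, executed_counts, execute):
    """Run each pending statement through execute() and record progress."""
    for table, number, statement in pending_statements(schema_files, executed_counts):
        execute(statement)
        # Record progress after every statement so that a failure part-way
        # through does not cause earlier lines to be run twice.
        executed_counts[table] = number
    return executed_counts
```

A rollback, as described above, is just another line appended to the relevant file, so it is picked up like any other pending statement on the next run.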