Wednesday, December 15, 2010

Documentation on its way

We are making progress on the documentation of the API. A 'stub', which is already very useful, can now be found here.

Saturday, November 20, 2010

Paper harvester and metadata extractor now available on github

I'm glad to announce that Wolfgang Schwarz has made his 'OPP-tools' paper harvester and metadata extractor available on Github. This is the software that PhilPapers/xPapers depends on to collect article-level metadata for papers found on personal pages (see this page). Wolfgang has been improving his software and making it re-usable by third parties as part of our JISC-funded project. It's now GPL'd and available here.

Monday, November 15, 2010

Experimental code distribution now available

I have just uploaded a functioning but still experimental version of xPapers on github: https://github.com/xPapers/xPapers

[Update: I should say that this is really not meant for production use yet, and the doc is basically missing.]

Tuesday, November 9, 2010

How to make a web site re-usable by third parties

One of the big challenges of our project was to turn what was essentially designed as a regular database-driven web site into a re-usable application. By 're-usable' here we mean that your typical university sysadmin could take our package and turn it into a PhilPapers-like site that doesn't look too much like PhilPapers. No programming needed, but modifying config files and templates is OK.

The big challenge here is preserving the upgrade path. PhilPapers is built with HTML::Mason, a template system. We have hundreds of template files of all sizes. If someone copied our source tree and started modifying the templates to adapt them to their needs, they would soon end up with a system that is all but impossible to update with our latest code. A similar problem arises with the many data files and image files that support the site.

Our solution to this problem is to extend the concept of differential programming found in OO programming to templates. Think of our template files as methods of a big Template class. So our header file is a method, our footer file is another, and everything that goes in-between is a method too. A natural way to refine an existing class is to override just the methods you need to change, keeping the originals intact in the superclass. That's what we've done with our templates.

We achieved that by defining several component roots in Mason (the component roots are paths relative to which Mason looks for template files). Suppose we have an incoming request for the /news.html component on PhilPapers. For PhilPapers we have two component roots; let's pretend they are /var/philpapers and /var/xpapers. The latter contains the default templates that ship with xPapers, while the former contains only the overrides required to give PhilPapers its unique look and structure. If template /news.html is requested, Mason first looks in our /var/philpapers/ tree of templates. If there's a /var/philpapers/news.html file, it will use it and ignore /var/xpapers/news.html. If not, it will revert to /var/xpapers/news.html.
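For readers unfamiliar with Mason, the configuration looks roughly like this (a minimal sketch using the pretend paths above, not our actual handler setup):

use HTML::Mason::Interp;

# Mason searches component roots in order, so the site-specific tree can
# override any template that ships with xPapers.
my $interp = HTML::Mason::Interp->new(
    comp_root => [
        [ site    => '/var/philpapers' ],   # overrides only
        [ xpapers => '/var/xpapers'    ],   # default xPapers templates
    ],
    data_dir  => '/var/mason_data',         # hypothetical data directory
);

# A request for /news.html resolves to /var/philpapers/news.html if that
# file exists, and falls back to /var/xpapers/news.html otherwise.
$interp->exec('/news.html');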

Another challenge is maintaining the upgrade path of the database schema. Here we plan to use the same system that we use internally to maintain the schema through git: each change to the schema is saved in a file with the table's name, and we have a script that keeps track of what lines in what files have been executed. That works pretty well, except when we want to roll back some changes. When that happens we simply add more lines which have the effect of rolling back the changes. There's also a theoretical issue about order of execution and foreign key constraints, but in hundreds of updates we have never run into that because we hardly use foreign key constraints with MySQL. If needed we could get around this by adding the constraints to a file whose lines are always executed last. This system isn't as robust as Ruby migrations, but it is lightweight, efficient, and (almost) fun to use.
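To give a concrete idea of the schema-update scheme, here is a rough sketch of the approach (not our actual script); it assumes one SQL statement per line and a local state file recording how many lines of each schema file have already been executed:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Storable qw(retrieve nstore);

my $dbh   = DBI->connect('dbi:mysql:dbname=xpapers', 'user', 'password',
                         { RaiseError => 1 });
my $state = -e 'schema_state.db' ? retrieve('schema_state.db') : {};

for my $file (glob 'schema/*.sql') {              # one file per table
    open my $fh, '<', $file or die "$file: $!";
    my @lines = grep { /\S/ } <$fh>;
    my $done  = $state->{$file} || 0;             # lines already executed
    for my $i ($done .. $#lines) {
        $dbh->do($lines[$i]);                     # run only the new lines
        $state->{$file} = $i + 1;
        nstore($state, 'schema_state.db');        # remember progress
    }
}

Rolling back a change then simply means appending further lines that undo it, as described above.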



Wednesday, August 18, 2010

Bulk ingestion protocols: which is best?

Since PhilPapers' launch, a number of publishers have contacted us asking for some means to submit their content in bulk. We didn't have this facility until recently, partly because it took us a long time to decide what kind of system to implement. Here I report on how we've arrived at the current system.

We started off by trying to find out what kind of system is used by other big consumers of article-level metadata. The idea was that many publishers would already support it. We found out that the biggest consumer of metadata of all, PubMed, uses an XML schema defined by the NLM. We also found out that Atypon, a company which hosts many journals on behalf of publishers, supports this system too (one of the first publishers to contact us is hosted with Atypon). So we decided to implement NLM-style feeds. Under this system, we run an FTP server that periodically receives zips of NLM XML files.

This system works OK, but it turns out not to be as widely supported as we had expected. At the same time, the use of PRISM tags in RSS feeds has become fairly common since we began this project. Most publishers now have RSS feeds with detailed PRISM tags. So that's the system we now recommend. It's the easiest option both for us and for publishers. See the details here.
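For the curious, extracting the PRISM fields from a publisher's feed is straightforward. Here is a minimal sketch with XML::LibXML; it assumes an RSS 2.0 feed and the PRISM 2.0 basic namespace, and the feed URL is made up:

use XML::LibXML;

my $doc = XML::LibXML->load_xml(location => 'http://publisher.example.com/journal.rss');
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(prism => 'http://prismstandard.org/namespaces/basic/2.0/');
$xpc->registerNs(dc    => 'http://purl.org/dc/elements/1.1/');

for my $item ($xpc->findnodes('//item')) {
    my $title   = $xpc->findvalue('dc:title | title',      $item);
    my $journal = $xpc->findvalue('prism:publicationName', $item);
    my $volume  = $xpc->findvalue('prism:volume',          $item);
    my $doi     = $xpc->findvalue('prism:doi',              $item);
    print "$journal $volume: $title ($doi)\n";
}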

Of course, there's the problem that RSS feeds generally don't include historical data. But there is nothing to stop one from creating a historical RSS channel with all back issues for a journal.

Tuesday, August 17, 2010

Selecting only English-language material when harvesting OAI metadata

We're now harvesting thousands of archives for PhilPapers, as described in my earlier post. But we've stumbled on a new problem which I thought I should report on here.

We only want English-language material on PhilPapers, but a lot of archives won't return language data, or will say that an item is in English when it's not (presumably because English is the default and users don't bother to change it). This is a serious obstacle to the automatic aggregation of metadata from OAI archives if you don't want your aggregation to be swamped by material your average user will consider pure noise.

Our solution to this problem has three components. First, we weed out archives whose OpenDOAR records don't declare any English-language content. So we attempt to monitor an archive only if its record says it has material in English, possibly among other languages.

Second, we've found that language attributes tend to be truthful at least when they say that an item is not in English, so we weed out anything that is declared as not being in English.

Finally, we apply an automatic language detection test to the rest of the material. This is where it gets tricky.

We originally tried the Language::Guess module from CPAN, but it's not reliable enough.

We then tried simply checking what percentage of the words in an item's title and description are in the standard English dictionary that comes with aspell (the Unix program), but there are so many neologisms in philosophy that this excluded many English-language papers.

The final solution is to use aspell in this way, but with an enriched dictionary that we compute based on our existing content. Currently we add a word to our dictionary of 'neologisms' just in case it occurs in 10 or more PhilPapers entries which pass a strict English-only test. The strict test is to have fewer than 7% of words outside the standard English dictionary. We need this test because a number of non-English papers have made it into PhilPapers already.
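Here is a rough sketch of the dictionary test with Text::Aspell (loading the custom neologism dictionary is omitted; the thresholds are the ones mentioned above):

use Text::Aspell;

my $speller = Text::Aspell->new;
$speller->set_option('lang', 'en_US');

# Returns true if at most a fraction $max_unknown of the words are unknown to aspell.
sub looks_english {
    my ($text, $max_unknown) = @_;
    my @words = grep { /^[[:alpha:]]+$/ } split /\s+/, $text;
    return 0 unless @words;
    my $unknown = grep { !$speller->check($_) } @words;
    return ($unknown / @words) < $max_unknown;
}

# The strict test used to build the neologism dictionary: fewer than 7%
# unknown words in the title and description.
print looks_english('Dispositional properties and the metaphysics of causation', 0.07)
    ? "english\n" : "not english\n";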

We use aspell because it's supposed to be good at recognizing inflections and the like, and it also works well for providing spelling suggestions (more on this in a later post). However, a note of caution about aspell: all characters in a custom dictionary have to be in the same Unicode block, which means a dictionary can't contain, say, both French and Polish words with the special characters specific to those languages. (This seems like a bug, because the documentation only mentions a same-script limitation.) Our solution is to remove diacritics from everything we put in the dictionary. That works for our purposes but could obviously be a major limitation.

Friday, June 25, 2010

Implementing file upload progress bar for the new PhilPapers

At first glance this seems like a trivial thing to do - you periodically check how much of the file you've received and update the progress bar - so it did not look like much work when I was assigned the task of replacing the hodgepodge of technologies that provided this feature in the previous version of the PhilPapers website. But a quick survey of available open source solutions revealed that there aren't many of them, and none fit our choice of technologies - in particular, none that avoid Flash - so, bad luck, I had to write our own from scratch. This lack of ready-made libraries hints at the difficulty of the seemingly trivial task. I will describe our solution here, not because I think it is optimal, but to start a discussion about what such a solution could look like.

First of all, the common programming tools (like CGI.pm and Mason, which we use here) assume that the page handler receives the whole request as input - and that whole request is not available until after the file has been uploaded. So, for example, 'my $q = CGI->new' will not return until it is too late to measure the upload progress. The solution is to use another page to report the upload progress and to call that page via Ajax from the Javascript code that updates the progress bar. This would work great - but the file is normally uploaded to a temporary file with a random name, and the other script would have no chance of guessing it. We need to generate a random file name in the form page and then pass that name both to the form handler script, so that it saves the data to that file, and in parallel to the Ajax script that checks the size of that file.


To save the data to a specified filename, I used CGI.pm's upload hook (callback) feature:

my $q = CGI->new( \&hook, $fh, undef );   # $fh: handle opened on our pre-generated file
...
sub hook {
    my ($filename, $buffer, $bytes_read, $fh) = @_;
    # Append each chunk as it arrives, so the progress script can watch the file grow.
    print $fh substr($buffer, 0, $bytes_read);
    $fh->flush();
}

This is described in the subsection called "Progress bars for file uploads and avoiding temp files" of the CGI.pm documentation, but saying that it supports progress bars is a bit of a leap: you still cannot get the progress directly from the CGI object on the form landing page, and you still need a separate script to measure the progress. For my solution, all I needed was to pass the target file name to the code saving the data, which could have been easier than writing the callback above. And the callback is still not everything - I still needed a way to pass the generated filename from the form page to that script, and not via form parameters, since, remember, they are not available at that stage. So how can that be done? Simple - as PATH_INFO, which is available in the %ENV hash even before the parameters are parsed by CGI.pm.
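To illustrate, here is a minimal sketch of the Ajax-polled progress script (the details are hypothetical: the upload ID arrives as PATH_INFO, the form page writes to /tmp/upload_<id>, and the JSON shape is just an example):

#!/usr/bin/perl
# Hypothetical progress endpoint, polled via Ajax: /progress.pl/<upload_id>
use strict;
use warnings;

my ($id) = ($ENV{PATH_INFO} || '') =~ m{^/(\w+)$}
    or do { print "Status: 400 Bad Request\r\n\r\n"; exit };

my $file  = "/tmp/upload_$id";        # same name generated by the form page
my $bytes = -e $file ? -s $file : 0;  # bytes written so far by the upload hook

print "Content-Type: application/json\r\n\r\n";
print qq({"bytes":$bytes});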

This is the skeleton of the solution - there are a few more details in the actual implementation - but the code will soon be published as open source, so anyone interested will be able to look them up there.

Thursday, April 8, 2010

Syndicating content from institutional repositories

Institutional repositories powered by software like EPrints and DSpace hold lots of good research content. How do academics potentially interested in this content find it? Obviously, if you're looking for literature on a research topic, say the nature of dispositional properties, you're not going to visit every single IR to do a local search on "dispositional properties". You'll likely do one of three things:
  1. Google "dispositional properties".
  2. Search Google Scholar, Web of Science, or some other generic research index for "dispositional properties".
  3. Search a relevant subject repository for "dispositional properties". In this case PhilPapers would serve you well. Try searching for "dispositional properties" on PP. Not only do you get tons of highly relevant content, but you get a link to a bibliography on dispositional properties maintained by an expert on the subject.
I know based on my logs that a lot of people go for option (3) a lot of the time. PP will generally give you more relevant results than options (1) and (2), and it will often turn up papers that are not even indexed by Scholar or Web of Science & co. That's what it's designed to do.

But my aim here is not to brag about PP's virtues. It's to point out that the content held by IRs can only reach its intended audience if it's properly indexed, and it can only have maximum impact if it is properly indexed by subject repositories (SRs) like PP. But indexing IRs along subject-specific lines is not trivial.

The challenge

How can SRs extract only the content that's relevant to their subject from IRs? Some suggestions:
  1. Ask content producers (academics) to submit their metadata to SRs.
  2. Harvest all content from all IRs, and filter out irrelevant content based on keywords or more advanced document classification techniques.
  3. Crowd-source a list of IRs and OAI sets relevant to your subject.
Of course, all three solutions can be pursued in parallel, but it's worth asking how much content one can expect to get through each of them.

I wouldn't expect much content from (1). Most academics barely have enough energy to submit their papers to IRs; don't ask them to submit them to relevant SRs on top of that. What they could do is submit only to SRs, but this wouldn't be a solution to the problem of syndicating IRs. (Incidentally, we've seen a drastic reduction in philosophy content coming from the few other archives we've been harvesting since PP's launch, so I think the transition to submitting directly to SRs is happening, but there's still a lot of content in IRs we'd like to dig up.)

(2) can't be relied on in isolation. It's a major document classification challenge. However, a lot of papers in IRs have relevant dc:subject attributes. We've tried filtering content from IRs based on the occurrence of the word 'philosophy' in dc:subject attributes. We think that this classification method has a high precision and decent (but far from perfect) recall. I'll post some data shortly.

(3) is also far from perfect, if only because many IRs don't have relevant OAI sets. But a number of them do (for example, when a philosophy department has its own set). When that's the case, that's definitely the technique to use.

Our solution

In the end, the solution we've settled on for PP is a combination of all three approaches. We've created an "archive registry" which we've initially populated with all the archives listed in OpenDOAR. The registry is not public at the time of writing, but it soon will be. We're currently in the process of filtering out the clearly irrelevant archives (most are).

When we harvest an archive, we first check if it has sets matching certain keywords (our "subject keywords"). If it does, we harvest all and only the contents of these sets. If it doesn't, we harvest everything and filter by matching dc:subject attributes against our subject keywords.
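In outline, the per-archive logic looks something like this (a sketch only: list_sets, harvest_set, harvest_all and store_record are placeholders for our OAI-PMH plumbing, and the keyword list is illustrative):

my @subject_keywords = ('philosophy', 'logic', 'ethics');   # illustrative

sub harvest_archive {
    my ($base_url) = @_;

    # Sets whose names match one of our subject keywords.
    my @matching = grep {
        my $set = $_;
        grep { $set->{name} =~ /\b\Q$_\E\b/i } @subject_keywords;
    } list_sets($base_url);

    if (@matching) {
        # Harvest all and only the contents of the matching sets.
        harvest_set($base_url, $_->{spec}) for @matching;
    }
    else {
        # Harvest everything, keeping only records whose dc:subject
        # values match one of our subject keywords.
        for my $record (harvest_all($base_url)) {
            my $subjects = join ' ', @{ $record->{subjects} || [] };
            next unless grep { $subjects =~ /\b\Q$_\E\b/i } @subject_keywords;
            store_record($record);
        }
    }
}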

The public registry comes in handy when a user notices that their IR-hosted papers aren't making it into PP as they would expect. In this case they can add or edit their IR as appropriate (or they can get their IR manager to do it). Users have the ability to force either set-harvesting or keyword-filtering behaviour. They can also set up a more complex hybrid process combining set- and keyword-filtering. We will bring this system online as soon as possible and adjust as we see fit based on the feedback we get.

Residual issues

There are other challenges aside from subject-specific harvesting.

One is broken archives. A lot of archives are broken, or have broken content. The most common problem is invalid HTML embedded in the metadata. This can be a serious problem, because a single item can block the harvesting process. If we see that too many archives are stuck like this we'll have to start sending out notices to the admin addresses.

Another issue is that archive metadata are generally incomplete as far as their machine-readable content goes. Typically, either we can't extract publication details or they are missing. That's a major limitation for a site like PP which makes extensive use of such information. I can't think of an easy fix to that. The solution would be good citation extraction tech. I'm not aware of any sufficiently reliable, publicly available algorithm.


A definition of "subject repository"

A discussion on the JISC conference site prompted me to suggest a definition of "subject repository" which I hope might be useful in discussing the place of services like PhilPapers:
A subject repository is a repository of research outputs (and possibly metadata about such outputs) whose primary mission is to give end users access to all and only the research content available in a given subject.
For the purposes of this definition, a subject can be as vast as the mathematical sciences or as narrow as any dissertation topic.

Of course, some services we're inclined to label as subject repositories impose subject-independent constraints on their content (e.g. open access availability). That's why I inserted the "primary mission" qualification--this allows that some SRs have auxiliary requirements which impose some limitations on what content they aggregate.

Thursday, April 1, 2010

Automatic categorization of citations using Perl

I'm glad to announce that we've made tremendous progress on the automatic categorization of PhilPapers entries. We have developed a categorizer which can assign area-level categories to 40% of entries with 98% accuracy. That is, 40% of entries are categorized and 60% are left uncategorized; of all the area-level categories assigned, 98% are correct. We doubt it would be possible to get better performance given all the noise in our training set (it's not like the humans who did the initial categorization are infallible and always aiming for accurate categorization).

We plan to put the categorizer into production shortly after Easter. Half the currently uncategorized entries (about 80,000) should be assigned areas, and half the new items coming into the index should automatically be assigned areas in the future. We also hope that we will be able to increase recall (the number of items categorized) while keeping precision above 95% as our training set improves. Our training set currently has about 120,000 entries and 30 categories.

The algorithm

We've combined two classification techniques: Naive Bayes and Support Vector Machines. An entry gets assigned to a category just in case our separate Bayesian and SVM categorizers assign it to that category. This tends to bias classification towards false negatives, which is exactly what we want in our case. In all our tests, Naive Bayes performs at least slightly better than SVM, but not well enough precision-wise for our purposes.
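In code, the intersection is simple. Here is a stripped-down sketch along the lines of the AI::Categorizer learner interface (our feature weighting, χ² selection and the Naive Bayes probability threshold are omitted; $ks is a knowledge set built from the training data):

use AI::Categorizer::Learner::NaiveBayes;
use AI::Categorizer::Learner::SVM;

my $nb  = AI::Categorizer::Learner::NaiveBayes->new;
my $svm = AI::Categorizer::Learner::SVM->new;
$nb->train(knowledge_set => $ks);
$svm->train(knowledge_set => $ks);

# Assign a category only when both learners agree on it.
sub assign_categories {
    my ($doc) = @_;     # an AI::Categorizer::Document
    my %nb_cats = map { $_ => 1 } $nb->categorize($doc)->categories;
    return grep { $nb_cats{$_} } $svm->categorize($doc)->categories;
}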

Our combined classifier works on author names, titles, descriptors, and abstracts, as well as editor names and collection titles for collections and journal names for journal articles.

We've found that feature selection based on a test of independence instead of mere frequency significantly improves performance. We currently use the χ² test for this purpose. We retain only words whose χ² value exceeds a minimum we've determined through trial and error (6000 at the moment).

We've also found that certain feature transformations are essential to attain optimal performance. We transform author names so that "John Smith" becomes a single word: "xxJohnxxSmithxx". This distinguishes names from other orthographically identical words and ensures that classification is based on full-name matches rather than mere first-name or last-name matches. We also transform journal names in the same way.
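The transformation itself is trivial; a sketch (the xx convention is just as described above):

# "John Smith" -> "xxJohnxxSmithxx", so a full name behaves as a single feature.
sub name_feature {
    my ($name) = @_;
    return 'xx' . join('xx', split /\s+/, $name) . 'xx';
}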

Implementation

We use the AI::Categorizer framework available on CPAN. This framework allowed us to test a range of classification algorithms, feature selection methods, and normalization techniques. While we're glad we decided to use the framework, we've had to fix a few bugs in it and we've often been frustrated by the lack of documentation. It's not very polished, and some things don't work as one would expect. Hopefully that's going to improve in future releases (it's only at version 0.09 after all; we're going to submit our patches to the maintainer). We're going to release our customized classes soon; in the meantime, anyone interested can email me.

One definite virtue of AI::Categorizer is that the SVM and Naive Bayes categorizers that come with it have excellent default settings. We've played with many different setting combinations (described below), but the defaults turned out best, aside from the changes described above and some custom feature weighting we've introduced. Our feature weighting is as follows:

title: 1.5
abstract: 1
journal: 0.5
authors: 0.5
collection title: 0.5
editors: 0.5

While these settings seemed to improve performance, the difference was not always clearly significant.

For stop words, we use the list provided by the Lingua::StopWords package.

We currently use a probability threshold of 0.9 for Naive Bayes.

χ² feature selection is done with a patched version of AI::Categorizer::FeatureSelector::ChiSquare.

What else we've tried

We've tried the b, t, n, f, p, and c feature-weighting flags provided by AI::Categorizer and none helped, either individually or in combination. Some bugs with some of the flags resulted in divisions by zero. We've patched that.

We've tried the polynomial and radial kernels for SVM, but the default (linear) works best.

We've tried the KNN and DecisionTree classification algorithms, but neither managed to complete the training with more than 10% of our training set (we ran out of memory on a 2GB VM). Either the algorithms or the AI::Categorizer implementations are not sufficiently efficient. Their precision was also worse than Naive Bayes and SVM with small training sets.

We've tried purifying our training set by removing from it all items which could not be successfully categorized even when they were in the training set (normally, we test with different entries than those in the training set). Surprisingly, this didn't help precision.

We've tried to use the Rainbow classifier, but we couldn't get it to compile. Development seems to have been abandoned in 1998.

In a previous project we had tried the Naive Bayes algorithm with every conceivable heuristic and feature selection / normalization trick. We could never achieve the performance we're getting now by combining SVM and Naive Bayes.

Sunday, March 28, 2010

Facilitating access to subscription-based resources -- Athens, Shibboleth, OpenURL, Reverse Proxy, etc

Australian and North American PhilPapers users have often told me how much they like its off-campus access feature. Off-campus access on PhilPapers works like this. First, users configure their institutional reverse proxy in their account. From that point on, PhilPapers will point them to the proxy for subscription-based articles. This will be transparent to the users. They might be asked for their credentials by the proxy, but aside from that they will have direct access to the papers. At the time of writing, 513 users have configured a proxy.

This system works well for our North American and Australian users because reverse proxies are widely used in North America and Australia. Not so in Europe. The UK, in particular, has its own subscription management systems called Athens and Shibboleth. Athens is the older system but remains more standard as far as I can tell, and the two systems are identical from the end-user's point of view. These systems don't use proxies. When a user wants to access an article through Athens, they first have to go to the publisher's page. Then they have to click a link from that page to obtain an Athens login page specific to the publisher. Often they will have to look up their institution on Athens' site before logging in. After having logged in, they will be taken back to the publisher's site and authorized to download papers from the publisher. As far as I can tell, users have to repeat this process for every publisher they visit, though the credentials ought to be remembered between visits to the same publisher. We don't have Athens accounts at SAS so I couldn't test this, but I can't see how it could be otherwise --- surely publisher sites do not constantly query Athens' servers, whether through embedded Javascript or backend connections, to check if their visitors' IPs have already been authenticated (I can confirm there is no such Javascript on Wiley's site). So there is a lot of clicking around for UK users when they're browsing papers from various sources.

I would like to facilitate things for our UK users. In theory, we could forward our users to an Athens URL containing the final URL of the resource the user wants to access and the user's institution, i.e. something like this:

http://www.athensams.net?u=FINAL_URL&inst=University%20of%20XYZ

Athens could easily look up the relevant publisher based on the submitted URL, then a) forward the user directly to FINAL_URL if they are already authenticated or b) present the user with a login page before forwarding them. This would make Athens as easy to use as a reverse proxy for PhilPapers users.

But this was not to be. The actual login URLs for Athens look something like this:

https://auth.athensams.net/?ath_returl=FINAL_URL&ath_dspid=WILEY

These are the URLs one finds on publishers' sites (Wiley in this case). As far as I can tell, there is no parameter to specify the institution, and the ath_dspid code is mandatory. When I tried changing the latter to 'OTHER', the login failed (I found someone with an Athens account who tested this for me).

So, to point our users to appropriate Athens login pages we would have to know the publisher codes. And of course they are not published by Eduserv, the company that runs Athens. Eduserv has refused to help us out in any way -- all we've ever heard from them is 'buy our product'.

Another company which hasn't been very cooperative is ExLibris, the company that makes the widely used SFX OpenURL resolver. Here too we'd need some data to make use of the service, because each institution has its own OpenURL resolver. I've repeatedly asked ExLibris for a list of institutional SFX server URLs (I'm sure they have that), and I never got any response at all.

Fortunately for us, WorldCat has come up with a solution to our problem (and to that of everyone else trying to streamline access to subscription-based resources): they've created a big database of institutional OpenURL resolvers which anyone can query for free (so long as it's for non-profit use). Given an institution's name, WorldCat's 'Registry' will tell us what resolver(s) it uses. The resolver will then forward the user to an appropriate access point for the item, including (as far as I can tell) Athens access pages as appropriate. Thanks to WorldCat, we can give our users the benefits of all the secret data the likes of ExLibris and Eduserv hold, at almost no expense to ourselves.
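Once we know a user's resolver base URL, building an OpenURL for an article is easy. Here is a rough sketch in the standard KEV format of OpenURL 1.0 (Z39.88-2004); the resolver URL and article details are made up:

use URI;

my $resolver = 'http://sfx.example.ac.uk/sfx_local';   # from the WorldCat Registry lookup

my $link = URI->new($resolver);
$link->query_form(
    url_ver      => 'Z39.88-2004',
    rft_val_fmt  => 'info:ofi/fmt:kev:mtx:journal',
    'rft.atitle' => 'What Are Dispositional Properties?',
    'rft.jtitle' => 'Journal of Hypothetical Metaphysics',
    'rft.volume' => '12',
    'rft.spage'  => '345',
    'rft.date'   => '2010',
);
print "$link\n";    # send the user here; the resolver picks the access route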

Friday, March 26, 2010

A new backup script with rsync, versioning and rotation

Here is a backup script that does what every backup script should do:
  1. It uses rsync (or equivalent) for transfers, to speed things up
  2. It keeps rotating versions of backups
  3. It doesn't duplicate unchanged files on the backup host (it uses hard links)
  4. It can be configured from a text file
One minor issue with this script, however, is that it runs from the machine being backed up instead of from the backup host (ideally, an intruder who compromises the machine shouldn't also gain access to its backups). But I made the script for use with EVBackup's service. While I trust them not to tamper with my encrypted data, I wouldn't trust them with the keys to PhilPapers's production server.

The script should run on any *nix machine with rsync. It should work with any *nix backup host as well, but you will need shell access to the backup host. You will also need to configure the user running the script for password-less SSH login to the backup host.
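The core trick behind points 1-3 is rsync's --link-dest option, which hard-links unchanged files to the previous snapshot. The sketch below illustrates that trick only (it is not the script itself; the host, paths, retention and the GNU-coreutils-based pruning are made-up examples):

#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

my $host   = 'backup@backup.example.com';
my $remote = '/backups/philpapers';
my @dirs   = ('/etc', '/var/www');
my $keep   = 7;                                     # snapshots to retain

my $stamp = strftime('%Y-%m-%d', localtime);

# Unchanged files become hard links into the previous snapshot ('latest').
system('rsync', '-a', '--delete', '--link-dest=../latest',
       @dirs, "$host:$remote/$stamp/") == 0
    or die "rsync failed: $?";

# Re-point 'latest' and prune the oldest snapshots (GNU head/xargs assumed).
system('ssh', $host,
       "cd $remote && ln -sfn $stamp latest && " .
       "ls -d 20* | head -n -$keep | xargs -r rm -rf") == 0
    or die "rotation failed: $?";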