Documentation
Overview
Small utility to get journal info from https://index.pkp.sfu.ca, which currently includes 1264043 records indexed from 4960 publications.
https://pkp.sfu.ca/2015/10/23/introducing-the-pkp-index/
Usage:
$ make
$ ./pkpindex
Output will be JSON lines, one journal per line (the OAI endpoint is guessed):
{ "name": "Scholarly and Research Communication", "homepage": "http://src-online.ca/index.php/src", "oai": "http://src-online.ca/index.php/src/oai" }
{ "name": "Stream: Culture/Politics/Technology", "homepage": "http://journals.sfu.ca/stream/index.php/stream", "oai": "http://journals.sfu.ca/stream/index.php/stream/oai" }
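In both sample records the guessed endpoint is simply the homepage with `/oai` appended. A minimal sketch of that guess and of the JSON lines output might look as follows; the `Journal` type and the `guessOAI` function are hypothetical names chosen for illustration, not necessarily what pkpindex uses internally.

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Journal mirrors the fields of the sample output (hypothetical type name).
type Journal struct {
	Name     string `json:"name"`
	Homepage string `json:"homepage"`
	OAI      string `json:"oai"`
}

// guessOAI appends "/oai" to the homepage, as seen in the sample records.
// This is only a guess; the endpoint is not verified against the server.
func guessOAI(homepage string) string {
	return strings.TrimRight(homepage, "/") + "/oai"
}

func main() {
	// json.Encoder.Encode writes one JSON value followed by a newline,
	// which yields the JSON lines format.
	enc := json.NewEncoder(os.Stdout)
	homepage := "http://src-online.ca/index.php/src"
	j := Journal{
		Name:     "Scholarly and Research Communication",
		Homepage: homepage,
		OAI:      guessOAI(homepage),
	}
	if err := enc.Encode(j); err != nil {
		os.Exit(1)
	}
}
```

Since the endpoint is only guessed, a consumer may want to probe it before harvesting.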
Additional ideas:
* check whether a journal site is part of a bigger installation (move one path element up and pattern match).
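One way to approach this idea, as a rough sketch: strip the last path element from the journal homepage and test what remains against a pattern. The helper names `parentURL` and `looksLikeInstallationRoot` are hypothetical, and the `/index.php` suffix check is an assumption taken from the sample homepages above, not a confirmed rule.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// parentURL returns raw with its last path element removed.
// URLs without a path are returned unchanged.
func parentURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	p := strings.TrimRight(u.Path, "/")
	if p == "" {
		return raw, nil
	}
	u.Path = path.Dir(p)
	return u.String(), nil
}

// looksLikeInstallationRoot is an assumed pattern: the parent of a journal
// homepage ends in "/index.php" when several journals share one installation.
func looksLikeInstallationRoot(parent string) bool {
	return strings.HasSuffix(parent, "/index.php")
}

func main() {
	parent, err := parentURL("http://journals.sfu.ca/stream/index.php/stream")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(parent, looksLikeInstallationRoot(parent))
}
```

Journals whose homepages share the same parent URL would then be candidates for belonging to one installation.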
Notes.
The index page will not yield a 404 for an invalid page number, so the maximum page needs to be set manually for now. Pagination seems to require more, maybe cookies.
Pagination is broken: a direct link, even with a custom UA and a cookie, always ends up at the first page; probably a bit too much JS.
Instead, fetch each journal info page, e.g. https://index.pkp.sfu.ca/index.php/browse/archiveInfo/5421. Non-existent pages redirect to the homepage, not via an HTTP 3XX status but via a "Refresh" header (http://www.otsukare.info/2015/03/26/refresh-http-header).
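Because the redirect is not a 3XX status, a client has to inspect the response headers itself. The following is a minimal sketch of such a check; `refreshTarget` and `isMissingPage` are hypothetical helpers, and the header value used in `main` is illustrative, not captured from the site. It uses `strings.Cut`, so it assumes Go 1.18 or later.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// refreshTarget extracts the URL from a Refresh header value such as
// "0; url=https://example.org/". It reports false if no URL is present.
func refreshTarget(value string) (string, bool) {
	_, rest, found := strings.Cut(value, ";")
	if !found {
		return "", false
	}
	rest = strings.TrimSpace(rest)
	if len(rest) < 4 || !strings.EqualFold(rest[:4], "url=") {
		return "", false
	}
	target := strings.Trim(strings.TrimSpace(rest[4:]), `"'`)
	return target, target != ""
}

// isMissingPage treats a response carrying a Refresh header with a target
// as a non-existent journal page (an assumption based on the notes above).
func isMissingPage(h http.Header) bool {
	_, ok := refreshTarget(h.Get("Refresh"))
	return ok
}

func main() {
	h := http.Header{}
	h.Set("Refresh", "0; url=https://index.pkp.sfu.ca") // illustrative value
	fmt.Println(isMissingPage(h))             // true
	fmt.Println(isMissingPage(http.Header{})) // false
}
```

In a crawler, `isMissingPage(resp.Header)` could be called on each `archiveInfo` response to skip IDs that do not exist.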
Certainly, a site with character.
<div id="content">
<h3>Revista de Psicologia del Deporte</h3>
<p class="archiveLinks"><a href="https://index.pkp.sfu.ca/index.php/browse/index/37">Browse Records</a> |
<a href="http://rpd-online.com" target="_blank">Journal Website</a> |
<a href="http://rpd-online.com/issue/current" target="_blank">Current Issue</a> |
<a href="http://rpd-online.com/issue/archive" target="_blank">All Issues</a></p>
Let's use https://github.com/ericchiang/pup to extract the journal name and website:
cat page-000281.html | pup 'h3 text{}'
# Journal of Modern Materials
cat page-000281.html | pup 'p.archiveLinks > a:nth-child(2) attr{href}'
# https://journals.aijr.in/index.php/jmm/index