pkpindex

command
v0.3.16
Published: Nov 6, 2024 License: GPL-3.0 Imports: 15 Imported by: 0

README

PKP Journal info

NOTE: As of 2022-01-01 the PKP Index has been retired and is no longer maintained.

https://pkp.sfu.ca/2021/10/05/pkp-index-retiring-as-of-january-1-2022/

$ make
$ ./pkpindex

Output will be JSON lines (the OAI endpoint is guessed):

{
  "name": "Scholarly and Research Communication",
  "homepage": "http://src-online.ca/index.php/src",
  "oai": "http://src-online.ca/index.php/src/oai"
}
{
  "name": "Stream: Culture/Politics/Technology",
  "homepage": "http://journals.sfu.ca/stream/index.php/stream",
  "oai": "http://journals.sfu.ca/stream/index.php/stream/oai"
}
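
The record shape above can be sketched in Go as follows. This is a minimal illustration, not the tool's actual code: the `Journal` type and the `guessOAI` helper are hypothetical names, and the rule "append /oai to the homepage" is only inferred from the two sample records.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Journal mirrors the three fields shown in the sample output.
// The type name is hypothetical.
type Journal struct {
	Name     string `json:"name"`
	Homepage string `json:"homepage"`
	OAI      string `json:"oai"`
}

// guessOAI appends "/oai" to the homepage URL. This rule is an assumption
// inferred from the sample records; the endpoint is not verified.
func guessOAI(homepage string) string {
	return strings.TrimRight(homepage, "/") + "/oai"
}

func main() {
	j := Journal{
		Name:     "Scholarly and Research Communication",
		Homepage: "http://src-online.ca/index.php/src",
	}
	j.OAI = guessOAI(j.Homepage)

	// One compact JSON document per line.
	b, err := json.Marshal(j)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Since the endpoint is only guessed, a consumer would probably want to probe it before relying on it.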

Additional ideas:

  • Check whether a journal site is part of a larger installation (move one path element up and pattern match).
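
One way this idea might look in Go is sketched below. The helper names `parentURL` and `looksLikeInstallation` are hypothetical, and the pattern used (a path ending in `/index.php`) is only a guess based on the sample homepages above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// parentURL returns raw with its last path element removed.
// URLs without a path are returned unchanged.
func parentURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	p := strings.TrimSuffix(u.Path, "/")
	if p == "" {
		return u.String(), nil
	}
	u.Path = path.Dir(p)
	return u.String(), nil
}

// looksLikeInstallation is a hypothetical heuristic: a URL ending in
// "/index.php" may be the root of a multi-journal installation.
func looksLikeInstallation(raw string) bool {
	return strings.HasSuffix(raw, "/index.php")
}

func main() {
	homepage := "http://journals.sfu.ca/stream/index.php/stream"
	parent, err := parentURL(homepage)
	if err != nil {
		panic(err)
	}
	fmt.Println(parent, looksLikeInstallation(parent))
}
```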

Documentation

Overview

Small utility to fetch journal info from https://index.pkp.sfu.ca, which at the time of writing included 1264043 records indexed from 4960 publications.

https://pkp.sfu.ca/2015/10/23/introducing-the-pkp-index/


Notes:

The index page does not return a 404 for an invalid page number, so the maximum page needs to be set manually for now. Pagination seems to require more than a plain request, maybe cookies.

Pagination is broken: a direct link, even with a custom User-Agent and a cookie, always ends up at the first page; probably a bit too much JS.

Instead, fetch each journal info page, e.g. https://index.pkp.sfu.ca/index.php/browse/archiveInfo/5421. Non-existent pages redirect to the homepage, not via an HTTP 3XX status but via a "Refresh" header (http://www.otsukare.info/2015/03/26/refresh-http-header).

Certainly, a site with character.

<div id="content">
<h3>Revista de Psicologia del Deporte</h3>
<p class="archiveLinks"><a href="https://index.pkp.sfu.ca/index.php/browse/index/37">Browse Records</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href="http://rpd-online.com" target="_blank">Journal Website</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href="http://rpd-online.com/issue/current" target="_blank">Current Issue</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href="http://rpd-online.com/issue/archive" target="_blank">All Issues</a></p>

Let's use pup (https://github.com/ericchiang/pup):

$ cat page-000281.html | pup 'h3 text{}' # Journal of Modern Materials
$ cat page-000281.html | pup 'p.archiveLinks > a:nth-child(2) attr{href}' # https://journals.aijr.in/index.php/jmm/index
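
The same two fields could also be pulled out in Go. The sketch below deliberately swaps the CSS selectors for plain regular expressions from the standard library, which is cruder than a real HTML parser and only assumes markup shaped like the excerpt above. The `extract` function is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Patterns assume the markup of the journal info page excerpt.
	h3Re    = regexp.MustCompile(`(?s)<h3>(.*?)</h3>`)
	linksRe = regexp.MustCompile(`(?s)<p class="archiveLinks">(.*?)</p>`)
	hrefRe  = regexp.MustCompile(`<a href="([^"]*)"`)
)

// extract returns the journal name (first h3) and the href of the second
// link inside p.archiveLinks, which is the "Journal Website" link in the
// excerpt. ok is false if the page does not match the expected shape.
func extract(page string) (name, homepage string, ok bool) {
	h := h3Re.FindStringSubmatch(page)
	p := linksRe.FindStringSubmatch(page)
	if h == nil || p == nil {
		return "", "", false
	}
	hrefs := hrefRe.FindAllStringSubmatch(p[1], -1)
	if len(hrefs) < 2 {
		return "", "", false
	}
	return strings.TrimSpace(h[1]), hrefs[1][1], true
}

func main() {
	page := `<div id="content"> <h3>Revista de Psicologia del Deporte</h3> ` +
		`<p class="archiveLinks"><a href="https://index.pkp.sfu.ca/index.php/browse/index/37">Browse Records</a>` +
		`&nbsp;&nbsp;|&nbsp;&nbsp;<a href="http://rpd-online.com" target="_blank">Journal Website</a></p>`
	name, homepage, ok := extract(page)
	fmt.Println(name, homepage, ok)
}
```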
