largecrawl

command
v0.2.44
Published: Jul 12, 2023 License: GPL-3.0 Imports: 16 Imported by: 0

README

Part of OAI harvest

Archive item: https://archive.org/details/oai_harvest_20230615

Metadata was harvested in H1 2023 with the metha OAI harvester. The starting point was a list of about 70K OAI-PMH endpoints. URLs were extracted from the raw (XML) metadata.

The URLs have gone through some QA, but the seed list may still contain broken data.

Uploaded seed lists

  • 2023-06-15-metha-url-reduced-no-id-domains.txt
  • 2023-06-15-metha-url-reduced-no-id.txt.zst
  • 2023-06-15-metha-url-reduced-pdf-only-domains.txt
  • 2023-06-15-metha-url-reduced-pdf-only.txt.zst
$ ia upload oai_harvest_20230615 -m collection:ia_biblio_metadata -m mediatype:data -m date:2023-06-15 -m title:"OAI-PMH harvest (2023-06-15)" 2023-06-15-metha-url-reduced-no-id-domains.txt 2023-06-15-metha-url-reduced-no-id.txt 2023-06-15-metha-url-reduced-pdf-only-domains.txt 2023-06-15-metha-url-reduced-pdf-only.txt

Seedlist options

  1. PDF list for a direct crawl; about 4M urls, about 50K domains
  2. full list; 83M urls; 370K domains

Suggesting (1), with @martin trimming down (2) and running it himself.

Previous Crawl Notes

  • OAI-PMH-CRAWL-2020-06: "Seedlist size: 31,773,874 Petabox Collection Size: 31.4 TByte PDF Hits: TODO New Unique PDFs: 8,007,344"
  • OAI-PMH-CRAWL-2022-10: "Seedlist size: 3,662,864 Petabox Collection Size: 5.49 TByte PDF Hits: 2.58 million New Unique PDFs: 1.61 million"

Reporting

No extra reporting; just as mediatype=data: CRL and compressed logs.

Example: OAI-PMH-CRAWL-2020-06

Collections

Each crawl in a separate collection, under ia_pub_crawls.

Documentation

Overview

genjson extracts info from a stream of OAI DC XML records, e.g.

<record><header>...

<dc:language>eng</dc:language>
<dc:relation>https://ejournal.uksw.edu/ijpna/article/view/1351/731</dc:relation>
<dc:rights xml:lang="en-US">Copyright (c) 2017 Indonesian Journal of Physics and Nuclear Applications</dc:rights>
<dc:rights xml:lang="en-US">http://creativecommons.org/licenses/by-nc-nd/4.0</dc:rights>
</oai_dc:dc>
</metadata><about></about></record>

<record> ...

Run like:

$ sed -e 's@<record>@\n\n\n<record>@' oai.data | python genrecords.py | go run genjson.go

Note that the input as a whole does not need to be valid XML; instead, each record element needs to be followed by two empty lines (lines containing only a newline), which act as the record separator.

Outputs a converted JSON lines stream to stdout. The JSON will contain the parsed ISSN, URL and DOI values. Example output:

{
  "oai": "oai:ejournal.uksw.edu:article/1673",
  "datestamp": "2018-05-16T01:48:17Z",
  "sets": [
    "ijpna:ART",
    "driver"
  ],
  "creators": [
    "Sardjono, Yohannes",
    "Kusminarto, Kusminarto",
    "Wusko, Ikna Urwatul"
  ],
  "doi": [
    "10.24246/ijpna.v3i1.29-35"
  ],
  "formats": [
    "application/pdf"
  ],
  "issn": [
    "2550-0570",
    "2549-046X"
  ],
  "ids": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673",
    "10.24246/ijpna.v3i1.29-35"
  ],
  "languages": [
    "eng"
  ],
  "urls": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673"
  ],
  "publishers": [
    "Fakultas Sains dan Matematika Universitas Kristen Satya Wacana"
  ],
  "relations": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673/894"
  ],
  "rights": [
    "Copyright (c) 2018 Indonesian Journal of Physics and Nuclear Applications",
    "http://creativecommons.org/licenses/by/4.0"
  ],
  "titles": [
    "The Optimization of Collimator Material and In Vivo Testing Dosimetry of Boron Neutron Capture Therapy (BNCT) on Radial Piercing Beam Port Kartini Nuclear Reactor by Monte Carlo N-Particle Extended (MCNPX) Simulation Method"
  ],
  "types": [
    "info:eu-repo/semantics/article",
    "info:eu-repo/semantics/publishedVersion",
    "Peer-reviewed Article"
  ]
}
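The "doi" and "issn" fields in the example are values picked out of free-text Dublin Core fields. How genjson does this is not shown here; the sketch below is one plausible approach using regular expressions plus the standard ISSN check digit (weights 8 down to 2, modulo 11). The patterns and the names findDOIs, findISSNs and validISSN are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Loose DOI pattern: "10.", a registrant code, a slash, then a suffix.
	doiPattern = regexp.MustCompile(`10\.\d{4,9}/[^\s"<>]+`)
	// ISSN shape: four digits, a hyphen, three digits and a digit or X.
	issnPattern = regexp.MustCompile(`\b\d{4}-\d{3}[\dXx]\b`)
)

// validISSN verifies the ISSN check digit.
func validISSN(s string) bool {
	s = strings.ToUpper(strings.ReplaceAll(s, "-", ""))
	if len(s) != 8 {
		return false
	}
	sum := 0
	for i := 0; i < 7; i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
		sum += int(s[i]-'0') * (8 - i)
	}
	check := (11 - sum%11) % 11
	if check == 10 {
		return s[7] == 'X'
	}
	return s[7] == byte('0'+check)
}

// findDOIs returns every DOI-like substring found in values.
func findDOIs(values []string) []string {
	var out []string
	for _, v := range values {
		out = append(out, doiPattern.FindAllString(v, -1)...)
	}
	return out
}

// findISSNs returns every ISSN-shaped substring with a valid check digit.
func findISSNs(values []string) []string {
	var out []string
	for _, v := range values {
		for _, m := range issnPattern.FindAllString(v, -1) {
			if validISSN(m) {
				out = append(out, strings.ToUpper(m))
			}
		}
	}
	return out
}

func main() {
	fields := []string{
		"https://ejournal.uksw.edu/ijpna/article/view/1673",
		"10.24246/ijpna.v3i1.29-35",
		"Indonesian Journal of Physics and Nuclear Applications; 2550-0570; 2549-046X",
	}
	fmt.Println(findDOIs(fields))
	fmt.Println(findISSNs(fields))
}
```

Checking the check digit filters out strings that merely look like an ISSN, such as year ranges; the third entry in fields is an invented example of a source-style field, not a line from the harvest.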

Note: it takes about 5 hours to generate a list of

Package xmlstream implements a lightweight XML scanner on top of encoding/xml. It keeps the flexibility of xml.Unmarshal while allowing the parsing of huge XML files.
