largecrawl

command

Version: v0.2.44 (latest). Published: Jul 12, 2023. License: GPL-3.0. Imports: 16. Imported by: 0.

Repository

github.com/miku/metha

README

Part of OAI harvest

Archive item: https://archive.org/details/oai_harvest_20230615

Metadata was harvested in H1 2023 with the metha OAI harvester. The starting point was a list of about 70K OAI-PMH endpoints. URLs have been extracted from the raw (XML) metadata.

The URLs have gone through some QA, but the seed list may still contain broken data.

Uploaded seed lists

2023-06-15-metha-url-reduced-no-id-domains.txt
2023-06-15-metha-url-reduced-no-id.txt.zst
2023-06-15-metha-url-reduced-pdf-only-domains.txt
2023-06-15-metha-url-reduced-pdf-only.txt.zst

$ ia upload oai_harvest_20230615 -m collection:ia_biblio_metadata -m mediatype:data -m date:2023-06-15 -m title:"OAI-PMH harvest (2023-06-15)" 2023-06-15-metha-url-reduced-no-id-domains.txt 2023-06-15-metha-url-reduced-no-id.txt 2023-06-15-metha-url-reduced-pdf-only-domains.txt 2023-06-15-metha-url-reduced-pdf-only.txt

Seedlist options

(1) PDF list for a direct crawl; about 4M URLs, about 50K domains
(2) Full list; 83M URLs, 370K domains

Suggesting (1), with @martin trimming down (2) and running it himself.
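
How a URL list relates to its list of domains can be sketched as follows. This is only an illustration, not the procedure used for the uploaded files: it assumes one URL per line on stdin, and `hostOf` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"os"
	"sort"
	"strings"
)

// hostOf returns the lowercased host of rawURL, or "" if the line does not
// parse as a URL with a host (hypothetical helper, not part of metha).
func hostOf(rawURL string) string {
	u, err := url.Parse(strings.TrimSpace(rawURL))
	if err != nil {
		return ""
	}
	return strings.ToLower(u.Hostname())
}

func main() {
	// Only the set of hosts is kept in memory, never the URLs themselves.
	seen := make(map[string]struct{})
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long lines
	for scanner.Scan() {
		if h := hostOf(scanner.Text()); h != "" {
			seen[h] = struct{}{}
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	hosts := make([]string, 0, len(seen))
	for h := range seen {
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)
	for _, h := range hosts {
		fmt.Println(h)
	}
}
```

Saved under an assumed file name such as domains.go, it could be fed from a compressed list with `zstd -dc LIST.txt.zst | go run domains.go`. Lines without a host, which is one kind of broken seed, are skipped silently.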

Previous Crawl Notes

OAI-PMH-CRAWL-2020-06: "Seedlist size: 31,773,874 Petabox Collection Size: 31.4 TByte PDF Hits: TODO New Unique PDFs: 8,007,344"
OAI-PMH-CRAWL-2022-10: "Seedlist size: 3,662,864 Petabox Collection Size: 5.49 TByte PDF Hits: 2.58 million New Unique PDFs: 1.61 million"

Reporting

No extra reporting; just as mediatype=data: CRL and compressed logs.

Example: OAI-PMH-CRAWL-2020-06

Collections

Each crawl in a separate collection, under ia_pub_crawls.

Documentation

Overview

genjson extracts info from a stream of OAI DC XML records, e.g.

<record><header>...

<dc:language>eng</dc:language>
<dc:relation>https://ejournal.uksw.edu/ijpna/article/view/1351/731</dc:relation>
<dc:rights xml:lang="en-US">Copyright (c) 2017 Indonesian Journal of Physics and Nuclear Applications</dc:rights>
<dc:rights xml:lang="en-US">http://creativecommons.org/licenses/by-nc-nd/4.0</dc:rights>
</oai_dc:dc>
</metadata><about></about></record>

<record> ...

Run like:

$ sed -e 's@<record>@\n\n\n<record>@' oai.data | python genrecords.py | go run genjson.go

Note that the input as a whole does not need to be valid XML; each record element only needs to be followed by two empty lines, which act as the record separator.
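
One way to consume such a separator-delimited stream in Go is a bufio.Scanner with a custom split function. The following is a minimal sketch under the assumption that the separator is exactly three consecutive newline characters, as produced by the sed call above; it is not taken from genjson.go, and the names `splitOnBlankLines`, `eachRecord` and `splitString` are made up for illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
)

// sep is the assumed record separator: a line end followed by two empty lines.
var sep = []byte("\n\n\n")

// splitOnBlankLines is a bufio.SplitFunc that yields everything up to the
// next separator as one token.
func splitOnBlankLines(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if i := bytes.Index(data, sep); i >= 0 {
		return i + len(sep), data[:i], nil
	}
	if atEOF && len(data) > 0 {
		return len(data), data, nil // trailing record without separator
	}
	return 0, nil, nil // request more data (or stop at EOF)
}

// eachRecord calls fn for every non-empty chunk found in r.
func eachRecord(r io.Reader, fn func(rec string)) error {
	scanner := bufio.NewScanner(r)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<26) // allow records up to 64 MiB
	scanner.Split(splitOnBlankLines)
	for scanner.Scan() {
		if rec := strings.TrimSpace(scanner.Text()); rec != "" {
			fn(rec)
		}
	}
	return scanner.Err()
}

// splitString is a small convenience wrapper for in-memory input.
func splitString(s string) []string {
	var out []string
	_ = eachRecord(strings.NewReader(s), func(rec string) { out = append(out, rec) })
	return out
}

func main() {
	n := 0
	if err := eachRecord(os.Stdin, func(string) { n++ }); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```

Because each chunk is handled on its own, a malformed record only affects that chunk and not the rest of the stream.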

Outputs a converted JSON lines stream to stdout. The JSON will contain the parsed ISSN, URL and DOI. Example output:

{
  "oai": "oai:ejournal.uksw.edu:article/1673",
  "datestamp": "2018-05-16T01:48:17Z",
  "sets": [
    "ijpna:ART",
    "driver"
  ],
  "creators": [
    "Sardjono, Yohannes",
    "Kusminarto, Kusminarto",
    "Wusko, Ikna Urwatul"
  ],
  "doi": [
    "10.24246/ijpna.v3i1.29-35"
  ],
  "formats": [
    "application/pdf"
  ],
  "issn": [
    "2550-0570",
    "2549-046X"
  ],
  "ids": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673",
    "10.24246/ijpna.v3i1.29-35"
  ],
  "languages": [
    "eng"
  ],
  "urls": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673"
  ],
  "publishers": [
    "Fakultas Sains dan Matematika Universitas Kristen Satya Wacana"
  ],
  "relations": [
    "https://ejournal.uksw.edu/ijpna/article/view/1673/894"
  ],
  "rights": [
    "Copyright (c) 2018 Indonesian Journal of Physics and Nuclear Applications",
    "http://creativecommons.org/licenses/by/4.0"
  ],
  "titles": [
    "The Optimization of Collimator Material and In Vivo Testing Dosimetry of Boron Neutron Capture Therapy (BNCT) on Radial Piercing Beam Port Kartini Nuclear Reactor by Monte Carlo N-Particle Extended (MCNPX) Simulation Method"
  ],
  "types": [
    "info:eu-repo/semantics/article",
    "info:eu-repo/semantics/publishedVersion",
    "Peer-reviewed Article"
  ]
}
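
A downstream consumer of this output could look like the sketch below. The struct is derived only from the field names visible in the example output; the type name `Record` and the helpers `parseRecord` and `hasFormat` are assumptions for illustration and need not match the types used in genjson.go.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// Record mirrors the keys seen in the example output (assumed layout).
type Record struct {
	OAI        string   `json:"oai"`
	Datestamp  string   `json:"datestamp"`
	Sets       []string `json:"sets"`
	Creators   []string `json:"creators"`
	DOI        []string `json:"doi"`
	Formats    []string `json:"formats"`
	ISSN       []string `json:"issn"`
	IDs        []string `json:"ids"`
	Languages  []string `json:"languages"`
	URLs       []string `json:"urls"`
	Publishers []string `json:"publishers"`
	Relations  []string `json:"relations"`
	Rights     []string `json:"rights"`
	Titles     []string `json:"titles"`
	Types      []string `json:"types"`
}

// parseRecord decodes a single JSON document into a Record.
func parseRecord(data string) (Record, error) {
	var rec Record
	err := json.Unmarshal([]byte(data), &rec)
	return rec, err
}

// hasFormat reports whether rec lists the given media type.
func hasFormat(rec Record, format string) bool {
	for _, f := range rec.Formats {
		if f == format {
			return true
		}
	}
	return false
}

func main() {
	// json.Decoder reads consecutive JSON values, so it accepts both
	// one-record-per-line and pretty-printed input.
	dec := json.NewDecoder(bufio.NewReader(os.Stdin))
	w := bufio.NewWriter(os.Stdout)
	for {
		var rec Record
		err := dec.Decode(&rec)
		if err == io.EOF {
			break
		}
		if err != nil {
			w.Flush()
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if !hasFormat(rec, "application/pdf") {
			continue
		}
		for _, u := range rec.URLs {
			fmt.Fprintf(w, "%s\t%s\n", rec.OAI, u)
		}
	}
	w.Flush()
}
```

The main function prints the OAI identifier and each URL of records that declare application/pdf as a format; any other filter could be substituted.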

Note: it takes about 5 hours to generate a list of

Package xmlstream implements a lightweight XML scanner on top of encoding/xml. It keeps the flexibility of xml.Unmarshal while allowing the parsing of huge XML files.
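
The general idea of scanning a large XML file element by element can be sketched with the standard library alone. The code below does not use the xmlstream API (which is not shown here); it is a plain encoding/xml illustration of the same technique. The struct covers only the identifier and datestamp children of the OAI-PMH record header, and the names `record`, `eachRecord` and `identifiers` are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"strings"
)

// record holds the small part of an OAI-PMH record used in this sketch.
type record struct {
	Header struct {
		Identifier string `xml:"identifier"`
		Datestamp  string `xml:"datestamp"`
	} `xml:"header"`
}

// eachRecord walks the token stream and unmarshals one <record> element at a
// time, so memory use is bounded by the largest record, not the file size.
func eachRecord(r io.Reader, fn func(rec record)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "record" {
			continue
		}
		var rec record
		if err := dec.DecodeElement(&rec, &se); err != nil {
			return err
		}
		fn(rec)
	}
}

// identifiers collects the header identifiers from an in-memory document.
func identifiers(input string) ([]string, error) {
	var ids []string
	err := eachRecord(strings.NewReader(input), func(rec record) {
		ids = append(ids, rec.Header.Identifier)
	})
	return ids, err
}

func main() {
	err := eachRecord(os.Stdin, func(rec record) {
		fmt.Printf("%s\t%s\n", rec.Header.Identifier, rec.Header.Datestamp)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Unlike the blank-line approach used for genjson, this variant requires the input to be well-formed XML up to the point where it is read.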
