span-report

command
v0.2.2
Published: Nov 29, 2024 License: GPL-3.0 Imports: 13 Imported by: 0

Documentation

Overview

span-report creates data subsets from an index for reporting.

Example report: For a given collection, find all ISSNs it contains and the number of publications per ISSN in a given interval, e.g. per month.

Collection X

    |      | 01/18 | 02/18 | 03/18 | 04/18 | ...
    |------|-------|-------|-------|-------|----
    | ISSN | 10    | 12    | 0     | 12    | ...
    | ISSN | 8     | 9     | 19    | 1     | ...
    | ISSN | 1     | 2     | 0     | 1     | ...

These results are exported as CSV, TSV or similar, so they can be passed forward into Excel, Pandas or other tools with visualization capabilities.
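The report shape described above can be sketched in a few lines of Go. This is a minimal illustration, not span-report's actual implementation: the Record type, the function names and the sample ISSNs are all hypothetical, and the records are assumed to be already fetched from the index.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"sort"
	"strconv"
	"time"
)

// Record is a hypothetical, minimal index document: one publication
// with an ISSN and a publication date.
type Record struct {
	ISSN string
	Date time.Time
}

// mustDate parses a YYYY-MM-DD date and panics on malformed input.
func mustDate(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

// CountByMonth returns counts[issn][month], with month formatted as MM/YY.
func CountByMonth(records []Record) map[string]map[string]int {
	counts := make(map[string]map[string]int)
	for _, r := range records {
		if counts[r.ISSN] == nil {
			counts[r.ISSN] = make(map[string]int)
		}
		counts[r.ISSN][r.Date.Format("01/06")]++
	}
	return counts
}

// Rows builds a header row followed by one row per ISSN (sorted).
// Months without publications are reported as 0.
func Rows(counts map[string]map[string]int, months []string) [][]string {
	issns := make([]string, 0, len(counts))
	for issn := range counts {
		issns = append(issns, issn)
	}
	sort.Strings(issns)
	rows := [][]string{append([]string{"issn"}, months...)}
	for _, issn := range issns {
		row := []string{issn}
		for _, m := range months {
			row = append(row, strconv.Itoa(counts[issn][m]))
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	// Made-up sample data for illustration only.
	records := []Record{
		{"1234-5678", mustDate("2018-01-15")},
		{"1234-5678", mustDate("2018-01-20")},
		{"1234-5678", mustDate("2018-03-02")},
		{"8765-4321", mustDate("2018-02-11")},
	}
	months := []string{"01/18", "02/18", "03/18"}
	w := csv.NewWriter(os.Stdout)
	// WriteAll flushes the writer before returning.
	if err := w.WriteAll(Rows(CountByMonth(records), months)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The column list is passed in explicitly so that months with no publications still appear as zero-filled columns, as in the table above.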

Example of an expensive pivot query (1000 ISSNs per collection; there might be more, e.g. Springer has over 4000):

    q=*:*&wt=json&indent=true&q=*:*&facet.pivot=source_id,mega_collection,issn&facet.pivot=mega_collection,issn&facet=true&facet.field=source_id&facet.limit=1000&rows=0&wt=json&indent=true&facet.pivot.mincount=1
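For illustration, the same parameters can be assembled with Go's net/url package, which avoids hand-editing a long query string. This is a sketch only: the function name PivotParams and the host and core in the URL are hypothetical, and the parameters that the query above lists twice (q, wt, indent) are set once here.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// PivotParams returns the parameters of the pivot facet query shown above.
// limit is the value used for facet.limit.
func PivotParams(limit int) url.Values {
	v := url.Values{}
	v.Set("q", "*:*")
	v.Set("wt", "json")
	v.Set("indent", "true")
	v.Set("rows", "0") // only facet counts are needed, no documents
	v.Set("facet", "true")
	v.Set("facet.field", "source_id")
	v.Set("facet.limit", strconv.Itoa(limit))
	v.Set("facet.pivot.mincount", "1")
	// facet.pivot is given twice, so Add is used instead of Set.
	v.Add("facet.pivot", "source_id,mega_collection,issn")
	v.Add("facet.pivot", "mega_collection,issn")
	return v
}

func main() {
	// Hypothetical SOLR endpoint; adjust host and core name as needed.
	base := "http://localhost:8983/solr/biblio/select"
	// Encode percent-escapes the values and sorts the keys.
	fmt.Println(base + "?" + PivotParams(1000).Encode())
}
```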

Observations, given a SOLR under load:

  • Faceting on (sid, c, issn) with facet.limit 10000 takes 5 min and yields a 42M response.
  • Faceting on (sid, c, issn, date) with facet.limit 10000 takes about 2 hours and yields a 1.3G response.
  • The "fast" report type runs about 240k queries in 2h40m and could be optimized a bit; unlike the facet queries, it is not capped by a limit such as facet.limit.
  • The "faster" report type runs 240k queries in 103m35.106s. There is a bit more headroom in batching ISSNs to reduce local overhead.
  • A 32-core SOLR can reach a load of 30; span-report will use up to 24 CPUs while SOLR will use mostly six. Throughput is around 300 qps, which still seems slow. There are actually two queries per ISSN (numFound and date faceting; the numFound query is fluff). A first run (-w 32 -bs 100) took about 50 min.
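The combination of parallel workers and ISSN batching mentioned in the list above can be sketched as a plain worker pool. This is not span-report's code: Batches, Run and the query callback are hypothetical names, the callback stands in for the real SOLR request, and reading -w as the worker count and -bs as the batch size is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Batches splits items into consecutive slices of at most size elements.
func Batches(items []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

// Run fans the batches out to the given number of workers and returns
// the sum of the values returned by query.
func Run(issns []string, workers, batchSize int, query func(batch []string) int) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan []string)
	results := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range jobs {
				results <- query(batch)
			}
		}()
	}
	go func() {
		for _, b := range Batches(issns, batchSize) {
			jobs <- b
		}
		close(jobs)
		wg.Wait() // all workers are done once jobs is drained
		close(results)
	}()
	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	issns := make([]string, 250)
	for i := range issns {
		issns[i] = fmt.Sprintf("issn-%04d", i) // placeholder identifiers
	}
	// Stand-in for one request per batch: report how many ISSNs it covered.
	processed := Run(issns, 32, 100, func(batch []string) int { return len(batch) })
	fmt.Println(processed)
}
```

Sending whole batches through the channel, rather than single ISSNs, keeps channel and scheduling overhead per ISSN low, which is the kind of local overhead the batching is meant to reduce.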
