span-report

command

Version: v0.1.369 (latest) | Published: Jun 3, 2024 | License: GPL-3.0 | Imports: 13 | Imported by: 0

Repository

github.com/miku/span

Documentation

Overview

span-report creates data subsets from an index for reporting.

Example report: for a given collection, find all ISSNs it contains and the number of publications per ISSN in a given interval, e.g. per month.

Collection X

    |      | 01/18 | 02/18 | 03/18 | 04/18 | ...
    |------|-------|-------|-------|-------|----
    | ISSN | 10    | 12    | 0     | 12    | ...
    | ISSN | 8     | 9     | 19    | 1     | ...
    | ISSN | 1     | 2     | 0     | 1     | ...

These results are exported as CSV, TSV or similar, so they can be passed forward into Excel, Pandas or other tools with visualization capabilities.
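The export step can be pictured with a minimal Go sketch. It is not taken from span-report itself: the `Counts` type, the function names `WriteCSV` and `CSVString`, and the ISSN values are hypothetical placeholders, and the sketch only assumes that per-ISSN, per-interval counts have already been collected.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"sort"
	"strconv"
	"strings"
)

// Counts maps an ISSN to interval labels (e.g. "01/18") and the number of
// publications found in that interval. Hypothetical type for illustration.
type Counts map[string]map[string]int

// WriteCSV writes a header row followed by one row per ISSN (sorted), with
// one column per interval. Intervals without data are written as 0.
func WriteCSV(w io.Writer, intervals []string, counts Counts) error {
	cw := csv.NewWriter(w)
	if err := cw.Write(append([]string{"issn"}, intervals...)); err != nil {
		return err
	}
	issns := make([]string, 0, len(counts))
	for issn := range counts {
		issns = append(issns, issn)
	}
	sort.Strings(issns)
	for _, issn := range issns {
		row := []string{issn}
		for _, iv := range intervals {
			// A missing key yields the zero value, i.e. a count of 0.
			row = append(row, strconv.Itoa(counts[issn][iv]))
		}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

// CSVString is a small convenience wrapper returning the CSV as a string.
func CSVString(intervals []string, counts Counts) (string, error) {
	var sb strings.Builder
	err := WriteCSV(&sb, intervals, counts)
	return sb.String(), err
}

func main() {
	// Placeholder ISSNs and counts, not real data.
	counts := Counts{
		"1234-5678": {"01/18": 10, "02/18": 12, "04/18": 12},
		"0000-0001": {"01/18": 8, "02/18": 9, "03/18": 19, "04/18": 1},
	}
	out, err := CSVString([]string{"01/18", "02/18", "03/18", "04/18"}, counts)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

For TSV output, setting `cw.Comma = '\t'` on the `csv.Writer` before writing should be enough.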

Example of an expensive pivot query (1000 ISSNs per collection; there may be more, e.g. Springer has over 4000):

    q=*:*&wt=json&indent=true&q=*:*&facet.pivot=source_id,mega_collection,issn&facet.pivot=mega_collection,issn&facet=true&facet.field=source_id&facet.limit=1000&rows=0&wt=json&indent=true&facet.pivot.mincount=1
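As an illustration, the same parameters could be assembled with Go's `net/url` package instead of by hand. This is a sketch only: the function name `PivotQuery` and the base URL in `main` are assumptions, not part of span-report, and duplicated parameters from the query above are set once.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// PivotQuery builds the query parameters for the pivot facet request shown
// above. The limit corresponds to facet.limit.
func PivotQuery(limit int) url.Values {
	v := url.Values{}
	v.Set("q", "*:*")
	v.Set("wt", "json")
	v.Set("indent", "true")
	v.Set("rows", "0") // only facet counts are needed, no documents
	v.Set("facet", "true")
	v.Set("facet.field", "source_id")
	v.Set("facet.limit", strconv.Itoa(limit))
	v.Set("facet.pivot.mincount", "1")
	// facet.pivot is given twice, so Add is used to keep both values.
	v.Add("facet.pivot", "source_id,mega_collection,issn")
	v.Add("facet.pivot", "mega_collection,issn")
	return v
}

func main() {
	// Hypothetical SOLR endpoint; adjust host and core name as needed.
	base := "http://localhost:8983/solr/biblio/select"
	fmt.Println(base + "?" + PivotQuery(1000).Encode())
}
```

`Encode` percent-escapes the values and sorts the keys, so the printed URL will look different from the hand-written query while carrying the same parameters.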

Observations, given a SOLR under load:

Facet (sid, c, issn) with facet.limit 10000 takes about 5 minutes and yields a 42M response.
Facet (sid, c, issn, date) with facet.limit 10000 takes about 2 hours and yields a 1.3G response.
The "fast" report type runs about 240k queries in 2h40m and could be optimized a bit; unlike faceting, it is not capped by a limit such as facet.limit.
The "faster" report type runs 240k queries in 103m35.106s. There is a bit more headroom: batching ISSNs would reduce local overhead.
A 32-core SOLR can get to a load of 30; span-report will use up to 24 CPUs, while SOLR will use mostly six. This yields around 300 qps, which still seems slow. There are actually two queries per ISSN (numFound and date faceting; the numFound query is fluff). A first run (-w 32 -bs 100) took about 50 minutes.
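The batching idea can be sketched as a fixed pool of workers consuming batches of ISSNs. This is a generic illustration, not the span-report implementation: the names `Batches` and `Process` are hypothetical, the callback stands in for the actual SOLR request, and it is only an assumption that -w and -bs correspond to a worker count and a batch size.

```go
package main

import (
	"fmt"
	"sync"
)

// Batches splits items into consecutive slices of at most size elements.
func Batches(items []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

// Process distributes batches over a fixed number of workers and sums the
// values returned by fn. fn is a placeholder for one request per batch.
func Process(items []string, workers, batchSize int, fn func(batch []string) int) int {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan []string)
	results := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range queue {
				results <- fn(b)
			}
		}()
	}
	// Feed the queue, then signal the workers that no more work is coming.
	go func() {
		for _, b := range Batches(items, batchSize) {
			queue <- b
		}
		close(queue)
	}()
	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()
	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	// Placeholder identifiers instead of real ISSNs.
	items := make([]string, 250)
	for i := range items {
		items[i] = fmt.Sprintf("issn-%03d", i)
	}
	handled := Process(items, 32, 100, func(b []string) int { return len(b) })
	fmt.Println("items handled:", handled)
}
```

With a batch size of 100, each worker would issue one request per 100 ISSNs instead of one per ISSN, which is where the reduction in local overhead would come from.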

Source Files

main.go